TRANSCRIPT
Memories: Advanced Caches
Z. Jerry Shi, Assistant Professor of Computer Science and Engineering
University of Connecticut
* Slides adapted from Blumrich&Gschwind/ELE475’03, Peh/ELE475’*
Speed gap
Q. How do architects address this gap?
A. Put smaller, faster "cache" memories between CPU and DRAM. Create a "memory hierarchy".
μProc: 60% / year (2x in 1.5 years)
DRAM: 7% / year (2x in 10 yrs)
1977: DRAM faster than microprocessors
Apple ][ (1977): CPU: 1000 ns, DRAM: 400 ns
[Photo: Steve Wozniak and Steve Jobs]
Levels of the Memory Hierarchy
Capacity / access time / cost, staging transfer unit, and who manages the transfer, from the upper level (smaller, faster) down to the lower level (larger, slower):

• CPU Registers: 100s bytes, <10s ns; staging unit: instr. operands (1-8 bytes), managed by program/compiler
• Cache: K bytes, 10-100 ns, 1-0.1 cents/bit; staging unit: blocks (8-128 bytes), managed by the cache controller
• Main Memory: M bytes, 200 ns - 500 ns, 0.0001-0.00001 cents/bit; staging unit: pages (512-4K bytes), managed by the OS
• Disk: G bytes, 10 ms (10,000,000 ns), 10^-5 - 10^-6 cents/bit; staging unit: files (Mbytes), managed by user/operator
• Tape: infinite capacity, sec-min access time, 10^-8 cents/bit
What is a cache?
• Small, fast storage used to improve the average access time to slow memory
• Holds a subset of the instructions and data used by a program
• Exploits spatial and temporal locality
– Temporal locality: if an item is referenced, it will tend to be referenced again soon
– Spatial locality: if an item is referenced, nearby items will tend to be referenced soon
• For the last 15 years, HW has relied on locality for speed
– Locality is a property of programs which is exploited in machine design
Memory Hierarchy: Apple iMac G5
iMac G5 (1.6 GHz):

            Reg     L1 Inst  L1 Data  L2      DRAM    Disk
Size        1K      64K      32K      512K    256M    80G
Latency     1 cyc   3 cyc    3 cyc    11 cyc  88 cyc  10^7 cyc
            0.6 ns  1.9 ns   1.9 ns   6.9 ns  55 ns   12 ms

Managed by: compiler (registers); hardware (L1 and L2 caches); OS, hardware, and application (DRAM, disk).

Goal: the illusion of large, fast, cheap memory. Let programs address a memory space that scales to the disk size, at a speed that is usually as fast as register access.
Memory hierarchy specs
Type       Capacity  Latency  Bandwidth
Register   <2KB      1ns      150GB/s
L1 Cache   <64KB     4ns      50GB/s
L2 Cache   <8MB      10ns     25GB/s
L3 Cache   <64MB     20ns     10GB/s
Memory     <4GB      50ns     4GB/s
Disk       >1GB      10ms     10MB/s
Programs with locality cache well ...
[Figure: memory address (one dot per access) plotted against time. Horizontal bands show temporal locality, vertical clusters show spatial locality, and scattered regions show bad locality behavior. From Donald J. Hatfield, Jeanette Gerald: Program Restructuring for Virtual Memory. IBM Systems Journal 10(3): 168-192 (1971).]
Caches are everywhere
• In computer architecture, almost everything is a cache!
– Registers: "a cache" on variables, software managed
– First-level cache: a cache on the second-level cache
– Second-level cache: a cache on memory
– Memory: a cache on disk (virtual memory)
– TLB: a cache on the page table
– Branch prediction: a cache on prediction information
– Branch Target Buffer: a cache on the PC of instructions following branches
Terminology
• Higher levels in the hierarchy are closer to the CPU
• At each level, a block is the minimum amount of data whose presence is checked at that level
– Blocks are also often called lines
– Block size is always a power of 2
– Block sizes of the lower levels are usually fixed multiples of the block sizes of the higher levels
• A reference is said to hit at a particular level if the block is found at that level
– Hit Rate (HR) = Hits / References
– Miss Rate (MR) = Misses / References
• The access time of a hit is the hit time
Terminology (2)
• The additional time to fetch a block from lower levels on a miss is called the miss penalty
– Time to replace a block in the upper level
– Hit time << miss penalty (500 instructions on the 21264!)
• Miss penalty = access time + transfer time
– Access time is a function of latency: "time for memory to get the request and process it"
– Transfer time is a function of bandwidth and block size: "time for all of the block to get back"
Access vs. Transfer Time
[Diagram: cache connected to memory. A = time for the request to reach memory, B = memory lookup, C = transfer of the block back to the cache, D = transfer to the processor core.]
• Miss penalty = A + B + C
– Access time = A + B
– Transfer time to the upper level = C
• Transfer time to the processor core = D
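The access/transfer split above can be written out directly. A minimal sketch; the function name and the numbers in the comment are illustrative, not from the slides:

```c
/* Miss penalty = access time + transfer time.
 * access_ns models A + B (latency); the block transfer models C
 * (a function of bandwidth and block size). Parameter values used
 * with this sketch are assumptions, not slide data. */
double miss_penalty_ns(double access_ns, double bytes_per_ns, int block_bytes)
{
    double transfer_ns = block_bytes / bytes_per_ns;  /* bandwidth-bound part */
    return access_ns + transfer_ns;                   /* latency + transfer */
}
```

For example, a 50 ns access with 4 bytes/ns of bandwidth and 64-byte blocks gives a 66 ns miss penalty; doubling the block size adds only transfer time, not access time.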
Latency vs. Bandwidth
• "There is an old network saying: Bandwidth problems can be cured with money. Latency problems are harder because the speed of light is fixed --- you can't bribe God."
» David Clark, MIT
• Recall pipelining: we improved instruction bandwidth, but actually made latency worse
• With memory, for bandwidth we can use wider buses, larger block sizes, and new DRAM organizations (RAMBUS)
• Latency is still much harder:
– Have to get the request from cache to memory (off chip)
– Have to do the memory lookup
– Have to have bits travel on wires back on-chip to the cache
Caching Basics: Four Questions
• Block placement policy?
– Where does a block go when it is fetched?
• Block identification policy?
– How do we find a block in the cache?
• Block replacement policy?
– When fetching a block into a full cache, how do we decide what other block gets kicked out?
• Write strategy?
– Does any of this differ for reads vs. writes?
Cache organization
• Direct mapped
– A block has only one place it can appear in the cache
• Fully associative
– A block can be placed anywhere in the cache
• n-way set associative
– A block can be placed in a restricted set of places
Simple Cache Example
• Direct-mapped cache: each block has a specific spot in the cache; if it is in the cache, there is only one place it can be
• Makes the block placement, ID, and replacement policies easy
– Block placement: Where does a block go when fetched?
• It goes to its one assigned spot
– Block ID: How do we find a block in the cache?
• We look at the tag for that one assigned spot
– Block replacement: What gets kicked out?
• Whatever is in its assigned spot
– Write strategy:
• "Allocate on write"; more on write strategies later
Fetching data from cache
[Diagram: each cache line holds state bits (valid, dirty, etc.), a tag, and a data block. The memory address is split into tag, block address (index), and offset; the stored tag is compared (=?) against the address tag, and a MUX selects the requested word on a hit.]

Example (32-bit memory address):
Line size = 2^4 = 16 bytes (4 offset bits)
Number of lines = 2^9 = 512 (9 index bits)
Tag = 32 - 9 - 4 = 19 bits
Number of bits in a line = 16 * 8 + 19 + 2 = 149 (data + tag + 2 state bits)
Number of bits in the cache = 149 * 512 = 76288
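The bit budget above can be reproduced mechanically. A sketch under the slide's parameters (32-bit address, 512 lines, 16-byte blocks, 2 state bits); the helper names are my own:

```c
/* Integer log2 for power-of-2 sizes (illustrative helper). */
int log2i(int x) { int n = 0; while (x > 1) { x >>= 1; n++; } return n; }

/* Bits per cache line = data bits + tag bits + state bits. */
int bits_per_line(int addr_bits, int lines, int block_bytes, int state_bits)
{
    int offset_bits = log2i(block_bytes);                   /* 4 in the example */
    int index_bits  = log2i(lines);                         /* 9 in the example */
    int tag_bits    = addr_bits - index_bits - offset_bits; /* 19 in the example */
    return block_bytes * 8 + tag_bits + state_bits;         /* 128 + 19 + 2 = 149 */
}
```

Multiplying by the line count gives the total storage: 149 bits * 512 lines = 76288 bits, as on the slide.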
Direct Mapped Caches
• Partition the memory address into three regions
– C = cache size
– M = number of bits in the memory address
– B = block size
• Tag (M - log2 C bits) | Index (log2 (C/B) bits) | Block Offset (log2 B bits)
[Diagram: the index selects one entry in the tag memory and the data memory; the tag comparison (=) yields hit/miss, and the data memory supplies the data.]
Set Associative Caches
• Partition the memory address into three regions
– C = cache size, B = block size, A = number of members (ways) per set
– A block can be placed in any of the A locations in a set
• Tag (M - log2 (C/A) bits) | Index (log2 (C/(B*A)) bits) | Block Offset (log2 B bits)
[Diagram: the index selects one set in the tag and data memories of each way (way0, way1, ...); all ways are compared (=) in parallel, the comparisons are ORed into hit/miss, and the matching way's data is selected.]
Cache size example
• 32-bit machine
• 64KB, 32B block, 2-way set associative
• Compute the total size of the tag array:
– 64KB / 32B blocks => 2K blocks
– 2K blocks / 2-way set-associative => 1K sets
– 32B blocks => 5 offset bits
– 1K sets => 10 index bits
– 32-bit address - 5 offset bits - 10 index bits = 17 tag bits
– 17 tag bits * 2K blocks => 34Kb => 4.25KB
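The same arithmetic, written as code so each step above maps to one line. A sketch; the function name is my own:

```c
/* Integer log2 for power-of-2 sizes (illustrative helper). */
int log2i(int x) { int n = 0; while (x > 1) { x >>= 1; n++; } return n; }

/* Total tag-array size, in bits, for a set-associative cache. */
int tag_array_bits(int addr_bits, int cache_bytes, int block_bytes, int ways)
{
    int blocks      = cache_bytes / block_bytes;            /* 2K blocks */
    int sets        = blocks / ways;                        /* 1K sets */
    int offset_bits = log2i(block_bytes);                   /* 5 bits */
    int index_bits  = log2i(sets);                          /* 10 bits */
    int tag_bits    = addr_bits - index_bits - offset_bits; /* 17 bits */
    return tag_bits * blocks;                               /* one tag per block */
}
```

For the slide's 64KB, 32B-block, 2-way cache this returns 34816 bits = 34Kb = 4.25KB.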
Summary of set associativity
• Direct mapped
– One place in the cache, one comparator, no muxes
• Set-associative caches
– Restricted set of places: n-way set associativity
– Number of comparators = number of blocks per set
– n:1 MUX
• Fully associative
– Anywhere in the cache
– Number of comparators = number of blocks in the cache
– n:1 MUX needed
Direct-mapped Cache Example
• 8 locations in our cache
• Block size = 1 byte
• Data reference stream (references to memory locations):
– 0,1,2,3,4,5,6,7,8,9,0,0,0,2,2,2,4,9,1,9,1
• Entry = address mod cache size (the lower-order bits of the address)
• Addresses in binary: 0 = 0000, 1 = 0001, 2 = 0010, 3 = 0011, 4 = 0100, 5 = 0101, 6 = 0110, 7 = 0111, 8 = 1000, 9 = 1001
Cache Examples: Cycles 1 - 5
0,1,2,3,4,5,6,7,8,9,0,0,0,2,2,2,4,9,1,9,1
References 0, 1, 2, 3, 4: Miss, Miss, Miss, Miss, Miss. The cold cache fills entries 0-4.

Cache Examples: Cycles 6-10
0,1,2,3,4,5,6,7,8,9,0,0,0,2,2,2,4,9,1,9,1
References 5, 6, 7, 8, 9: Miss, Miss, Miss, Miss, Miss. 5, 6, and 7 fill entries 5-7; 8 maps to entry 0 (8 mod 8), evicting 0, and 9 maps to entry 1, evicting 1.

Cache Examples: Cycles 11-15
0,1,2,3,4,5,6,7,8,9,0,0,0,2,2,2,4,9,1,9,1
References 0, 0, 0, 2, 2: Miss, Hit (0), Hit (0), Hit (2), Hit (2). The first 0 evicts 8 from entry 0; the remaining references find their blocks resident.

Cache Examples: Cycles 16-21
0,1,2,3,4,5,6,7,8,9,0,0,0,2,2,2,4,9,1,9,1
References 2, 4, 9, 1, 9, 1: Hit (2), Hit (4), Hit (9), Miss, Miss, Miss. 1 and 9 both map to entry 1 and keep evicting each other.
Example summary
• Hit Rate = 7/21 = 1/3
• Miss Rate = 14/21 = 2/3
Now, what if the block size = 2 bytes?
Entry = (Address / Block Size) mod number of cache entries
How about a 2-way set associative cache?
How about a fully associative cache?
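The walkthrough above can be replayed by a few lines of simulation. A sketch of the slide's 8-entry, 1-byte-block direct-mapped cache; the function names are my own:

```c
/* Direct-mapped cache with 8 one-byte entries: entry = address mod 8. */
int count_hits(const int *refs, int n)
{
    int tags[8];
    int valid[8] = {0};
    int hits = 0;
    for (int i = 0; i < n; i++) {
        int entry = refs[i] % 8;                /* lower-order address bits */
        if (valid[entry] && tags[entry] == refs[i]) {
            hits++;                             /* block already resident */
        } else {
            valid[entry] = 1;                   /* miss: fetch and replace */
            tags[entry] = refs[i];
        }
    }
    return hits;
}

/* The slide's reference stream. */
int slide_example_hits(void)
{
    int refs[] = {0,1,2,3,4,5,6,7,8,9,0,0,0,2,2,2,4,9,1,9,1};
    return count_hits(refs, 21);
}
```

Running the stream reproduces the summary: 7 hits out of 21 references, a 1/3 hit rate.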
Cache misses: three C’s
• Compulsory
– The first access to a block can NOT be in the cache
– Also called cold start misses or first reference misses
– These misses occur even in an infinite cache
• Capacity
– The cache is too small to hold all the blocks needed by a program
– Blocks are discarded and later retrieved
– These misses occur even in a fully associative cache
• Conflict
– The cache has sufficient space, but a block cannot be kept because of a conflict with another block
– Also called collision misses or interference misses
Summary: Block Placement + ID
• Placement
– Invariant: a block always goes in exactly one set
– Fully associative: the cache is one set; a block goes anywhere
– Direct mapped: a block goes in exactly one frame
– Set associative: a block goes in one of a few frames
• Identification
– Find the set
– Search the ways in parallel (compare tags, check valid bits)
Block Replacement
• A cache miss requires a replacement
• No decision is needed in a direct-mapped cache
• There is more than one place for a memory block in a set-associative cache
• Replacement strategies
– Optimal: replace the block used furthest ahead in time (requires an oracle)
– Least Recently Used (LRU): optimized for temporal locality
– (Pseudo) random: nearly as good as LRU, simpler
Replacement Strategy
• A randomly chosen block? Easy to implement; how well does it work?
• The Least Recently Used (LRU) block? Appealing, but hard to implement for high associativity.

Miss rates, LRU vs. random replacement:

Assoc:    2-way          4-way          8-way
Size      LRU    Ran     LRU    Ran     LRU    Ran
16 KB     5.2%   5.7%    4.7%   5.3%    4.4%   5.0%
64 KB     1.9%   2.0%    1.5%   1.7%    1.4%   1.5%
256 KB    1.15%  1.17%   1.13%  1.13%   1.12%  1.12%
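One common way to make LRU concrete at low associativity is an age counter per line, which is one possible implementation, sketched here for a single 4-way set (the struct and names are my own):

```c
#define WAYS 4

typedef struct { int tag; int valid; int age; } Line;

/* LRU victim = the line with the largest age. */
int pick_victim_lru(Line set[WAYS])
{
    int victim = 0;
    for (int w = 1; w < WAYS; w++)
        if (set[w].age > set[victim].age)
            victim = w;
    return victim;
}

/* On a hit (or after a fill): the touched way becomes most recent. */
void touch(Line set[WAYS], int way)
{
    for (int w = 0; w < WAYS; w++)
        set[w].age++;        /* every line grows older ... */
    set[way].age = 0;        /* ... except the one just used */
}
```

This per-line state is exactly what makes true LRU expensive at high associativity, which is why the table shows random replacement being adopted despite its slightly higher miss rates.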
Write Policies
• Writes are only about 21% of data cache traffic
• Optimize the cache for reads; do writes "on the side"
– Reads can do the tag check and data read in parallel
– Writes must be sure we are updating the correct data and the correct amount of data (1-8 byte writes)
– Serial process => slow
• What to do on a write hit?
• What to do on a write miss?
Write Hit Policies
• Q1: When to propagate new values to memory?
• Write back: information is only written to the cache
– The next lower level is only updated when the block is evicted (dirty bits say when data has been modified)
– Can write at the speed of the cache
– Caches become temporarily inconsistent with lower levels of the hierarchy
– Uses less memory bandwidth/power (multiple consecutive writes may require only 1 final write)
– Multiple writes within a block can be merged into one write
– Evictions take longer now (must write back)
Write Hit Policies
• Q1: When to propagate new values to memory?
• Write through: information is written to the cache and to the lower-level memory
– Main memory is always "consistent/coherent"
– Easier to implement: no dirty bits
– Reads never result in writes to lower levels (cheaper)
– Higher bandwidth needed
– Write buffers are used to avoid write stalls
Write buffers
• Small chunks of memory to buffer outgoing writes
• The processor can continue as soon as the data is written to the buffer
• Allows overlap of processor execution with the memory update
[Diagram: CPU feeds the cache and the write buffer; both feed the lower levels of memory.]
Write buffers are essential for write-through caches.
Write buffers
• Writes can now be pipelined (rather than serial)
– Check tag + write the store data into the write buffer
– Write data from the write buffer to the L2 cache (tags ok)
• Loads must check the write buffer for pending stores to the same address
• Loads check, in order:
– The write buffer
– The cache
– Subsequent levels of memory
[Diagram: a store op enters a write buffer entry (address | data), which drains into the data cache tag/data arrays.]
Write buffer policies: Performance/Complexity Tradeoffs
[Diagram: stores enter the write buffer on their way to the L2 cache; loads check the buffer.]
• Allow merging of multiple stores? ("coalescing")
• Flush policy: how to flush entries?
• Load servicing policy: what happens when a load occurs to data currently in the write buffer?
Write merging
Non-merging buffer
• Except for multi-word write operations, extra slots are unused
Merging write buffer
• More efficient writes
• Reduces buffer-full stalls
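The merging idea can be sketched in a few lines: a store whose address falls in a block already buffered coalesces into that entry instead of consuming a new slot. The entry count, block size, and names below are illustrative assumptions:

```c
#define ENTRIES 4
#define BLOCK   16   /* bytes covered by one write-buffer entry (assumed) */

typedef struct { unsigned block_addr; int valid; } WBEntry;

/* Returns 1 if the store was buffered (merged or fresh slot),
 * 0 if the buffer is full and the processor must stall. */
int wb_store(WBEntry wb[ENTRIES], unsigned addr)
{
    unsigned block = addr / BLOCK;
    for (int i = 0; i < ENTRIES; i++)
        if (wb[i].valid && wb[i].block_addr == block)
            return 1;                    /* merge with an existing entry */
    for (int i = 0; i < ENTRIES; i++)
        if (!wb[i].valid) {
            wb[i].valid = 1;             /* allocate a fresh entry */
            wb[i].block_addr = block;
            return 1;
        }
    return 0;                            /* buffer full: stall */
}
```

Two stores to the same 16-byte block occupy one entry instead of two, which is exactly why merging reduces buffer-full stalls.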
Write Buffer Flush Policies
• When to flush?
– Aggressive flushing: reduces the chance of stall cycles due to a full write buffer
– Conservative flushing: write merging is more likely (entries stay around longer), which reduces memory traffic
– On-chip L2s favor more aggressive flushing
• What to flush?
– Selective flushing of particular entries?
– Flush everything below a particular entry
– Flush everything
Write Buffer Load Service Policies
• A load op's address matches something in the write buffer
• Possible policies:
– Flush the entire write buffer; service the load from L2
– Flush the write buffer up to and including the relevant address; service from L2
– Flush only the relevant address from the write buffer; service from L2
– Service the load from the write buffer; don't flush
• What if a read miss doesn't hit in the write buffer?
– Give the read's L2 accesses priority over the write L2 accesses
Write misses?
• Write allocate
– A block is allocated on a write miss
– Standard write hit actions follow the block allocation
– Write misses are handled like read misses
– Goes well with write-back
• No-write allocate
– Write misses do not allocate a block
– Only update lower-level memory
– Blocks are only allocated on read misses!
– Goes well with write-through
Summary of Write Policies
Write Policy Hit/Miss Writes to
Write Policy              Hit/Miss  Writes to
WriteBack/Allocate        Both      L1 Cache
WriteBack/NoAllocate      Hit       L1 Cache
WriteBack/NoAllocate      Miss      L2 Cache
WriteThrough/Allocate     Both      Both
WriteThrough/NoAllocate   Hit       Both
WriteThrough/NoAllocate   Miss      L2 Cache
Cache performance
• Average memory access time (AMAT):
AMAT = Hit Time + Miss Rate × Miss Penalty
• Miss rate alone is a misleading indicator of cache performance. Example:
Hit Time = 1 ns, Miss Penalty = 20 ns, Miss Rate = 0.1
AMAT = 1 + 0.1 × 20 = 3 ns
• CPU time = (CPU execution cycles + Memory stall cycles) × Cycle time
• To improve cache performance, we can
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.
• Out-of-order processors can hide some of the miss penalty
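The AMAT formula above is a one-liner in code; the function name is my own, and the test values are the slide's example:

```c
/* AMAT = hit time + miss rate * miss penalty. */
double amat_ns(double hit_ns, double miss_rate, double penalty_ns)
{
    return hit_ns + miss_rate * penalty_ns;
}
```

With the slide's numbers (1 ns hit, 0.1 miss rate, 20 ns penalty) this gives 3 ns, and it makes the three improvement levers concrete: shrink any of the three inputs and AMAT falls.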
1. Reduce Miss Rate via Larger Block Size
[Figure: miss rate (0% to 25%) vs. block size (16, 32, 64, 128, 256 bytes) for cache sizes 1K, 4K, 16K, 64K, and 256K. Miss rate first falls as blocks grow, then rises again for the small caches.]
• Larger blocks reduce compulsory misses (spatial locality) but increase the miss penalty
• Conflict misses? (fewer blocks in the cache)
• Capacity misses? (fewer blocks in the cache)
2: Reduce Miss Rate via Larger Cache
[Figure: absolute miss rate vs. cache size (1KB to 128KB) for 1-way, 2-way, 4-way, and 8-way associativity, decomposed into compulsory, capacity, and conflict misses.]
• Old rule of thumb: 2x size => 25% cut in miss rate
• What does it reduce? Capacity misses (compulsory misses are unaffected)
• Costs: hit time (wire delay) and cost
2. Reduce Miss Rate via Larger Cache
[Figure: the same data as a miss rate distribution (0% to 100%) across the three Cs vs. cache size (1KB to 128KB), for 1-way to 8-way associativity.]
3. Reduce Miss Rate via Higher Associativity
• 2:1 Cache Rule: the miss rate of a direct-mapped cache of size N ≈ the miss rate of a 2-way set-associative cache of size N/2
• Works up to 128K...
• Beware: execution time is the only final measure!
– Will the clock cycle time increase?
– Hill [1988] suggested hit time for 2-way vs. 1-way: external cache +10%, internal +2%
Example: AMAT vs. Miss Rate
• Assume the cycle time vs. direct mapped is 1.10 for 2-way, 1.12 for 4-way, and 1.14 for 8-way

Cache Size (KB)  1-way  2-way  4-way  8-way
1                2.33   2.15   2.07   2.01
2                1.98   1.86   1.76   1.68
4                1.72   1.67   1.61   1.53
8                1.46   1.48   1.47   1.43
16               1.29   1.32   1.32   1.32
32               1.20   1.24   1.25   1.27
64               1.14   1.20   1.21   1.23
128              1.10   1.17   1.18   1.20

(Entries shown in blue on the original slide mark A.M.A.T. not improved by more associativity)
4. Reduce Miss Rate via Way Prediction and Pseudoassociativity
• How to combine the fast hit time of direct mapped with the lower conflict misses of a 2-way SA cache?
• Way prediction: predict which way will hit
– 2-way: 50% chance by guessing
• The data multiplexer is set in advance
– If wrong, check every block
• Example: 21264
– 1 cycle if right
– 3 cycles if wrong
– 85% accuracy on SPEC95
• Pseudoassociativity: on a miss, check the other half of the cache
– Invert the highest bit of the index
Pseudoassociativity
• Divide the cache: on a miss, check the other half of the cache to see if the block is there (invert the most significant index bit); if so, we have a pseudo-hit (slow hit)
[Timeline: Hit Time < Pseudo Hit Time < Miss Penalty]
• Drawback: the CPU pipeline is hard to design if a hit can take 1 or 2 cycles
– Better for caches not tied directly to the processor (L2)
– Used in the MIPS R10000 L2 cache; similar in UltraSPARC
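The "invert the most significant index bit" trick above is just an XOR on the index. A sketch with an assumed 9-bit index (512 sets); the function names and parameters are my own:

```c
#define INDEX_BITS 9   /* assumed: 512-set cache */

/* Primary probe: the usual direct-mapped index. */
unsigned primary_index(unsigned addr, unsigned offset_bits)
{
    return (addr >> offset_bits) & ((1u << INDEX_BITS) - 1);
}

/* Pseudoassociative second probe: flip the highest index bit,
 * landing in the "other half" of the cache. */
unsigned pseudo_index(unsigned index)
{
    return index ^ (1u << (INDEX_BITS - 1));
}
```

Two addresses that conflict at the primary index get a second chance in the opposite half, which is how a direct-mapped cache picks up some of a 2-way cache's conflict-miss reduction.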
5. Reduce Miss Rate by Hardware Prefetching of Instructions & Data
• Example of instruction prefetching
– The Alpha 21064 fetches 2 blocks on a miss: the requested one is placed in the cache and the prefetched block in a "stream buffer"
– If the requested instruction is in the stream buffer, cancel the original request to the cache
– 1 stream buffer catches 15% to 25% of the misses from a 4KB direct-mapped instruction cache (Jouppi 1990); four caught 50%
• Works with data blocks too:
– 1 data stream buffer caught 25% of the misses from a 4KB cache (Jouppi 1990); 4 buffers caught 43%
– 8 streams caught 50% to 70% of the misses from two 64KB, 4-way set-associative caches for scientific programs (Palacharla & Kessler 1994)
• Prefetching relies on having extra memory bandwidth that can be used without penalty
• Cache blocks are a form of prefetching
6. Reduce Miss Rate by Software Prefetching of Data
• Prefetching comes in two flavors:
– Register prefetch loads the value into a register (PA-RISC)
• Must be a correct address and register
• The value is bound at the time the prefetch is performed
– Cache prefetch loads into the cache (MIPS IV, PowerPC, etc.)
• Can be incorrect; frees HW/SW to guess!
• The data value is not bound until the real load operation
• Either of these can be faulting or nonfaulting
– Nonfaulting prefetches turn into no-ops if they would result in an exception
– A normal load is a "faulting register prefetch" instruction
• Issuing prefetch instructions takes time
– Is the cost of prefetch issues < the savings in reduced misses?
– Wider superscalar issue reduces the difficulty of finding issue bandwidth
Software prefetch example
for (i = 0; i < 3; i++)
  for (j = 0; j < 100; j++)
    a[i][j] = b[j][0] * b[j+1][0];

All elements are 8 bytes long.
Cache: 8KB direct mapped, 16-byte blocks, write allocate.
Array a[3][100]: 3*100*8 bytes = 2400B
Array b[101][3]: 3*101*8 bytes = 2424B
What happens at each array a access? a[0][0], a[0][1], a[0][2], a[0][3], …
What happens for array b? b[0][0], b[1][0], b[2][0], … for each i
Code with software prefetching
for (j = 0; j < 100; j++) {    /* prefetch setup */
  prefetch(b[j+7][0]);         /* for 7 iterations later */
  prefetch(a[0][j+7]);
  a[0][j] = b[j][0] * b[j+1][0];
}
for (i = 1; i < 3; i++)        /* use the prefetch */
  for (j = 0; j < 100; j++) {
    prefetch(a[i][j+7]);
    a[i][j] = b[j][0] * b[j+1][0];
  }

Number of misses? 19 misses, vs. 251 without prefetching
Number of prefetch instructions?
7. Reduce Miss Rate by Compiler Optimizations
• McFarling [1989] reduced cache misses by 75% on an 8KB direct-mapped cache, using software!
• Instructions
– Reorder procedures in memory so as to reduce conflict misses
– Profiling to look at conflicts (using tools they developed)
• Data
– Merging arrays: improve spatial locality by a single array of compound elements vs. 2 arrays
– Loop interchange: change the nesting of loops to access data in the order it is stored in memory
– Loop fusion: combine 2 independent loops that have the same looping and some variables overlapped
– Blocking: improve temporal locality by accessing "blocks" of data repeatedly vs. going down whole columns or rows
Padding and Offset Changes
• Cache size of 8KB

/* Before */        /* After */
int a[2048];        int a[2064];
int b[2048];        int b[2064];
int c[2048];        int c[2064];

for (i = 0; i < 2048; i++)
  c[i] = a[i] + b[i];

With cache-sized arrays, a[i], b[i], and c[i] all map to the same cache lines and evict each other every iteration; padding the arrays to 2064 elements offsets their mappings. The programmer should do this!
Merging Arrays
/* Before: 2 sequential arrays */
int val[SIZE];
int key[SIZE];

/* After: 1 array of structures */
struct merge {
  int val;
  int key;
};
struct merge merged_array[SIZE];

Reduce conflicts between val & key; improve spatial locality
Loop Interchange
/* Before */
for (j = 0; j < 100; j = j+1)
  for (i = 0; i < 5000; i = i+1)
    x[i][j] = 2 * x[i][j];

/* After */
for (i = 0; i < 5000; i = i+1)
  for (j = 0; j < 100; j = j+1)
    x[i][j] = 2 * x[i][j];

/* C: row-major order */
Sequential accesses instead of striding through memory every 100 words; improved spatial locality
Loop Fusion
/* Before */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
    a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
    d[i][j] = a[i][j] + c[i][j];

/* After */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1) {
    a[i][j] = 1/b[i][j] * c[i][j];
    d[i][j] = a[i][j] + c[i][j];
  }

2 misses per access to a & c vs. one miss per access; improve temporal locality
Blocking
/* Before */
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1) {
    r = 0;
    for (k = 0; k < N; k = k+1)
      r = r + y[i][k]*z[k][j];
    x[i][j] = r;
  }

• Two inner loops:
– Read all N×N elements of z[]
– Read N elements of 1 row of y[] repeatedly
– Write N elements of 1 row of x[]
• Capacity misses are a function of N & cache size:
– 2N^3 + N^2 words accessed (assuming no conflicts; otherwise …)
• Idea: compute on a B×B submatrix that fits in the cache (reduce capacity misses)
Blocking
/* After */
for (jj = 0; jj < N; jj = jj+B)
  for (kk = 0; kk < N; kk = kk+B)
    for (i = 0; i < N; i = i+1)
      for (j = jj; j < min(jj+B-1,N); j = j+1) {
        r = 0;
        for (k = kk; k < min(kk+B-1,N); k = k+1)
          r = r + y[i][k]*z[k][j];
        x[i][j] = x[i][j] + r;
      }

• B is called the blocking factor
• Capacity misses fall from 2N^3 + N^2 to N^3/B + 2N^2
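A quick way to convince yourself the blocked loop nest is still a correct matrix multiply is to run it on a small case. A self-contained sketch with N=4 and blocking factor 2; note it uses `j < min(jj+B, N)` as the loop bound, since the slide's `min(jj+B-1, N)` form reads as an inclusive bound:

```c
#define N  4
#define BL 2   /* blocking factor, illustrative */

static int min(int a, int b) { return a < b ? a : b; }

/* Blocked x = y * z, accumulating one BxB tile of partial products
 * at a time so the working set fits in the cache. */
void matmul_blocked(int x[N][N], int y[N][N], int z[N][N])
{
    for (int jj = 0; jj < N; jj += BL)
        for (int kk = 0; kk < N; kk += BL)
            for (int i = 0; i < N; i++)
                for (int j = jj; j < min(jj + BL, N); j++) {
                    int r = 0;
                    for (int k = kk; k < min(kk + BL, N); k++)
                        r += y[i][k] * z[k][j];
                    x[i][j] += r;   /* accumulate partial sums per tile */
                }
}
```

Multiplying the identity by a test matrix (x must start zeroed) reproduces the test matrix, tile by tile.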
Reducing Conflict Misses by Blocking
[Figure: conflict miss rate vs. blocking factor (0 to 150) for a direct-mapped cache and a fully associative cache.]
• Conflict misses in non-fully-associative caches vs. blocking size:
– Lam et al. [1991]: a blocking factor of 24 had a fifth the misses of 48, despite both fitting in the cache
Summary of Compiler Optimizations to Reduce Cache Misses (by hand)
[Figure: performance improvement (1x to 3x) from merged arrays, loop interchange, loop fusion, and blocking, on benchmarks including vpenta, btrix, gmty, and mxm (nasa7), tomcatv, spice, cholesky (nasa7), and compress.]
Summary: Miss Rate Reduction
CPU time = IC × (CPI_Execution + (Memory accesses / Instruction) × Miss rate × Miss penalty) × Clock cycle time

• 3 Cs: Compulsory, Capacity, Conflict
1. Reduce misses via larger block size
2. Reduce misses via larger caches
3. Reduce misses via higher associativity
4. Reduce misses via way prediction and pseudoassociativity
5. Reduce misses by HW prefetching of instructions and data
6. Reduce misses by SW prefetching of data
7. Reduce misses by compiler optimizations
Cache Optimization Summary
Technique                          MR   MP   HT   Complexity
Larger block size                  +    –         0
Larger caches                      +         –    1
Higher associativity               +         –    1
Way prediction                     +              2
Pseudoassociative caches           +              2
HW prefetching of instr/data       +    +         2/3
Compiler-controlled prefetching    +    +         3
Compiler techniques to cut misses  +              0
Reducing Miss Penalty
• We have already seen two techniques that reduce miss penalty
– Write buffers give priority to read misses over writes
– Merging write buffers: multiword writes are faster than many single-word writes
• Now we consider several more
– Victim caches
– Critical word first / early restart
– Multilevel caches
– Non-blocking caches
– Load bypassing/forwarding
Write Buffer Refresher
• Usually many more reads than writes
– Writes are about 20% of data cache traffic
– About 7% of overall traffic (I + D)
• A read stalls until the data arrives; a write stalls only until the data is taken
– Writes can be buffered and written later
• What about RAW hazards?
– Drain the buffer
– Search the buffer
• Write-allocate
– Turn a write miss into a read miss, then write
1. Reduce Miss Penalty: Read Priority
• Naive write-back cache:
– Read miss requiring eviction
– Write the dirty victim block to memory
– Read the new block from memory
– Return the read to the CPU
– Miss penalty = 2 memory accesses
• Improvement:
– Read miss requiring eviction
– Move the dirty victim block to the write buffer while reading the new block from memory
– Return the read to the CPU
– Empty the write buffer when memory is available
– Miss penalty = 1 memory access
2. Reduce Miss Penalty: Early Restart and Critical Word First
• The CPU normally just needs one word at a time
• Large cache blocks have long transfer times
• Don't wait for the full block to be loaded before restarting the CPU
– Early restart: as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue
– Critical word first: request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block. Also called wrapped fetch and requested word first.
• Generally useful only with large blocks
3. Reduce Miss Penalty: Load bypassing/forwarding
Load bypassing:
Store X
Store Y
Load Z (execute this before the stores)

Load forwarding:
Store X
Store Y
Load X (forward Store X's data to the load)
Four Possibilities for Load/Store Motion
Load-Load:
LW R1, (R2)
LW R3, (R4)

Load-Store:
LW R1, (R2)
SW (R3), R4

Store-Store:
SW R1, (R2)
SW R3, (R4)

Store-Load:
SW (R1), R2
LW R3, (R4)

• Load-Load can always be interchanged (if no volatiles)
• Load-Store and Store-Store are never interchanged
• Store-Load is the only promising program transformation
– The load is done earlier than planned, which can only help
– The store is done later than planned, which should cause no harm
• Two variants of the transformation
– If the load is independent of the store, we have load bypassing
– If the load is dependent on the store through memory (e.g., (R1) == (R4)), we have load forwarding
Reduce Miss Penalty via Load bypassing/forwarding
• In-order issue of loads/stores becomes out-of-order issue
• Problem: a load may alias with an in-transit store that is not yet in the store buffer
• Solution: a finished-load buffer, with loads completing in program order. If an alias is detected, reissue the load.
4. Reduce Miss Penalty: Non-blocking Caches
• A non-blocking cache or lockup-free cache allows the data cache to continue to supply cache hits during a miss
– Requires F/E bits on registers or out-of-order execution
– Requires multi-bank memories
– "Hit under miss" reduces the effective miss penalty by working during the miss vs. ignoring CPU requests
• "Hit under multiple miss" or "miss under miss" may further lower the effective miss penalty by overlapping multiple misses
– Significantly increases the complexity of the cache controller, as there can be multiple outstanding memory accesses
– Requires that main memory can service multiple misses
– The Pentium Pro allows 4 outstanding memory misses
Non-blocking caches
• Dual-ported, non-blocking cache
• Non-blocking: Missed load queue stores missed load instructions
5. Reducing Miss Penalty: Victim Caches
• Direct-mapped caches => many conflict misses
• Solution 1: more associativity (expensive)
• Solution 2: a victim cache
– A small (4- to 8-entry), fully associative cache between the L1 cache and the refill path
– Holds blocks discarded from the cache because of evictions
– Checked on a miss before going to the L2 cache
– A hit in the victim cache => swap the victim block with the cache block
– A 4-entry victim cache removed 20% to 95% of the conflicts for a 4KB direct-mapped data cache (Jouppi [1990])
– Used in Alpha and HP machines
Victim Cache
• Even one entry helps some benchmarks!
• Helps more for smaller caches and larger block sizes
6: Reduce Miss Penalty via a Second-Level Cache
• Q: Why a second cache level?
– A: the same reason as before!
• L1 is a cache for memory; L2 is a cache for L1
• L1 can be small and fast, to match the clock cycle time of the CPU
• L2 is large, to capture many accesses that would otherwise go to main memory
• Considerations
– L2 is not tied to the CPU
• It can have its own clock
• It is likely to be set associative
– L2 is accessed infrequently: it can be more clever
– L2 is not free ($$$)
– L2 increases the miss penalty for DRAM accesses
Second-Level Cache Performance
• L2 Equations
AMAT = Hit Time_L1 + Miss Rate_L1 × Miss Penalty_L1
Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 × Miss Penalty_L2
AMAT = Hit Time_L1 + Miss Rate_L1 × (Hit Time_L2 + Miss Rate_L2 × Miss Penalty_L2)
• Definitions:
– Local miss rate: misses in this cache divided by the total number of memory accesses to this cache (Miss Rate_L2)
• Number of misses / number of references to this cache
• The local miss rate of an L2 cache is usually not very good, because most locality has been filtered out by the L1 cache
– Global miss rate: misses in this cache divided by the total number of memory accesses generated by the CPU (Miss Rate_L1 × Miss Rate_L2)
• Number of misses / number of CPU references
• The global miss rate is what matters
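A worked instance of the L2 equations and the local-vs-global distinction above. The numeric values are illustrative, not from the slides.

```python
# Plug illustrative numbers into the two-level AMAT equations.

hit_time_l1     = 1      # cycles
miss_rate_l1    = 0.05   # misses per CPU reference
hit_time_l2     = 10     # cycles
local_miss_l2   = 0.40   # misses per reference *reaching* L2; often poor,
                         # since L1 has already filtered out most locality
miss_penalty_l2 = 100    # cycles to main memory

# Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 x Miss Penalty_L2
miss_penalty_l1 = hit_time_l2 + local_miss_l2 * miss_penalty_l2

# AMAT = Hit Time_L1 + Miss Rate_L1 x Miss Penalty_L1
amat = hit_time_l1 + miss_rate_l1 * miss_penalty_l1

# Global L2 miss rate = Miss Rate_L1 x Miss Rate_L2 (per CPU reference)
global_miss_l2 = miss_rate_l1 * local_miss_l2

print(miss_penalty_l1)          # 50.0 cycles seen by an L1 miss
print(amat)                     # 3.5 cycles average access time
print(round(global_miss_l2, 4)) # 0.02: only 2% of CPU references reach memory
```

The 40% local miss rate looks alarming, but the 2% global rate is what the CPU actually experiences, which is why the global rate is the measure that matters.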
Comparing Local and Global Miss Rates
• The global miss rate is close to the single-level cache rate, provided L2 >> L1 (32 KB)
• The local miss rate is not a good measure of a secondary cache
Another Graph To Stare At...
Base performance is an 8192-byte L2 cache with a hit time of 1 cycle. What do you see?
L2 Cache Performance
• For L2 caches
– Low latency and high bandwidth are less important
– A low miss rate is very important
– Why?
• L2 cache design favors
– Unified (I+D)
– Larger size (4-8 MB) at the expense of latency
– Larger block sizes (128-byte lines!)
– High associativity: 4, 8, 16 ways, at the expense of latency
Multilevel inclusion
• Inclusion is said to hold between level i and level i+1 if all data in level i is also in level i+1
• Desirable because, for I/O and multiprocessors, only the 2nd level has to be kept consistent
• Difficult when the levels use different block sizes
– When a block in L2 is to be replaced, all related blocks in L1 have to be invalidated => a slightly higher miss rate
• Multilevel exclusion: L1 data is never found in L2
– Do not waste space keeping multiple copies
– L1 misses result in a swap of blocks between L1 and L2
– The AMD Athlon has only a 256 KB L2 cache (plus two 64 KB L1 caches)
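The inclusion invariant and its back-invalidation cost can be sketched with plain sets; the addresses and the helper name are illustrative.

```python
# Sketch: multilevel inclusion means L1's contents are always a subset of
# L2's. When L2 replaces a block, any L1 copy must be invalidated too
# (a "back-invalidation"), which is the source of the extra L1 misses.

l1 = {0x10, 0x20}
l2 = {0x10, 0x20, 0x30, 0x40}     # inclusion holds: l1 is a subset of l2

def l2_replace(victim):
    l2.discard(victim)
    l1.discard(victim)            # back-invalidate L1 to preserve inclusion

assert l1 <= l2                   # invariant holds before the replacement
l2_replace(0x20)
assert l1 <= l2                   # ...and after: 0x20 is gone from both levels
print(sorted(l1), sorted(l2))
```

Under exclusion the rule is the opposite: a block lives in exactly one level, and an L1 miss that hits in L2 swaps the blocks instead of copying.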
Summary: Miss Penalty Reduction
CPU time = IC × (CPI_Execution + Memory accesses/Instruction × Miss rate × Miss penalty) × Clock cycle time
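A quick numeric instance of the CPU-time equation above; all of the input values are illustrative.

```python
# Evaluate CPU time = IC x (CPI_Execution + accesses/instr x miss rate
#                           x miss penalty) x clock cycle time

ic            = 1_000_000   # instruction count
cpi_execution = 1.0         # CPI ignoring memory stalls
mem_per_instr = 1.5         # memory accesses per instruction
miss_rate     = 0.02
miss_penalty  = 100         # cycles
clock_cycle   = 1e-9        # seconds (1 GHz clock)

stall_cpi = mem_per_instr * miss_rate * miss_penalty   # 3 extra CPI
cpu_time = ic * (cpi_execution + stall_cpi) * clock_cycle
print(cpu_time)   # ~0.004 s: memory stalls (3 CPI) dwarf execution (1 CPI)
```

Even a 2% miss rate triples the effective CPI here, which is why the miss-penalty reduction techniques above are worth their complexity.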
• Five techniques
– Read priority over write on miss
– Early Restart and Critical Word First
– Victim Caches
– Non-blocking Caches (Hit under Miss, Miss under Miss)
– Second-Level Caches
• Can be applied recursively to multilevel caches
– The danger is that the time to DRAM will grow with multiple levels in between
– First attempts at L2 caches can make things worse, since the increased worst case is worse
Cache Optimization Summary
Technique                           MR   MP   HT   Complexity
-- Miss Rate --
Larger Block Size                   +    –         0
Larger Caches                       +         –    1
Higher Associativity                +         –    1
Way Prediction                      +              2
Pseudoassociative Caches            +              2
HW Prefetching of Instr/Data        +    +         2/3
Compiler Controlled Prefetching     +    +         3
Compiler Reduce Misses              +              0
-- Miss Penalty --
Priority to Read Misses                  +         1
Early Restart & Critical Word 1st        +         2
Victim cache                        +    +         2
Load bypassing/forwarding                +         3
Non-Blocking Caches                      +         3
Second Level Caches                      +         2
Write buffer merging                     +         1
Improving Cache Performance
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache.
AMAT = Hit Time + Miss Rate × Miss Penalty
1. Reduce Hit time via Small and Simple Caches
• Always need to read the tag array
– Direct mapping: can read the data in parallel (why?)
• Speed up external L2 by keeping tags on-chip
2. Reduce Hit Time by Avoiding Address Translation
• Conventional organization:
– translate on every access
CPU --VA--> TLB --PA--> $ --PA--> MEM
• Virtually addressed cache:
CPU --VA--> $ --VA--> TLB --PA--> MEM
– translate only on a miss– Looks great! But...
Virtually Addressed Cache
• Aliasing: different virtual addresses can map to the same physical address
– H/W can check all aliases and invalidate on a miss
• Example: the 21264 uses a 64 KB I$ with 8 KB pages, so there are 8 possible aliases.
– Page coloring: set-associative mapping applied to VM page mapping, so that aliases match in their low-order address bits
• Aliases will always have the same index and offset
[Figure: VA and PA each split into Tag | Index | Offset fields; with page coloring, the index and offset bits are identical in VA and PA]
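The arithmetic behind the 21264 example above: when a virtually indexed cache is larger than a page, a given physical block can live under cache_size / page_size different indices. The variable names are illustrative, and a direct-mapped cache is assumed.

```python
# Number of possible aliases in a virtually indexed cache:
# each 8 KB page can sit at any of cache_size / page_size positions.

cache_size = 64 * 1024   # 64 KB I-cache (direct-mapped assumed)
page_size  = 8 * 1024    # 8 KB pages

aliases = cache_size // page_size
print(aliases)   # 8 possible aliases, as stated on the slide
```

Page coloring eliminates the problem by forcing the OS to pick physical pages whose low-order bits match the virtual page's, so all "aliases" collapse to one index.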
Virtually Addressed Cache
• Multiprocessing
– Processes use the same virtual addresses for their own physical addresses
– Flush the cache on a context switch
• Ouch! Cost = flush + compulsory misses
– Extend the tag with a Process Identifier (PID)
• Page-level protection is enforced during VA->PA address translation
– Copy protection info from the TLB to the cache on a miss
– Adds a datapath
• Input/Output direct memory access
– DMA uses physical addresses. How to invalidate?
Virtually Indexed, Physically Tagged Cache
• Best of both worlds?
– Start the cache read immediately
– Only need the PA for the tag compare
CPU --VA--> ($ indexed with VA || TLB translates VA to PA) --PA tag compare--> MEM
• But index + offset are limited to the VM page size
– Increase cache size through higher associativity
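The size limit above follows directly from the constraint that the index and offset bits must come from the untranslated page offset, so a VIPT cache can hold at most page_size × associativity bytes. The page size chosen here is illustrative.

```python
# Max size of a virtually indexed, physically tagged cache:
# index + offset bits must fit in the page offset, so
# cache_size <= page_size x associativity. Growing associativity is
# therefore how a VIPT cache grows.

page_size = 4 * 1024          # 4 KB pages (illustrative)

for ways in (1, 2, 4, 8):
    max_cache = page_size * ways
    print(f"{ways}-way: up to {max_cache // 1024} KB")
# 1-way: up to 4 KB
# 2-way: up to 8 KB
# 4-way: up to 16 KB
# 8-way: up to 32 KB
```

This is why L1 caches in practice pair modest sizes with high associativity rather than growing the index.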
3. Reduce Hit Time by Pipelining Cache Access
• Pipeline the access: split it into multiple cycles for a faster cycle time
• Increases throughput rather than decreasing latency
4. Reduce Hit Time With A Trace Cache
• Predict instruction sequences and fetch them into the instruction cache
– Fold the branch predictor into the cache
– Fragmentation: small basic blocks are not aligned to large cache blocks
– Multiple copies: the same instruction can be part of several traces
Cache Optimization Summary
Technique                           MR   MP   HT   Complexity
-- Miss Rate --
Larger Block Size                   +    –         0
Larger Caches                       +         –    1
Higher Associativity                +         –    1
Way Prediction                      +              2
Pseudoassociative Caches            +              2
HW Prefetching of Instr/Data        +    +         2/3
Compiler Controlled Prefetching     +    +         3
Compiler Reduce Misses              +              0
-- Miss Penalty --
Priority to Read Misses                  +         1
Early Restart & Critical Word 1st        +         2
Victim cache                        +    +         2
Load bypassing/forwarding                +         3
Non-Blocking Caches                      +         3
Second Level Caches                      +         2
Write buffer merging                     +         1
-- Hit Time --
Small & Simple Caches               –         +    0
Avoiding Address Translation                  +    2
Pipelining Caches                             +    1