
Lecture 12: Improving Cache Performance. Iakovos Mavroidis, Computer Science Department, University of Crete

Upload: programming-passion

Post on 15-Jul-2015


TRANSCRIPT

Page 1: Lec12 caches performance comp architecture

Lecture 12: Improving Cache Performance

Iakovos Mavroidis

Computer Science Department

University of Crete

Page 2:

Classification of Cache Optimizations

Page 3:

Common Advanced Caching Optimizations

Page 4:

Classification of Cache Optimizations

• Reduce Miss Penalty

• Reduce Miss Rate

• Reduce Hit Time

Page 5:

Multi-core Processor

Page 6:

1. Multi-level Caches

Page 7:

Multilevel caches

[Figure: multilevel cache hierarchy]

Page 8:

Multilevel cache miss rates


Page 9:

AMAT in multilevel caches
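The two-level AMAT formula can be sketched numerically. The figures below (1-cycle L1 hit, 4% L1 miss rate, 10-cycle L2 hit, 50% local L2 miss rate, 100-cycle memory penalty) are assumed for illustration and are not taken from the slides:

```c
#include <assert.h>

/* Two-level AMAT:
 *   AMAT = HitTime_L1 + MissRate_L1 * (HitTime_L2 + LocalMissRate_L2 * MissPenalty_L2)
 * Note: the L2 rate here is the *local* miss rate (misses per L2 access). */
double amat(double hit_l1, double mr_l1,
            double hit_l2, double local_mr_l2, double mem_penalty) {
    return hit_l1 + mr_l1 * (hit_l2 + local_mr_l2 * mem_penalty);
}
/* Example with the assumed figures: amat(1.0, 0.04, 10.0, 0.50, 100.0)
 * = 1 + 0.04 * (10 + 0.5 * 100) = 3.4 cycles. */
```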

Page 10:

Stalls in multilevel caches

Page 11:

L2 cache performance implications

Normalized to an 8192 KB L2 cache with a 1-clock-cycle hit time

Page 12:

Inclusion Property

AMD Athlon supports exclusive caches

Pentium 4 imposes no inclusion constraints (accidentally inclusive)

Page 13:

2. Critical word first and early restart

Page 14:

Write buffer & Victim Cache

Page 15:

3. Giving priority to read misses over writes

Page 16:

Giving priority to read misses over writes

All desktop and server processors give reads priority over writes.

(aka write-back buffer)
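A minimal sketch of the check this priority requires, assuming a small four-entry write buffer: before a read miss is sent to memory ahead of pending writes, the buffer is scanned for a matching address so the read never observes stale data.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define WB_ENTRIES 4   /* assumed buffer size, for illustration */

typedef struct { uint64_t addr; uint64_t data; bool valid; } WBEntry;
static WBEntry write_buffer[WB_ENTRIES];

/* On a read miss: scan the write buffer before going to memory.
 * Returns true (and forwards the buffered data) if a pending write
 * matches; otherwise the read may safely be serviced ahead of the
 * buffered writes. */
bool wb_forward(uint64_t addr, uint64_t *data) {
    for (int i = 0; i < WB_ENTRIES; i++)
        if (write_buffer[i].valid && write_buffer[i].addr == addr) {
            *data = write_buffer[i].data;
            return true;
        }
    return false;
}
```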

Page 17:

Merging write buffer

Page 18:

Merging write buffer

Page 19:

4. Victim Cache

A four-entry victim cache might remove one quarter of the misses in a 4-KB direct-mapped data cache.
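The mechanism can be sketched as a small fully associative buffer that holds blocks evicted from L1 and is probed on an L1 miss. The entry count, FIFO replacement, and tag-only layout below are simplifications for illustration:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VC_ENTRIES 4

typedef struct { uint64_t tag; bool valid; } VCEntry;
static VCEntry vc[VC_ENTRIES];
static int vc_next = 0;            /* FIFO replacement, for simplicity */

/* Called when L1 evicts a block: the victim is kept around a while longer. */
void vc_insert(uint64_t tag) {
    vc[vc_next] = (VCEntry){ tag, true };
    vc_next = (vc_next + 1) % VC_ENTRIES;
}

/* Called on an L1 miss: a hit here turns the miss into a fast refill. */
bool vc_probe(uint64_t tag) {
    for (int i = 0; i < VC_ENTRIES; i++)
        if (vc[i].valid && vc[i].tag == tag) {
            vc[i].valid = false;   /* block swaps back into L1 */
            return true;
        }
    return false;
}
```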

Page 20:

5. Non-blocking or Lockup-free Caches

Out-of-order pipelines already have this functionality built in (load queues, etc.).

Page 21:

Potential of Non-blocking Caches

Page 22:

Miss Status Handling Register
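One way to sketch an MSHR file, with assumed sizes: each valid entry tracks one outstanding block miss, and later misses to the same block merge into it instead of issuing a second memory request.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MSHRS       4   /* assumed number of outstanding misses supported */
#define MAX_TARGETS 4   /* assumed loads/stores that can wait per miss */

typedef struct {
    uint64_t block_addr;
    int num_targets;    /* instructions waiting on this block */
    bool valid;
} MSHR;
static MSHR mshr[MSHRS];

/* Returns: 1 = primary miss, a new memory request is issued;
 *          0 = secondary miss, merged into an existing entry;
 *         -1 = structural stall (no free entry or target slot). */
int mshr_allocate(uint64_t block_addr) {
    for (int i = 0; i < MSHRS; i++)
        if (mshr[i].valid && mshr[i].block_addr == block_addr) {
            if (mshr[i].num_targets == MAX_TARGETS) return -1;
            mshr[i].num_targets++;
            return 0;   /* no second memory request for the same block */
        }
    for (int i = 0; i < MSHRS; i++)
        if (!mshr[i].valid) {
            mshr[i] = (MSHR){ block_addr, 1, true };
            return 1;
        }
    return -1;          /* all MSHRs busy: the cache must stall */
}
```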

Page 23:

Non-blocking Caches: Operation

Page 24:

6. Multi-ported Caches

Page 25:

True Multi-porting

Page 26:

Multi-banked Caches

Page 27:

Sun UltraSPARC T2 8-bank L2 cache

Page 28:

Classification of Cache Optimizations

• Reduce Miss Penalty

• Reduce Miss Rate

• Reduce Hit Time

Page 29:

3 C’s model

Page 30:

Associativity and conflict misses

[Figure: miss rate]

• Compulsory misses are those that occur in an infinite cache

• Capacity misses are those that occur in a fully associative cache

• Conflict misses are the additional misses that occur when going from fully associative to 8-way associative, 4-way associative, and so on

Page 31:

2:1 cache rule of thumb

miss rate of a 1-way (direct-mapped) cache of size X ≈ miss rate of a 2-way set-associative cache of size X/2

[Figure: miss rate]

Page 32:

Miss rate distribution

[Figure: miss rate per type]

• Associativity tends to increase in modern caches (for example 8-way L1 and 16-way L3)

• Increased associativity may result in a more complex design and a slower clock

Page 33:

7. Increasing block size

Page 34:

Miss rate versus block size

Page 35:

AMAT versus block size

Page 36:

8. Larger Caches

Page 37:

9. Increasing Associativity

Page 38:

Increasing Associativity

Page 39:

AMAT versus Associativity

Miss rates from Hennessy and Patterson, Computer Architecture: A Quantitative Approach

Page 40:

10. Way Prediction

Page 41:

Pseudoassociativity

Page 42:

11. Prefetching

Page 43:

Software Prefetching
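A hedged software-prefetching sketch using GCC/Clang's `__builtin_prefetch`. The lookahead distance is a tuning parameter chosen arbitrarily here, not a value from the slides:

```c
#include <stddef.h>

#define PREFETCH_AHEAD 16   /* lookahead distance, machine-dependent tuning knob */

/* Summing a large array with explicit prefetch hints: each iteration
 * requests the element PREFETCH_AHEAD slots ahead so it is (ideally)
 * in cache by the time the loop reaches it. __builtin_prefetch is a
 * GCC/Clang hint: arguments are address, rw (0 = read), and temporal
 * locality (0-3); the hardware may ignore it. */
double sum_array(const double *a, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_AHEAD < n)
            __builtin_prefetch(&a[i + PREFETCH_AHEAD], 0, 3);
        sum += a[i];
    }
    return sum;
}
```

In practice the hardware prefetcher often covers simple sequential patterns like this one, so explicit hints should be kept only when measurement shows a benefit.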

Page 44:

Hardware Prefetching

Page 45:

Simple Sequential Prefetching

Page 46:

Stream Prefetching

Page 47:

Stream Buffer Design

Page 48:

Stream Buffer Design

Page 49:

Strided Prefetching
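A common hardware realization is a reference prediction table indexed by the load's PC. The sketch below assumes a 64-entry table and a simple "same stride seen twice" confidence rule; real designs vary:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define RPT_ENTRIES 64   /* assumed table size */

typedef struct {
    uint64_t pc, last_addr;
    int64_t  stride;
    bool     confident;   /* same non-zero stride observed twice in a row */
} RPTEntry;
static RPTEntry rpt[RPT_ENTRIES];

/* Called on every load. Returns the address to prefetch, or 0 when no
 * confident stride exists yet (0 doubles as a "no prefetch" sentinel). */
uint64_t rpt_access(uint64_t pc, uint64_t addr) {
    RPTEntry *e = &rpt[pc % RPT_ENTRIES];
    if (e->pc != pc) {                         /* new load: (re)allocate entry */
        *e = (RPTEntry){ pc, addr, 0, false };
        return 0;
    }
    int64_t stride = (int64_t)(addr - e->last_addr);
    e->confident = (stride != 0 && stride == e->stride);
    e->stride = stride;
    e->last_addr = addr;
    return e->confident ? addr + (uint64_t)stride : 0;
}
```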

Page 50:

Sandy Bridge Prefetching (Intel Core i7-2600K)

Page 51:

Other Ideas in Prefetching

Page 52:

12. Compiler Optimizations

Page 53:

Array merging
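The transformation can be sketched as follows; the field names are illustrative, since the slide gives no concrete example:

```c
#include <assert.h>

#define SIZE 1024

/* Before: two parallel arrays; val[i] and key[i] sit in different
 * cache blocks, so touching both fields of one element costs two
 * block fetches. */
int val[SIZE];
int key[SIZE];

/* After merging: the fields of element i share one cache block. */
struct merged_record { int v; int k; };
struct merged_record records[SIZE];

/* Touching both fields of one element now costs a single block fetch. */
int lookup(int i) { return records[i].v + records[i].k; }
```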

Page 54:

Loop Interchange

• Exchange the nesting of loops to take advantage of spatial locality: maximize use of a cache block before it is replaced.
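For C's row-major layout, the interchange looks like this (the array sizes are illustrative):

```c
#include <assert.h>

#define ROWS 64
#define COLS 64
static int x[ROWS][COLS];

/* Before: the inner loop varies i, so successive accesses are COLS
 * elements apart in memory and each may touch a different block. */
void scale_bad(void) {
    for (int j = 0; j < COLS; j++)
        for (int i = 0; i < ROWS; i++)
            x[i][j] = 2 * x[i][j];
}

/* After interchange: unit-stride accesses walk each cache block
 * completely before moving on. */
void scale_good(void) {
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            x[i][j] = 2 * x[i][j];
}
```

Both versions compute the same result; only the memory access order, and hence the miss rate, differs.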

Page 55:

Data blocking

Page 56:

Data blocking

memory accesses

Total required cache space to exploit locality = N² (for D) + N (for Y)

Page 57:

Data blocking

memory accesses

Total required cache space to exploit locality = B² (for D) + B (for Y)
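A blocked (tiled) matrix multiply sketch in the same spirit: the kernel works on B x B tiles so the active working set is on the order of B² elements of one operand plus B of another. The matrix names (x = y * z) and sizes are generic rather than the slide's D and Y:

```c
#include <assert.h>

#define N 8   /* small sizes for illustration; requires N % B == 0 */
#define B 4   /* blocking factor: pick so a tile fits in cache */

void matmul_blocked(double x[N][N], double y[N][N], double z[N][N]) {
    /* x must be zero-initialized; partial products accumulate across kk tiles. */
    for (int jj = 0; jj < N; jj += B)
        for (int kk = 0; kk < N; kk += B)
            for (int i = 0; i < N; i++)
                for (int j = jj; j < jj + B; j++) {
                    double r = 0.0;
                    for (int k = kk; k < kk + B; k++)
                        r += y[i][k] * z[k][j];   /* reuses the cached tiles */
                    x[i][j] += r;
                }
}
```

Choosing B is a trade-off: larger tiles amortize loop overhead, but the tile set must still fit in the cache level being targeted.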

Page 58:

Data blocking

Page 59:

Classification of Cache Optimizations

• Reduce Miss Penalty

• Reduce Miss Rate

• Reduce Hit Time

– Small and Simple Caches

– Virtually Addressed Caches

– Pipelined Caches

– Trace Caches