Cache Improvements James Brock, Joseph Schmigel May 12, 2006 – Computer Architecture


Page 1: Cache Improvements James Brock, Joseph Schmigel May 12, 2006 – Computer Architecture

Cache Improvements
James Brock, Joseph Schmigel
May 12, 2006 – Computer Architecture

Page 2: Cache Improvements James Brock, Joseph Schmigel May 12, 2006 – Computer Architecture

Outline

Introduction

Reactive-Associative Caches

Non-Uniform Cache Architectures

Conclusion / References

Questions

Page 3: Cache Improvements James Brock, Joseph Schmigel May 12, 2006 – Computer Architecture

Cache Problem Domains

Average memory access time = Hit Time + Miss Rate × Miss Penalty

Hit Time – time to search the cache and return the data

Miss Rate – fraction of accesses for which the needed data is not in the cache and must be fetched from main memory

Cache Latency – physical delay to move data from the cache to the registers
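For concreteness, here is a minimal C sketch of how these terms combine in the average memory access time formula; the cycle counts and miss rate are made-up illustrative values, not measurements from the talk.

    #include <stdio.h>

    /* Average memory access time = hit time + miss rate * miss penalty.
       The numbers below are illustrative assumptions only. */
    int main(void) {
        double hit_time     = 2.0;    /* assumed cycles for a hit       */
        double miss_rate    = 0.05;   /* assumed 5% of accesses miss    */
        double miss_penalty = 100.0;  /* assumed cycles to main memory  */

        double amat = hit_time + miss_rate * miss_penalty;
        printf("average access time = %.1f cycles\n", amat);  /* 7.0 */
        return 0;
    }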

Page 4: Cache Improvements James Brock, Joseph Schmigel May 12, 2006 – Computer Architecture

Hit Time / Miss Rate

Searching for cache hits
Using set-associative caches increases hit times greatly: multiple ways need to be checked for a hit, and then the data in the matching way needs to be accessed.

Miss Rate
Direct-mapped caches have high miss rates, and very small changes in miss rate can affect performance greatly.

Page 5: Cache Improvements James Brock, Joseph Schmigel May 12, 2006 – Computer Architecture

Latency / Mapping

Latency
Cache latency is a primary reason for multi-layered, complex cache architectures.
It is very difficult to improve due to physical limitations.

Mapping
How data is mapped into the cache (associativity, physical location).
Better mapping heuristics can reduce the average search time and latency.

Page 6: Cache Improvements James Brock, Joseph Schmigel May 12, 2006 – Computer Architecture

Effects of Cache Changes

Power
More complex cache architectures will use more power to complete tasks.

Time
The more complex or larger in size a cache, the slower it will be.

Real Estate
Complexity is directly proportional to the number and length of wire traces.

Hits / Misses
Each change to the cache will impact the hit time and miss rate in some way.

Page 7: Cache Improvements James Brock, Joseph Schmigel May 12, 2006 – Computer Architecture

Reactive-Associative Caches
Joseph Schmigel

Page 8: Cache Improvements James Brock, Joseph Schmigel May 12, 2006 – Computer Architecture

Reactive-Associative Caches

Attempt to combine direct-mapped and set-associative caches.

The goal is to decrease the miss rate while keeping hit times similar to direct-mapped.

Avoids the disadvantages of each: direct-mapped has a high miss rate, set-associative has a high hit time.

Several major parts: data array, tag array, probes, way prediction, feedback.

Page 9: Cache Improvements James Brock, Joseph Schmigel May 12, 2006 – Computer Architecture

Data Array & Tag Arrays

The data array is the actual cache that stores the data.

The data array has two address mappings: one that is direct-mapped, and one that is set-associative (usually 2, 4, or 8 ways).

The tag array has n tag banks, where n is the number of ways.

The tag array is used to store the tags of each set-associative index.

Each tag bank is searched in parallel.
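As a rough illustration, the C sketch below shows how a single address yields both a direct-mapped index and a set-associative index into the same data array. The cache geometry and bit slicing are assumptions for the example, not values taken from the reactive-associative cache paper.

    #include <stdint.h>
    #include <stdio.h>

    /* Assumed geometry: 64-byte lines, 512 direct-mapped frames,
       which can also be viewed as 128 sets of 4 ways. */
    #define LINE_BYTES  64
    #define DM_FRAMES   512
    #define WAYS        4
    #define SA_SETS     (DM_FRAMES / WAYS)     /* 128 */

    int main(void) {
        uint64_t addr = 0x4A3F80;
        uint64_t line = addr / LINE_BYTES;

        unsigned dm_index = line % DM_FRAMES;  /* direct-mapped view: one fixed frame   */
        unsigned sa_set   = line % SA_SETS;    /* set-associative view: one of 128 sets */

        /* The tag array would hold WAYS tag banks, one per way of sa_set,
           all searched in parallel. */
        printf("line 0x%llx -> direct-mapped frame %u, set %u (ways 0..%d)\n",
               (unsigned long long)line, dm_index, sa_set, WAYS - 1);
        return 0;
    }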

Page 10: Cache Improvements James Brock, Joseph Schmigel May 12, 2006 – Computer Architecture

Probes

Two probes (Probe0 & Probe1) are used to signal a hit.

Probe0 performs three steps in parallel: it looks for a direct-mapped hit, uses the way prediction to find a hit, and searches for a hit in the tag array.

Probe0 tries to keep the hit time equal to that of a direct-mapped hit; it only fails to do so if it has to use the tag array.

Probe1 is only used if Probe0 does not find a direct-mapped hit or a way-predicted hit. It then returns a hit if there is a correct match in the set-associative cache.

Page 11: Cache Improvements James Brock, Joseph Schmigel May 12, 2006 – Computer Architecture

Probes continued

This means that the following possibilities exist:
Probe0 hits on direct-mapped and Probe1 is ignored
Probe0 hits on way-prediction and Probe1 is ignored
Probe0 hits using the tag array and Probe1 hits using the way found from the tag array
Probe0 misses and Probe1 is ignored
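The cases above reduce to a simple priority decision. The C sketch below encodes that flow with the hit signals passed in as plain flags, since the actual lookup hardware isn't modeled; the type and function names are invented for illustration and are not from the paper.

    /* Outcome of one access, following the Probe0/Probe1 cases listed above. */
    typedef enum { MISS, HIT_DIRECT, HIT_WAY_PREDICTED, HIT_VIA_TAG_ARRAY } outcome;

    /* dm_hit:  Probe0 found a direct-mapped hit
       wp_hit:  Probe0 found a way-predicted hit
       tag_way: way returned by the parallel tag-array search, or -1 if none */
    outcome resolve_probes(int dm_hit, int wp_hit, int tag_way) {
        if (dm_hit)       return HIT_DIRECT;         /* Probe1 ignored            */
        if (wp_hit)       return HIT_WAY_PREDICTED;  /* Probe1 ignored            */
        if (tag_way >= 0) return HIT_VIA_TAG_ARRAY;  /* Probe1 reads tag_way      */
        return MISS;                                 /* Probe1 ignored; miss path */
    }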

Page 12: Cache Improvements James Brock, Joseph Schmigel May 12, 2006 – Computer Architecture

Way Prediction

Allows the block to be accessed without performing a tag lookup to obtain the way

Keeps hit times comparable to that of direct-mapped

Must be performed early enough that the data can be ready in time for the pipeline stage that needs it

Prediction can only use information that is currently available in the pipeline

Two types of way prediction were used – XOR and Program Counter

Page 13: Cache Improvements James Brock, Joseph Schmigel May 12, 2006 – Computer Architecture

XOR Way Prediction

Calculates the approximate data address by XOR'ing the register value with the instruction offset.

Works because the small memory offsets that are common in practice can be XOR'ed with the base register to give a reliable approximate block address to use as a prediction.

Cannot be done until late in the pipeline, because the registers need to be loaded before performing the calculation.

More accurate than program counter way prediction.
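A small C sketch of the XOR idea: approximate base + offset with base XOR offset so a way-prediction table can be indexed before the full address add completes. The table size, bit slicing, and function names are assumptions for illustration only.

    #include <stdint.h>

    #define PRED_ENTRIES 256   /* assumed prediction-table size */
    #define WAYS         4     /* assumed associativity         */

    static uint8_t predicted_way[PRED_ENTRIES];

    /* XOR the base register with the offset as a fast stand-in for the
       effective address, then hash it into the way-prediction table. */
    unsigned xor_predict_way(uint64_t base_reg, int64_t offset) {
        uint64_t approx = base_reg ^ (uint64_t)offset;
        unsigned idx = (unsigned)((approx >> 6) & (PRED_ENTRIES - 1)); /* skip block-offset bits */
        return predicted_way[idx] % WAYS;
    }

    /* After the real lookup resolves, remember which way the block was in. */
    void xor_train(uint64_t base_reg, int64_t offset, unsigned actual_way) {
        uint64_t approx = base_reg ^ (uint64_t)offset;
        unsigned idx = (unsigned)((approx >> 6) & (PRED_ENTRIES - 1));
        predicted_way[idx] = (uint8_t)actual_way;
    }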

Page 14: Cache Improvements James Brock, Joseph Schmigel May 12, 2006 – Computer Architecture

Program Counter Way Prediction

Associates parts of the cache with the program counter.

Not as accurate as XOR prediction, since the same program counter does not access the same memory location every time.

The program counter is calculated early in the pipeline, so it is easier to make the prediction in time.

Page 15: Cache Improvements James Brock, Joseph Schmigel May 12, 2006 – Computer Architecture

Feedback

Three types of feedback:
Reactive displacement
Eviction of unpredictable blocks
Eviction of hard-to-predict blocks

Feedback tries to maximize bandwidth and minimize hit latency: highly predictable blocks are used in the set-associative cache, while blocks that cannot be predicted reliably are kept in the direct-mapped cache.
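One plausible way to realize this kind of feedback, sketched in C purely as an assumption (the paper's actual mechanism is not reproduced here): keep a small saturating confidence counter per block, raise it on correct way predictions, lower it on mispredictions, and keep low-confidence blocks in their direct-mapped position.

    #include <stdint.h>
    #include <stdbool.h>

    #define BLOCKS    512   /* assumed number of tracked blocks */
    #define THRESHOLD 2     /* assumed confidence threshold     */

    static uint8_t confidence[BLOCKS];   /* 2-bit saturating counters */

    /* Update prediction confidence for a block after each access. */
    void feedback_update(unsigned block, bool prediction_correct) {
        if (prediction_correct) {
            if (confidence[block] < 3) confidence[block]++;
        } else {
            if (confidence[block] > 0) confidence[block]--;
        }
    }

    /* Highly predictable blocks stay in the set-associative placement;
       unpredictable ones are kept in the direct-mapped slot. */
    bool keep_set_associative(unsigned block) {
        return confidence[block] >= THRESHOLD;
    }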

Page 16: Cache Improvements James Brock, Joseph Schmigel May 12, 2006 – Computer Architecture

Non-Uniform Cache Architectures
James Brock

Page 17: Cache Improvements James Brock, Joseph Schmigel May 12, 2006 – Computer Architecture

Cache Organization

Multiple-Layer Cache
Hierarchical organization designed for faster accesses to the layers of cache closer to the core.
Replacement policies are static, i.e. a replacement causes one insertion and one eviction at the same location in the cache.

Uniform Cache
The cache architecture is physically laid out in uniformly distributed banks and sub-banks.

Page 18: Cache Improvements James Brock, Joseph Schmigel May 12, 2006 – Computer Architecture

AMD64 Cache Design (K8 Core)

Page 19: Cache Improvements James Brock, Joseph Schmigel May 12, 2006 – Computer Architecture

Problem Domain

CPUs are becoming wire-delay dominated: as the core speed of CPUs increases, the latency of transmission delays has a greater effect on overall performance.

Two possible paths:
Reduce the latency of wire traces (runs into physical limitations)
Accept the latency as part of the design, and optimize around it

Page 20: Cache Improvements James Brock, Joseph Schmigel May 12, 2006 – Computer Architecture

Solution 1: Static Non-Uniform Cache Architectures (S-NUCA)

All designs were modeled as L2 cache, but can be scaled to work at any layer.

A uniform cache is only as fast as its slowest bank.

A non-uniform cache exploits the lower latency of the (sub)banks closer to the decoder for better performance.

S-NUCA: static means that data in main memory is mapped to 1 … n fixed locations in the cache, where n = associativity.

Page 21: Cache Improvements James Brock, Joseph Schmigel May 12, 2006 – Computer Architecture

Solution 1: Static Non-Uniform Cache Architectures (S-NUCA)

S-NUCA-1 and S-NUCA-2 bank layouts (figure)

Page 22: Cache Improvements James Brock, Joseph Schmigel May 12, 2006 – Computer Architecture

Solution 1: Static Non-Uniform Cache Architectures (S-NUCA)

S-NUCA-1
Individual data and address channels for each bank
Multiple banks can be accessed in parallel
HUGE real-estate cost to add channels for each bank

S-NUCA-2
Mesh grid of data and address channels
Switches at each intersection access multiple sub-banks in parallel and arbitrate data flow

Page 23: Cache Improvements James Brock, Joseph Schmigel May 12, 2006 – Computer Architecture

Solution 2: Dynamic NUCA (D-NUCA)

Dynamic refers to the ranking and movement of cache lines within the banks and sub-banks.

The replacement policy is not a simple insert & evict: insertion, demotion, and eviction are based on the replacement heuristic (for example, least recently used).

With D-NUCA, the mapping, searching, and line-movement problems expand.

Page 24: Cache Improvements James Brock, Joseph Schmigel May 12, 2006 – Computer Architecture

Suggested Mappings

Page 25: Cache Improvements James Brock, Joseph Schmigel May 12, 2006 – Computer Architecture

D-NUCA Mapping & Searching

Uses spread sets of banks: the number of banks in a set equals the associativity of the cache.

Simple Mapping
Search by set, then bank, then the tags within the set
Some sets are further away than others, and rows may not have the desired number of ways

Fair Mapping
Fixes the problems of simple mapping, but is more complex
Equal access times to all banks
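A toy C sketch of the simple-mapping idea above: low-order index bits choose a bank set, and the line may live in any of the banks of that set, which sit at increasing distance from the cache controller. The geometry, bit slicing, and nearest-first search order are assumptions, not the paper's parameters.

    #include <stdint.h>
    #include <stdio.h>

    #define LINE_BYTES    64
    #define BANK_SETS     8    /* assumed number of bank sets (columns)      */
    #define BANKS_PER_SET 4    /* = associativity: possible homes for a line */

    int main(void) {
        uint64_t addr = 0x12345678;
        unsigned set  = (unsigned)((addr / LINE_BYTES) % BANK_SETS);

        printf("address 0x%llx maps to bank set %u;\n", (unsigned long long)addr, set);
        printf("the line may reside in any of %d banks of that set,\n", BANKS_PER_SET);
        printf("searched nearest-to-farthest from the cache controller.\n");
        return 0;
    }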

Page 26: Cache Improvements James Brock, Joseph Schmigel May 12, 2006 – Computer Architecture

D-NUCA Mapping & Searching

Shared Mapping
The closest banks are shared with the farthest bank sets
If n sets share a bank, then all banks in the cache are n-way associative
Slightly higher bank associativity is traded for a lower average access latency
Cache lines from farther bank sets can be located right next to the cache controller

Page 27: Cache Improvements James Brock, Joseph Schmigel May 12, 2006 – Computer Architecture

D-NUCA Mapping & Searching

Locating cache lines

Incremental Search – one bank at a time
Low power, fewer messages on the cache network
Lower performance

Multicast Search – some/all banks at the same time
More power, more network contention
Faster hits to farther banks
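A rough C sketch of the difference between the two search styles over one bank set. The bank contents and the probe_bank helper are toy stand-ins; in hardware the probes would be messages on the switched cache network, and the multicast case can only be modeled sequentially here.

    #include <stdbool.h>
    #include <stdio.h>

    #define BANKS_PER_SET  4
    #define LINES_PER_BANK 2

    /* Toy contents of one bank set, bank 0 being closest to the controller. */
    static unsigned long banks[BANKS_PER_SET][LINES_PER_BANK] = {
        {0x10, 0x11}, {0x20, 0x21}, {0x30, 0x31}, {0x40, 0x41}
    };

    static bool probe_bank(int bank, unsigned long tag) {
        for (int i = 0; i < LINES_PER_BANK; i++)
            if (banks[bank][i] == tag) return true;
        return false;
    }

    /* Incremental search: one bank at a time, nearest first.
       Fewer network messages and less power, but slower hits to far banks. */
    static int incremental_search(unsigned long tag) {
        for (int b = 0; b < BANKS_PER_SET; b++)
            if (probe_bank(b, tag)) return b;
        return -1;   /* miss in the whole bank set */
    }

    /* Multicast search: issue probes to all banks before waiting on any reply.
       More power and network contention, but faster hits to farther banks. */
    static int multicast_search(unsigned long tag) {
        bool hit[BANKS_PER_SET];
        for (int b = 0; b < BANKS_PER_SET; b++)
            hit[b] = probe_bank(b, tag);      /* issued in parallel in hardware */
        for (int b = 0; b < BANKS_PER_SET; b++)
            if (hit[b]) return b;
        return -1;
    }

    int main(void) {
        printf("incremental: tag 0x30 in bank %d\n", incremental_search(0x30));
        printf("multicast:   tag 0x30 in bank %d\n", multicast_search(0x30));
        return 0;
    }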

Page 28: Cache Improvements James Brock, Joseph Schmigel May 12, 2006 – Computer Architecture

D-NUCA Mapping & Searching

Hybrid Searches – combos!

Limited Multicast
Multicast to M banks in each bank set in parallel, with M < N

Partitioned Multicast
Similar to multi-level set-associative caches
Each bank set is broken up into subsets
Multicast searches are performed on each subset, starting with the closest subset

Page 29: Cache Improvements James Brock, Joseph Schmigel May 12, 2006 – Computer Architecture

D-NUCA Line Movement

The goal of D-NUCA is to maximize hits in the closest banks.

An LRU-style policy is applied when mapping lines within a bank set: the MRU lines are closest to the cache controller.

Replacement Policy – Generational Promotion
A cache hit causes that line to be moved one bank closer to the cache controller.

Page 30: Cache Improvements James Brock, Joseph Schmigel May 12, 2006 – Computer Architecture

D-NUCA Line Movement

Generational Promotion (cont'd)
More heavily used lines thus migrate towards the cache controller.
The eviction / insertion policy shouldn't simply eject the LRU line and insert the new line in that spot.
New lines are inserted towards the middle of the bank set, and allowed to progress forward or back.
The victim line can be evicted or simply demoted, with a less important line being evicted instead.
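A toy C sketch of generational promotion over one bank set: a hit swaps the line with its neighbor one bank closer to the controller, and a miss inserts the new line near the middle of the set. The one-line-per-bank model and the exact insertion point are simplifying assumptions for illustration.

    #include <stdio.h>

    #define BANKS 8   /* banks in one bank set; index 0 is closest to the controller */

    /* Toy bank set: one line per bank, identified by tag (0 = empty). */
    static unsigned long bank_set[BANKS] = {3, 7, 0, 12, 0, 9, 0, 5};

    static void access_line(unsigned long tag) {
        for (int b = 0; b < BANKS; b++) {
            if (bank_set[b] == tag) {            /* hit */
                if (b > 0) {                     /* promote one bank closer */
                    unsigned long neighbor = bank_set[b - 1];
                    bank_set[b - 1] = tag;
                    bank_set[b] = neighbor;      /* neighbor is demoted one bank */
                }
                return;
            }
        }
        /* Miss: insert toward the middle of the bank set instead of simply
           replacing the farthest line; the displaced line is the victim. */
        bank_set[BANKS / 2] = tag;
    }

    int main(void) {
        access_line(12);                         /* hit: 12 moves from bank 3 to bank 2 */
        access_line(42);                         /* miss: 42 inserted at bank 4         */
        for (int b = 0; b < BANKS; b++)
            printf("bank %d: %lu\n", b, bank_set[b]);
        return 0;
    }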

Page 31: Cache Improvements James Brock, Joseph Schmigel May 12, 2006 – Computer Architecture

Performance Improvement

Page 32: Cache Improvements James Brock, Joseph Schmigel May 12, 2006 – Computer Architecture

Conclusion

Cache improvements are often more work than the benefits they offer.
Complexity causes a speed decrease, which limits usefulness.
Implementing a complex caching structure does not usually provide a good cost/benefit ratio for companies.
Research is still being done, and it remains useful in the theoretical world.

Page 33: Cache Improvements James Brock, Joseph Schmigel May 12, 2006 – Computer Architecture

References

[1] Changkyu Kim, Doug Burger, and Stephen Keckler. An Adaptive, Non-Uniform Cache Structure for Wire-Delay Dominated On-Chip Caches. Computer Architecture and Technology Laboratory, University of Texas at Austin.

[2] http://en.wikipedia.org/wiki/CPU_cache

[3] B. Batson and T. N. Vijaykumar. Reactive associative caches. In Int. Conf. on Parallel Architectures and Compilation Techniques, Sep. 2001.

[4] John L. Hennessy and David A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann, 2003. Third Edition, Chapter Five.

Page 34: Cache Improvements James Brock, Joseph Schmigel May 12, 2006 – Computer Architecture

References (cont'd)

[5] B. Calder, D. Grunwald, and J. Emer. Predictive sequential associative cache. In Proceedings of the Second IEEE Symposium on High-Performance Computer Architecture, Feb. 1996.

[6] B. Calder and D. Grunwald. Next cache line and set prediction. In Proceedings of the 22nd International Symposium on Computer Architecture, June 1995.

[7] A. Agarwal and S. Pudar. Column-associative caches: A technique for reducing the miss rate of direct-mapped caches. In Proceedings of the 20th International Symposium on Computer Architecture, May 1993.

Page 35: Cache Improvements James Brock, Joseph Schmigel May 12, 2006 – Computer Architecture

Questions/Comments