Cache Design and Tricks

Presenters: Kevin Leung, Josh Gilkerson, Albert Kalim, Shaz Husain


Page 1:

Cache Design and Tricks

Presenters: Kevin Leung, Josh Gilkerson, Albert Kalim, Shaz Husain

Page 2:

What is Cache?

A cache is simply a copy of a small data segment residing in the main memory.

Fast but small extra memory
Holds identical copies of main memory data
Lower latency
Higher bandwidth
Usually several levels (L1, L2, and L3)

Page 3:

Why is Cache Important?

In the old days, CPU clock frequency was the primary performance indicator.

Microprocessor execution speeds are improving at a rate of 50%-80% per year, while DRAM access times are improving at only 5%-10% per year.

For the same microprocessor operating at the same frequency, system performance then becomes a function of how well memory and I/O can satisfy the data requirements of the CPU.

Page 4:

Types of Cache and Its Architecture

There are three types of cache now in use:
One is on-chip with the processor, referred to as the "Level 1" (L1) or primary cache.
Another is on-die SRAM cache, the "Level 2" (L2) or secondary cache.
The third is the L3 cache.

PCs, servers, and workstations each use different cache architectures:
PCs use an asynchronous cache.
Servers and workstations rely on synchronous cache.
Super workstations rely on pipelined caching architectures.

Page 5:

Alpha Cache Configuration

Page 6:

General Memory Hierarchy

Page 7:

Cache Performance

Cache performance can be measured by counting wait-states for cache burst accesses.

In a burst access, one address is supplied by the microprocessor and four addresses' worth of data are transferred either to or from the cache.

Cache access wait-states occur when the CPU waits for a slower cache subsystem to respond to an access request.

Depending on the clock speed of the central processor, it takes roughly:
5 to 10 ns to access data in an on-chip cache,
15 to 20 ns to access data in an SRAM cache,
60 to 70 ns to access DRAM-based main memory,
12 to 16 ms to access disk storage.

Page 8:

Cache Issues

Latency and bandwidth are the two metrics associated with caches and memory.

Latency: the time for memory to respond to a read (or write) request is too long.
CPU ~ 0.5 ns (light travels 15 cm in a vacuum in that time); memory ~ 50 ns.

Bandwidth: the number of bytes that can be read (or written) per second.
A CPU with 1 GFLOPS peak performance as the standard needs about 24 Gbyte/sec of bandwidth (roughly 24 bytes of operand traffic, e.g. two 8-byte loads and one 8-byte store, per floating-point operation).
Present CPUs have a peak bandwidth of less than 5 Gbyte/sec, and much less in practice.

Page 9:

Cache Issues (continued)

Memory requests are satisfied from:
the fast cache (if it holds the appropriate copy): a cache hit;
the slow main memory (if the data is not in cache): a cache miss.

Page 10:

How is Cache Used?

The cache contains copies of some of main memory: those storage locations that were recently used.

When main memory address A is referenced in the CPU, the cache is checked for a copy of the contents of A.

If found (cache hit), the copy is used and there is no need to access main memory.

If not found (cache miss), main memory is accessed to get the contents of A, and a copy of the contents is also loaded into the cache.
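
A minimal sketch of this hit/miss flow in C, for a hypothetical direct-mapped cache (the names cache_lookup, NUM_LINES, and LINE_SIZE are illustrative, not from the slides):

    #include <stdint.h>
    #include <string.h>

    #define LINE_SIZE 64              /* bytes per cache line (assumed) */
    #define NUM_LINES 1024            /* lines in the cache (assumed)   */

    struct cache_line {
        int      valid;
        uint64_t tag;
        uint8_t  data[LINE_SIZE];
    };

    static struct cache_line cache[NUM_LINES];

    /* Return a pointer to the cached copy of the byte at main-memory address a,
       loading the containing line from memory first if it is not present. */
    uint8_t *cache_lookup(uint64_t a, const uint8_t *main_memory)
    {
        uint64_t block = a / LINE_SIZE;            /* which line-sized block  */
        uint64_t index = block % NUM_LINES;        /* the one slot it can use */
        uint64_t tag   = block / NUM_LINES;        /* identifies the block    */
        struct cache_line *line = &cache[index];

        if (!line->valid || line->tag != tag) {    /* cache miss */
            memcpy(line->data, main_memory + block * LINE_SIZE, LINE_SIZE);
            line->valid = 1;
            line->tag   = tag;
        }                                          /* otherwise: cache hit */
        return &line->data[a % LINE_SIZE];
    }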

Page 11:

Progression of Cache

Before the 80386, DRAM was still faster than the CPU, so no cache was used.
4004: 4 KB main memory.
8008 (1971): 16 KB main memory.
8080 (1973): 64 KB main memory.
8085 (1977): 64 KB main memory.
8086 (1978), 8088 (1979): 1 MB main memory.
80286 (1983): 16 MB main memory.

Page 12:

Progression of Cache (continued)

80386 (1986), 80386SX:
Can access up to 4 GB of main memory; external cache starts being used. The 80386SX accesses 16 MB through a 16-bit data bus and 24-bit address bus.

80486 (1989), 80486DX:
Introduces an internal L1 cache: 8 KB. Can use an external L2 cache.

Pentium (1993): 32-bit microprocessor, 64-bit data bus and 32-bit address bus.
16 KB L1 cache (split instruction/data: 8 KB each). Can use an external L2 cache.

Page 13:

Progression of Cache (continued)

Pentium Pro (1995): 32-bit microprocessor, 64-bit data bus and 36-bit address bus.
64 GB main memory. 16 KB L1 cache (split instruction/data: 8 KB each). 256 KB L2 cache.

Pentium II (1997): 32-bit microprocessor, 64-bit data bus and 36-bit address bus.
64 GB main memory. 32 KB split instruction/data L1 caches (16 KB each). Module-integrated 512 KB L2 cache (133 MHz), mounted on a slot.

Page 14:

Progression of Cache (continued)

Pentium III (1999): 32-bit microprocessor, 64-bit data bus and 36-bit address bus.
64 GB main memory. 32 KB split instruction/data L1 caches (16 KB each). On-chip 256 KB L2 cache running at core speed (up to 1 MB). Dual Independent Bus (simultaneous L2 and system memory access).

Pentium 4 and recent:
L1 = 8 KB, 4-way, line size = 64 bytes. L2 = 256 KB, 8-way, line size = 128 bytes. The L2 cache can grow up to 2 MB.

Page 15:

Progression of Cache (continued)

Intel Itanium: L1 = 16 KB, 4-way. L2 = 96 KB, 6-way. L3: off-chip, size varies.

Intel Itanium 2 (McKinley / Madison): L1 = 16 / 32 KB. L2 = 256 / 256 KB. L3: 1.5 or 3 / 6 MB.

Page 16:

Cache Optimization

General Principles:
Spatial Locality
Temporal Locality

Common Techniques:
Instruction Reordering
Modifying Memory Access Patterns

Many of these examples have been adapted from the ones used by Dr. C.C. Douglas et al. in previous presentations.

Page 17:

Optimization Principles

In general, optimizing cache usage is an exercise in taking advantage of locality.

There are 2 types of locality:
spatial
temporal

Page 18:

Spatial Locality

Spatial locality refers to accesses close to one another in position.

Spatial locality is important to the caching system because an entire contiguous cache line is loaded from memory when the first piece of that line is loaded.

Subsequent accesses within the same cache line are then practically free until the line is flushed from the cache.

Spatial locality is not only an issue in the cache, but also within most main memory systems.
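
A small illustration in C (array size and layout are illustrative): traversing a row-major array row by row touches consecutive addresses and reuses each loaded cache line, while traversing column by column jumps a full row ahead on every access and gets little benefit from each line.

    #define N 1024
    double a[N][N];
    double sum = 0.0;
    int i, j;

    /* Good spatial locality: consecutive elements of a row share a cache line. */
    for (i = 0; i < N; ++i)
        for (j = 0; j < N; ++j)
            sum += a[i][j];

    /* Poor spatial locality: successive accesses are N*sizeof(double) bytes
       apart, so nearly every access touches a different cache line. */
    for (j = 0; j < N; ++j)
        for (i = 0; i < N; ++i)
            sum += a[i][j];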

Page 19:

Temporal Locality

Temporal locality refers to two accesses to the same piece of memory within a small period of time.

The shorter the time between the first and last access to a memory location, the less likely it is to be loaded from main memory or slower caches multiple times.
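
A short sketch in C (the sizes and names are illustrative): the small coef and y arrays are touched on every iteration of the outer loop, so after the first pass they stay resident in the cache and nearly all of their accesses are hits.

    #define N    1000000
    #define TAPS 16
    double x[N], y[TAPS], coef[TAPS];
    int i, j;

    for (i = 0; i < N; ++i) {
        double xi = x[i];              /* each x[i] is streamed in once       */
        for (j = 0; j < TAPS; ++j)
            y[j] += coef[j] * xi;      /* coef[] and y[] are reused on every
                                          iteration: temporal locality        */
    }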

Page 20:

Optimization Techniques

Prefetching
Software pipelining
Loop blocking
Loop unrolling
Loop fusion
Array padding
Array merging

Page 21:

Prefetching

Many architectures include a prefetch instruction that is a hint to the processor that a value will be needed from memory soon.

When the memory access pattern is well defined and the programmer knows it many instructions ahead of time, prefetching results in very fast access when the data is needed.

Page 22:

Prefetching (continued)

It does no good to prefetch variables that will only be written to.

The prefetch should be done as early as possible: getting values from memory takes a long time.

Prefetching too early, however, will mean that other accesses might flush the prefetched data from the cache.

Memory accesses may take 50 processor clock cycles or more.

    for (i = 0; i < n; ++i) {
        a[i] = b[i] * c[i];
        prefetch(&b[i+1]);
        prefetch(&c[i+1]);
        /* more code */
    }
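
As a concrete point of reference (not from the slides): GCC and Clang expose this hint as the __builtin_prefetch intrinsic, and in practice the prefetch is usually issued several iterations ahead rather than one, for example:

    for (i = 0; i < n; ++i) {
        a[i] = b[i] * c[i];
        __builtin_prefetch(&b[i+8]);   /* hint: b[i+8] will be needed soon */
        __builtin_prefetch(&c[i+8]);
    }

Prefetching past the end of the arrays is harmless here, since the hint does not fault on invalid addresses.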

Page 23:

Software Pipelining

Takes advantage of pipelined processor architectures.

Its effects are similar to prefetching.

Order instructions so that values that are "cold" are accessed first; their memory loads will then be in the pipeline, and instructions involving "hot" values can complete while the earlier loads are waiting.

Page 24:

Software Pipelining (continued)

These two codes accomplish the same task.

The second, however, uses software pipelining to fetch the needed data from main memory earlier, so that later instructions that use the data will spend less time stalled.

    I
    for (i = 0; i < n; ++i) {
        a[i] = b[i] + c[i];
    }

    II
    se = b[0];
    te = c[0];
    for (i = 0; i < n-1; ++i) {
        so = b[i+1];          /* start the next iteration's loads early...    */
        to = c[i+1];
        a[i] = se + te;       /* ...while this sum uses values already loaded */
        se = so;
        te = to;
    }
    a[n-1] = se + te;

Page 25:

Loop Blocking

Reorder loop iterations so as to operate on all the data in a cache line at once, so that it needs to be brought in from memory only once.

For instance, if an algorithm calls for iterating down the columns of an array in a row-major language, do multiple columns at a time. The number of columns should be chosen to cover a full cache line.

Page 26:

Loop Blocking (continued)

These codes perform a straightforward matrix multiplication r = a*b.

The second code takes advantage of spatial locality by operating on entire cache lines at once instead of single elements.

    // r has been set to 0 previously.
    // line size is 4*sizeof(a[0][0]).
    // n is assumed to be a multiple of 4.

    I
    for (i = 0; i < n; ++i)
        for (j = 0; j < n; ++j)
            for (k = 0; k < n; ++k)
                r[i][j] += a[i][k] * b[k][j];

    II
    for (i = 0; i < n; ++i)
        for (j = 0; j < n; j += 4)
            for (k = 0; k < n; k += 4)       /* k advances by 4 so each product is counted once */
                for (l = 0; l < 4; ++l)
                    for (m = 0; m < 4; ++m)
                        r[i][j+l] += a[i][k+m] * b[k+m][j+l];

Page 27:

Loop Unrolling

Loop unrolling is a technique that is used in many different optimizations.

As related to cache, loop unrolling sometimes allows more effective use of software pipelining.
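
A minimal sketch of unrolling in C (illustrative, not from the slides): the loop body is replicated four times, which exposes independent loads and multiplies that the compiler or programmer can then schedule in a software-pipelined fashion.

    /* assumes n is a multiple of 4 */
    for (i = 0; i < n; i += 4) {
        a[i]   = b[i]   * c[i];
        a[i+1] = b[i+1] * c[i+1];
        a[i+2] = b[i+2] * c[i+2];
        a[i+3] = b[i+3] * c[i+3];
    }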

Page 28:

Loop Fusion

Combine loops that access the same data.

This leads to a single load of each memory address.

In the code below, version II will result in n fewer loads.

    I
    for (i = 0; i < n; ++i)
        a[i] += b[i];
    for (i = 0; i < n; ++i)
        a[i] += c[i];

    II
    for (i = 0; i < n; ++i)
        a[i] += b[i] + c[i];

Page 29:

Array Padding

Arrange data so that subsequent accesses to different data do not land in the same cache position.

In a 1-associative (direct-mapped) cache, the first example below will result in 2 cache misses per iteration.

The second will cause only 2 cache misses per 4 iterations.

    // cache size is 1M
    // line size is 32 bytes
    // double is 8 bytes

    I
    int size = 1024*1024;
    double a[size], b[size];
    for (i = 0; i < size; ++i) {
        a[i] += b[i];
    }

    II
    int size = 1024*1024;
    double a[size], pad[4], b[size];   /* pad[] is never touched; it only shifts
                                          b by one cache line relative to a    */
    for (i = 0; i < size; ++i) {
        a[i] += b[i];
    }

Page 30:

Array Merging

Merge arrays so that data that needs to be accessed at once is stored together.

This can be done using a struct (II) or some appropriate addressing into a single large array (III).

    I
    double a[n], b[n], c[n];
    for (i = 0; i < n; ++i)
        a[i] = b[i] * c[i];

    II
    struct { double a, b, c; } data[n];
    for (i = 0; i < n; ++i)
        data[i].a = data[i].b * data[i].c;

    III
    double data[3*n];
    for (i = 0; i < 3*n; i += 3)
        data[i] = data[i+1] * data[i+2];

Page 31:

Pitfalls and Gotchas

Basically, the pitfalls of memory access patterns are the inverse of the strategies for optimization.

There are also some gotchas that are unrelated to these techniques:
the associativity of the cache;
shared memory.

Sometimes an algorithm is just not cache friendly.

Page 32:

Problems From Associativity

When this problem shows itself is highly dependent on the cache hardware being used.

It does not exist in fully associative caches.

The simplest case to explain is a 1-associative (direct-mapped) cache: if the stride between addresses is a multiple of the cache size, only one cache position will be used.
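
A small sketch of the worst case in C (the 1 MB size and the names are assumptions): striding through an array in steps of exactly the cache size makes every access map to the same position in a direct-mapped cache, so each access evicts the line loaded by the previous one.

    #define CACHE_SIZE (1024*1024)        /* assumed: 1 MB direct-mapped cache       */
    #define STRIDE (CACHE_SIZE / 8)       /* elements per cache size, 8-byte doubles */

    double a[64 * STRIDE];
    double sum = 0.0;
    int i;

    /* Every access below maps to the same cache position: a miss every time. */
    for (i = 0; i < 64 * STRIDE; i += STRIDE)
        sum += a[i];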

Page 33:

Shared Memory

It is obvious that shared memory with high contention cannot be effectively cached.

However, it is not so obvious that unshared memory that is close to memory accessed by another processor is also problematic.

When laying out data, a complete cache line should be considered a single location and should not be shared.
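
A sketch of this layout rule in C (the 64-byte line size and the names are assumptions): per-processor counters are padded out to a full cache line so that two processors never write to the same line, avoiding what is commonly called false sharing.

    #define LINE_SIZE 64                  /* assumed cache-line size in bytes */
    #define NUM_CPUS  4                   /* illustrative */

    /* Bad: counters for different processors share a cache line. */
    long counter[NUM_CPUS];

    /* Better: each counter owns a whole cache line. */
    struct padded_counter {
        long value;
        char pad[LINE_SIZE - sizeof(long)];
    };
    struct padded_counter counter2[NUM_CPUS];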

Page 34:

Optimization Wrapup

Only try these techniques once the best algorithm has been selected; cache optimizations will not result in an asymptotic speedup.

If the problem is too large to fit in memory, or in memory local to a compute node, many of these techniques may be applied to speed up accesses to even more remote storage.

Page 35:

Case Study: Cache Design for Embedded Real-Time Systems

Based on the paper presented at the Embedded Systems Conference, Summer 1999, by Bruce Jacob, ECE, University of Maryland at College Park.

Page 36:

Case Study (continued)

Cache is good for embedded hardware architectures but ill-suited for software architectures.

Real-time systems disable caching and schedule tasks based on worst-case memory access time.

Page 37:

Case Study (continued)

Software-managed caches: the benefit of caching without the real-time drawbacks of hardware-managed caches.

Two primary examples: DSP-style (Digital Signal Processor) on-chip RAM and software-managed virtual caches.

Page 38:

DSP-style on-chip RAM

Forms a separate namespace from main memory.

Instructions and data only appear in this memory if software explicitly moves them there.

Page 39:

DSP-style on-chip RAM (continued)

DSP-style SRAM in a distinct namespace separate from main memory

Page 40:

DSP-style on-chip RAM (continued)

Suppose that the memory areas have the following sizes and correspond to the following ranges in the address space:

Page 41:

DSP-style on-chip RAM (continued)

If a system designer wants a certain function that is initially held in ROM to be located at the very beginning of the SRAM-1 array:

    void function();
    char *from = (char *)function;   /* in range 4000-5FFF     */
    char *to   = (char *)0x1000;     /* start of SRAM-1 array  */
    memcpy(to, from, FUNCTION_SIZE);
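
To actually run the relocated copy, the SRAM address would then be invoked through a function pointer; a hypothetical continuation (not in the original slides):

    void (*sram_function)(void) = (void (*)(void))0x1000;
    sram_function();    /* executes the copy now resident in SRAM-1 */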

Page 42:

DSP-style on-chip RAM (continued)

This software-managed cache organization works because DSPs typically do not use virtual memory. What does this mean? Is this "safe"?

The current trend is for embedded systems to look increasingly like desktop systems: address-space protection will be a future issue.

Page 43:

Software-Managed Virtual Caches

Make software responsible for cache fill and decouple it from the translation hardware. How?

Answer: use upcalls to the software that happen on cache misses: every cache miss interrupts the software and vectors to a handler that fetches the referenced data and places it into the cache.
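
A purely illustrative sketch of such an upcall handler in C (the handler, the miss-information structure, and the is_cacheable/cache_fill primitives are hypothetical, not an API from the paper):

    #define LINE_SIZE 64                        /* assumed line size */

    struct miss_info { unsigned long vaddr; };  /* provided by hardware on a miss */

    /* Primitives the real system would supply: */
    int  is_cacheable(unsigned long line_addr);
    void cache_fill(unsigned long line_addr, unsigned long nbytes);

    /* Vectored to on every cache miss: software decides what goes in the
       cache, so the timing of each memory reference stays predictable. */
    void cache_miss_handler(struct miss_info *miss)
    {
        unsigned long line = miss->vaddr & ~(unsigned long)(LINE_SIZE - 1);

        if (is_cacheable(line))
            cache_fill(line, LINE_SIZE);
        /* otherwise the reference is satisfied directly from main memory */
    }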

Page 44:

Software-Managed Virtual Caches (continued)

The use of software-managed virtual caches in a real-time system

Page 45:

Software-Managed Virtual Caches (continued)

Execution without cache: access is slow to every location in the system's address space.
Execution with a hardware-managed cache: statistically fast access time.
Execution with a software-managed cache:
* software determines what can and cannot be cached;
* access to any specific memory location is consistent (either always in cache or never in cache);
* faster: selected data accesses and instructions execute 10-100 times faster.

Page 46:

Cache in Future

Performance determined by memory system speed
Prediction and prefetching techniques
Changes to memory architecture

Page 47:

Prediction and Prefetching

Two main problems need to be solved:
memory bandwidth (DRAM, RAMBUS);
latency (RAMBUS and DRAM: ~60 ns).

For each access, the following access is stored in memory.

Page 48:

Issues with Prefetching

Accesses follow no strict patterns.

The access table may be huge.

Prediction must be speedy.

Page 49:

Issues with Prefetching (continued)

Predict block addresses instead of individual ones.

Make requests as large as a cache line.

Store multiple guesses per block.

Page 50:

The Architecture

On-chip prefetch buffers
Prediction & prefetching
Address clusters
Block prefetch
Prediction cache
Method of prediction
Memory interleave

Page 51:

Effectiveness

Substantially reduced access time for large-scale programs.

Repeated large data structures.
Limited to one prediction scheme.
Can we predict the future 2-3 accesses?

Page 52:

Summary

Importance of Cache
System performance from past to present
Gone from CPU speed to memory

The youth of Cache
L1 to L2 and now L3

Optimization techniques
Can be tricky
Can be applied to access remote storage

Page 53:

Summary Continued …

Software- and hardware-based cache
Software: consistent, and fast for certain accesses
Hardware: not so consistent, with little or no control over the decision to cache

AMD announces Dual Core technology for '05

Page 54:

References

Websites:
Computer World: http://www.computerworld.com/
Intel Corporation: http://www.intel.com/
SLCentral: http://www.slcentral.com/

Page 55:

References (continued)

Publications:
[1] Thomas Alexander. A Distributed Predictive Cache for High Performance Computer Systems. PhD thesis, Duke University, 1995.
[2] O.L. Astrachan and M.E. Stickel. Caching and lemmatizing in model elimination theorem provers. In Proceedings of the Eleventh International Conference on Automated Deduction. Springer Verlag, 1992.
[3] J.-L. Baer and T.-F. Chen. An effective on-chip preloading scheme to reduce data access penalty. Supercomputing '91, 1991.
[4] A. Borg and D.W. Wall. Generation and analysis of very long address traces. 17th ISCA, May 1990.
[5] J.V. Briner, J.L. Ellis, and G. Kedem. Breaking the Barrier of Parallel Simulation of Digital Systems. Proc. 28th Design Automation Conf., June 1991.

Page 56:

References (continued)

Publications:
[6] H.O. Bugge, E.H. Kristiansen, and B.O. Bakka. Trace-driven simulation for a two-level cache design on the open bus system. 17th ISCA, May 1990.
[7] Tien-Fu Chen and J.-L. Baer. A performance study of software and hardware data prefetching schemes. Proceedings of the 21st International Symposium on Computer Architecture, 1994.
[8] R.F. Cmelik and D. Keppel. SHADE: A fast instruction set simulator for execution profiling. Sun Microsystems, 1993.
[9] K.I. Farkas, N.P. Jouppi, and P. Chow. How useful are non-blocking loads, stream buffers and speculative execution in multiple issue processors. Proceedings of the 1995 1st IEEE Symposium on High Performance Computer Architecture, 1995.

Page 57:

References (continued)

Publications:
[10] J.W.C. Fu, J.H. Patel, and B.L. Janssens. Stride directed prefetching in scalar processors. SIGMICRO Newsletter, vol. 23, no. 1-2, pp. 102-110, December 1992.
[11] E.H. Gornish. Adaptive and Integrated Data Cache Prefetching for Shared-Memory Multiprocessors. PhD thesis, University of Illinois at Urbana-Champaign, 1995.
[12] M.S. Lam. Locality optimizations for parallel machines. Proceedings of the International Conference on Parallel Processing: CONPAR '94, 1994.
[13] M.S. Lam, E.E. Rothberg, and M.E. Wolf. The cache performance and optimization of block algorithms. ASPLOS IV, April 1991.
[14] MCNC. Open Architecture Silicon Implementation Software User Manual. MCNC, 1991.
[15] T.C. Mowry, M.S. Lam, and A. Gupta. Design and Evaluation of a Compiler Algorithm for Prefetching. ASPLOS V, 1992.

Page 58:

References (continued)

Publications:
[16] Betty Prince. Memory in the fast lane. IEEE Spectrum, February 1994.
[17] Ramtron. Speciality Memory Products. Ramtron, 1995.
[18] A.J. Smith. Cache memories. Computing Surveys, September 1982.
[19] The SPARC Architecture Manual, 1992.
[20] W. Wang and J. Baer. Efficient trace-driven simulation methods for cache performance analysis. ACM Transactions on Computer Systems, August 1991.
[21] Wm. A. Wulf and Sally A. McKee. Hitting the Memory Wall: Implications of the Obvious. Computer Architecture News, December 1994.