software methods to increase data cache performance presented by philip marshall

Software Methods to Increase Data Cache Performance

Presented by Philip Marshall

Outline

Introduction Example: Multiple Vector Additions Example: Linked List Example: Binary Tree Conclusion

Introduction

Cache hit time is critical to system performance Often determines a processor’s clock period Cache controllers must be as simple as possible

The miss rate of a cache can be decreased if we know something about the access patterns

If we use software to use better access patterns or hint at how the cache can best be used, we can improve performance

Introduction

Various methods can be used: Loop Fusion – combine multiple loops that access the

same elements Array Merge – combine multiple arrays to increase

spatial locality Cache Prefetch – ask for values to be loaded into

cache in advance Cache Bypass – prevent certain accesses from

allocating in the cache

Vector Addition – Base Code

#define SIZE_N 1024

int a[SIZE_N], b[SIZE_N], c[SIZE_N];int s1[SIZE_N], s2[SIZE_N];

for (int i = 0; i < SIZE_N; i++) s1[i] = a[i] + b[i];

for (int i = 0; i < SIZE_N; i++) s2[i] = a[i] + c[i];

Vector Addition – Base Code

Assume a perfect instruction cache Ignore conflict data misses Assume a cache line size of 4 words Assume write miss penalties can be hidden First loop:

a, b: 256 misses each (every 4th access) Second loop:

a, c: 256 misses each unless cache is large enough to hold entire a and b arrays

1024 total misses

Vector Addition – Loop Fusion

#define SIZE_N 1024int a[SIZE_N], b[SIZE_N], c[SIZE_N], s1[SIZE_N], s2[SIZE_N];

for (int i = 0; i < SIZE_N; i++){ s1[i] = a[i] + b[i]; s2[i] = a[i] + c[i];}

Vector Addition – Loop Fusion

a, b, c: 256 misses each 768 total misses Are there always loops that can be

combined?

Vector Addition – Array Merge

#define SIZE_N 1024struct vectors_type{ int a; int b; int c;}int s1[SIZE_N], s2[SIZE_N];

vectors_type vectors[SIZE_N];

for (int i = 0; i < SIZE_N; i++){ s1[i] = vectors[i].a + vectors[i].b; s2[i] = vectors[i].a + vectors[i].c;}

Vector Addition – Array Merge

3072 accesses, every 4th one misses 768 misses May not be a viable optimization method in

all cases If we have a large set of vectors and want to

be able to add any twoDynamic memory allocationWhat if we only want to traverse one vector?

Vector Addition – Prefetch

Speculatively load data into cache before we need it

Useful if we know which data we need far enough in advance

Assume prefetch is useful if we know the address 10 iterations in advance

Assume prefetch past end of array is non-faulting


#define SIZE_N 1024int a[SIZE_N], b[SIZE_N], c[SIZE_N], s1[SIZE_N],

s2[SIZE_N]

for (int i = 0; i < SIZE_N; i++){ s1[i] = a[i] + b[i]; s2[i] = a[i] + c[i];

prefetch(a[i+10]); prefetch(b[i+10]); prefetch(c[i+10]);}


Only 30 misses 3072 prefetch instructions issued Does the cost outweigh the benefit?

768 – 30 = 738 fewer misses Miss cost only needs to be 4.2 cycles for prefetch be

worthwhile Multiple issue processors can help hide the cost of

issuing prefetches Improves performance even if we’re only adding

2 vectors


Do we want a special load instruction that prefetches several blocks ahead?Reduces instruction countWorks in the case of sequential access, but

what if we want to prefetch from non-contiguous locations?

Vector Addition – Cache Bypass

Assume a 2-set fully associative cache with 4 word line size

for (int i = 0; i < SIZE_N; i++){ s1[i] = a[i] + b[i]; s2[i] = a[i] + c[i];}

Assume write non-allocate Very worst case: cache always misses (4096 misses) If we use LRU and write our assembly so that a is

always in cache: 2048 misses for b[i] and c[i] + 256 misses for a[i] = 2304

If we use non-caching reads for c[i]: 1024 misses a[i] and b[i] 256 misses each: 1536 total

Linked List

Suppose we are sequentially traversing a linked list

We can prefetch the next several items Calculating addresses repeatedly could be

expensive (requires multiple memory accesses)

Use 2 pointers: one for prefetch

Linked List – Base Codestruct linked{ int data; *linked next;}*linked start;

*linked temp = start;int a[SIZE], index=0;while (temp->next){ a[index++] = temp->data; temp = temp->next;}

Linked List – Prefetchstruct linked{ int data; *linked next;}*linked start;

*linked temp = start, temp2 = start;int a[SIZE], index=0;for (int i = 0; i < 10; i++) temp2 = temp2->next;

while (temp->next){ a[index++] = temp->data; temp = temp->next; if (temp2->next){ temp2 = temp2->next; prefetch(temp2->next); }}

Linked List – Prefetch

Instead of every element potentially missing the cache, only the first 10 do

If prefetch takes longer to complete, more cache space is necessary

Binary Tree

Suppose we are traversing a binary tree where we can’t easily predict which branch we’ll access next. Is prefetch useful?

We can speculatively prefetch all values How far down tree? Cache Pollution

May be valuable to speculatively fetch next two possible elements if we can do useful work until the prefetch completes (ie, if it takes enough cycles to determine which branch to take)

Binary Tree

struct node{ int data; *node left, right;}*node top;

*node temp = top;int search_value, found=0;

do{ if (temp->left) prefetch(temp->left); if (temp->right) prefetch(temp->right); temp = next_node(temp, search_value, &found);}until (found);

Conclusion

Some methods improve contrived cases, but are they always useful? Loop fusion Array merge

Prefetch works well for predictable access patterns Dynamic memory and pointers? Is prefetch worthwhile for large block size and random

access of small elements?

Conclusion

Cache miss time measured in clock cycles is increasingRequires prefetch farther ahead – larger

caches Software methods are static

Low cost of implementationPotentially pipeline independent

software methods to increase data cache performance presented by philip marshall

Documents

n i s2i

n i s1i

int b int c

nfor int

int asize

cache line size

ai bifor int

vector addition prefetchonly