Software Methods to Increase Data Cache Performance
Presented by Philip Marshall
Outline
Introduction
Example: Multiple Vector Additions
Example: Linked List
Example: Binary Tree
Conclusion
Introduction
Cache hit time is critical to system performance
  Often determines a processor’s clock period
  Cache controllers must be as simple as possible
The miss rate of a cache can be decreased if we know something about the access patterns
If software uses better access patterns, or hints at how the cache can best be used, we can improve performance
Introduction
Various methods can be used:
  Loop Fusion: combine multiple loops that access the same elements
  Array Merge: combine multiple arrays to increase spatial locality
  Cache Prefetch: ask for values to be loaded into cache in advance
  Cache Bypass: prevent certain accesses from allocating in the cache
Vector Addition – Base Code
#define SIZE_N 1024
int a[SIZE_N], b[SIZE_N], c[SIZE_N];
int s1[SIZE_N], s2[SIZE_N];
for (int i = 0; i < SIZE_N; i++) s1[i] = a[i] + b[i];
for (int i = 0; i < SIZE_N; i++) s2[i] = a[i] + c[i];
Vector Addition – Base Code
Assume a perfect instruction cache
Ignore conflict data misses
Assume a cache line size of 4 words
Assume write miss penalties can be hidden
First loop:
  a, b: 256 misses each (every 4th access)
Second loop:
  a, c: 256 misses each, unless the cache is large enough to hold the entire a and b arrays
1024 total misses
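The miss counts above can be reproduced with a few lines of arithmetic. This is a minimal sketch under the slide's assumptions (sequential access, no conflict misses, a cache too small to keep a between the two loops); the function names are made up for illustration.

```c
#include <stddef.h>

/* Misses for one sequential pass over an array:
   one miss per cache line touched (rounded up). */
size_t sequential_misses(size_t n_words, size_t words_per_line)
{
    return (n_words + words_per_line - 1) / words_per_line;
}

/* Base code: the first loop reads a and b, the second reads a and c.
   Assumes a is no longer cached when the second loop starts. */
size_t base_code_misses(size_t n_words, size_t words_per_line)
{
    size_t per_array = sequential_misses(n_words, words_per_line);
    return 2 * per_array    /* first loop: a, b  */
         + 2 * per_array;   /* second loop: a, c */
}
```

With SIZE_N = 1024 and 4-word lines, sequential_misses gives 256 per array and base_code_misses gives 1024.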
Vector Addition – Loop Fusion
#define SIZE_N 1024
int a[SIZE_N], b[SIZE_N], c[SIZE_N], s1[SIZE_N], s2[SIZE_N];
for (int i = 0; i < SIZE_N; i++) {
    s1[i] = a[i] + b[i];
    s2[i] = a[i] + c[i];
}
Vector Addition – Loop Fusion
a, b, c: 256 misses each
768 total misses
Are there always loops that can be combined?
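Fusion is only legal when the combined loop computes the same results as the separate loops. Here it is safe because neither loop writes anything the other reads. The sketch below checks that equivalence for the vector example; the function names and the test data are assumptions for illustration.

```c
#include <string.h>

#define SIZE_N 1024

/* Two separate passes: a is streamed through the cache twice. */
void add_separate(const int *a, const int *b, const int *c, int *s1, int *s2)
{
    for (int i = 0; i < SIZE_N; i++) s1[i] = a[i] + b[i];
    for (int i = 0; i < SIZE_N; i++) s2[i] = a[i] + c[i];
}

/* One fused pass: each a[i] is reused while its line is still cached. */
void add_fused(const int *a, const int *b, const int *c, int *s1, int *s2)
{
    for (int i = 0; i < SIZE_N; i++) {
        s1[i] = a[i] + b[i];
        s2[i] = a[i] + c[i];
    }
}

/* Returns 1 if both versions produce identical outputs on sample data. */
int fusion_matches(void)
{
    static int a[SIZE_N], b[SIZE_N], c[SIZE_N];
    static int s1[SIZE_N], s2[SIZE_N], f1[SIZE_N], f2[SIZE_N];
    for (int i = 0; i < SIZE_N; i++) {
        a[i] = i;
        b[i] = 2 * i;
        c[i] = 3 * i + 1;
    }
    add_separate(a, b, c, s1, s2);
    add_fused(a, b, c, f1, f2);
    return memcmp(s1, f1, sizeof s1) == 0 && memcmp(s2, f2, sizeof s2) == 0;
}
```

If the second loop read s1[i+1], for example, the fused loop would see a value the first loop had not yet written, and fusion would change the result.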
Vector Addition – Array Merge
#define SIZE_N 1024
struct vectors_type {
    int a;
    int b;
    int c;
};
int s1[SIZE_N], s2[SIZE_N];
struct vectors_type vectors[SIZE_N];
for (int i = 0; i < SIZE_N; i++) {
    s1[i] = vectors[i].a + vectors[i].b;
    s2[i] = vectors[i].a + vectors[i].c;
}
Vector Addition – Array Merge
3072 accesses, every 4th one misses
768 misses
May not be a viable optimization method in all cases
  If we have a large set of vectors and want to be able to add any two
  Dynamic memory allocation
  What if we only want to traverse one vector?
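The last question can be made concrete by counting cache lines. The sketch below assumes the merged array starts on a line boundary and that a word is sizeof(int) bytes; the function names are hypothetical. It shows that a pass over only the a field of the merged array still touches every line.

```c
#include <stddef.h>

#define SIZE_N 1024
#define LINE_WORDS 4   /* line size assumed on the slides */

struct vectors_type { int a; int b; int c; };

/* Cache lines touched by one pass over the whole merged array. */
size_t merged_lines(void)
{
    size_t line_bytes = LINE_WORDS * sizeof(int);
    size_t total_bytes = SIZE_N * sizeof(struct vectors_type);
    return (total_bytes + line_bytes - 1) / line_bytes;
}

/* Cache lines touched when only vectors[i].a is read. */
size_t lines_for_field_a(void)
{
    size_t words_per_elem = sizeof(struct vectors_type) / sizeof(int);
    size_t count = 0;
    size_t last = (size_t)-1;               /* no line seen yet */
    for (size_t i = 0; i < SIZE_N; i++) {
        size_t line = (i * words_per_elem) / LINE_WORDS;
        if (line != last) {                 /* first touch of a new line */
            count++;
            last = line;
        }
    }
    return count;
}
```

Both functions return 768, whereas a separate a array would cost only 256 misses for the same traversal. That is the price of merging when a single vector is used on its own.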
Vector Addition – Prefetch
Speculatively load data into cache before we need it
Useful if we know which data we need far enough in advance
Assume prefetch is useful if we know the address 10 iterations in advance
Assume prefetch past end of array is non-faulting
Vector Addition – Prefetch
#define SIZE_N 1024
int a[SIZE_N], b[SIZE_N], c[SIZE_N], s1[SIZE_N], s2[SIZE_N];
for (int i = 0; i < SIZE_N; i++) {
    s1[i] = a[i] + b[i];
    s2[i] = a[i] + c[i];
    /* prefetch() takes an address and is assumed non-faulting past the end of the array */
    prefetch(&a[i+10]);
    prefetch(&b[i+10]);
    prefetch(&c[i+10]);
}
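The prefetch() call on the slide is generic. One real counterpart is the __builtin_prefetch builtin in GCC and Clang, which the sketch below uses; this is an assumption about the toolchain, not the presenter's implementation. The function name, the PREFETCH_DISTANCE constant, and the bounds check are added for illustration.

```c
#include <stddef.h>

#define PREFETCH_DISTANCE 10   /* iterations ahead, as assumed on the slides */

/* Fused vector addition with software prefetch (GCC/Clang builtin). */
void add_with_prefetch(const int *a, const int *b, const int *c,
                       int *s1, int *s2, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* The bounds check keeps the address expression inside the arrays;
           the prefetch itself is only a hint and never changes results. */
        if (i + PREFETCH_DISTANCE < n) {
            __builtin_prefetch(&a[i + PREFETCH_DISTANCE]);
            __builtin_prefetch(&b[i + PREFETCH_DISTANCE]);
            __builtin_prefetch(&c[i + PREFETCH_DISTANCE]);
        }
        s1[i] = a[i] + b[i];
        s2[i] = a[i] + c[i];
    }
}
```

Because the builtin is a hint, the right distance has to be found by measurement on the target machine.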
Vector Addition – Prefetch
Only 30 misses
3072 prefetch instructions issued
Does the cost outweigh the benefit?
  768 - 30 = 738 fewer misses
  Miss cost only needs to be about 4.2 cycles (3072 / 738, assuming one cycle per prefetch instruction) for prefetch to be worthwhile
  Multiple issue processors can help hide the cost of issuing prefetches
Improves performance even if we’re only adding 2 vectors
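The break-even figure follows from dividing the cycles spent on prefetch instructions by the number of misses they remove. A minimal sketch, assuming a fixed cost per prefetch instruction (the function name and parameters are illustrative):

```c
/* Miss penalty (in cycles) above which prefetching pays off.
   Assumes every prefetch instruction costs cycles_per_prefetch cycles
   and every removed miss saves the full miss penalty. */
double break_even_miss_cost(long prefetches, long misses_before,
                            long misses_after, double cycles_per_prefetch)
{
    long misses_saved = misses_before - misses_after;
    return (double)prefetches * cycles_per_prefetch / (double)misses_saved;
}
```

With 3072 prefetches, 768 misses before, 30 after, and one cycle per prefetch, the result is roughly 4.16 cycles, which the slide rounds to 4.2.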
Vector Addition – Prefetch
Do we want a special load instruction that prefetches several blocks ahead?
  Reduces instruction count
  Works in the case of sequential access, but what if we want to prefetch from non-contiguous locations?
Vector Addition – Cache Bypass
Assume a 2-line fully associative cache with a 4-word line size
for (int i = 0; i < SIZE_N; i++) {
    s1[i] = a[i] + b[i];
    s2[i] = a[i] + c[i];
}
Assume write non-allocate
Very worst case: cache always misses (4096 misses)
If we use LRU and write our assembly so that a is always in cache:
  2048 misses for b[i] and c[i] + 256 misses for a[i] = 2304
If we use non-caching reads for c[i]:
  1024 misses for c[i]; a[i] and b[i] 256 misses each: 1536 total
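The two LRU figures can be checked with a tiny cache model. The sketch below assumes a two-line fully associative cache with LRU replacement, 4-word lines, reads issued in the order a[i], b[i], a[i], c[i], and writes that never allocate. The simulator and its names are illustrative, not from the slides.

```c
#define SIZE_N 1024
#define LINE_WORDS 4
#define NUM_LINES 2   /* assumed: two lines, fully associative */

struct cache {
    long tag[NUM_LINES];   /* line identifiers; -1 means empty */
    int mru;               /* index of the most recently used line */
};

/* Returns 1 on a miss. A bypassing read (allocate == 0) always counts
   as a miss and leaves the cache contents untouched. */
static int cache_read(struct cache *c, long line, int allocate)
{
    if (!allocate)
        return 1;
    for (int w = 0; w < NUM_LINES; w++) {
        if (c->tag[w] == line) {
            c->mru = w;
            return 0;
        }
    }
    int victim = 1 - c->mru;   /* with two lines, LRU is the non-MRU line */
    c->tag[victim] = line;
    c->mru = victim;
    return 1;
}

/* Counts read misses for the fused loop; bypass_c selects non-caching
   reads for c[i]. */
long count_misses(int bypass_c)
{
    struct cache c = { { -1, -1 }, 0 };
    const long lines_per_array = SIZE_N / LINE_WORDS;
    long misses = 0;
    for (long i = 0; i < SIZE_N; i++) {
        long la = i / LINE_WORDS;               /* line holding a[i] */
        long lb = lines_per_array + la;         /* line holding b[i] */
        long lc = 2 * lines_per_array + la;     /* line holding c[i] */
        misses += cache_read(&c, la, 1);        /* s1[i] = a[i] + b[i] */
        misses += cache_read(&c, lb, 1);
        misses += cache_read(&c, la, 1);        /* s2[i] = a[i] + c[i] */
        misses += cache_read(&c, lc, !bypass_c);
    }
    return misses;
}
```

Under these assumptions count_misses(0) gives 2304 and count_misses(1) gives 1536, matching the slide. Bypassing c stops it from evicting b, so b keeps its spatial locality.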
Linked List
Suppose we are sequentially traversing a linked list
We can prefetch the next several items
Calculating addresses repeatedly could be expensive (requires multiple memory accesses)
  Use 2 pointers: one for prefetch
Linked List – Base Code
struct linked {
    int data;
    struct linked *next;
};
struct linked *start;
struct linked *temp = start;
int a[SIZE], index = 0;
while (temp) {   /* visit every node, including the last one */
    a[index++] = temp->data;
    temp = temp->next;
}
Linked List – Prefetch
struct linked {
    int data;
    struct linked *next;
};
struct linked *start;
struct linked *temp = start, *temp2 = start;
int a[SIZE], index = 0;
/* Move the prefetch pointer up to 10 nodes ahead; stop early on a short list */
for (int i = 0; i < 10 && temp2 && temp2->next; i++)
    temp2 = temp2->next;
while (temp) {
    a[index++] = temp->data;
    temp = temp->next;
    if (temp2 && temp2->next) {
        temp2 = temp2->next;
        prefetch(temp2->next);
    }
}
Linked List – Prefetch
Instead of every element potentially missing the cache, only the first 10 do
If prefetch takes longer to complete, more cache space is necessary
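The two-pointer traversal can be written as a self-contained function. The sketch below uses the GCC/Clang __builtin_prefetch builtin in place of the generic prefetch(); the function name, the const qualifiers, and the max bound are assumptions added for safety. The prefetch pointer still has to read each node's next field, so how much latency is actually hidden depends on the hardware.

```c
#include <stddef.h>

#define PREFETCH_DISTANCE 10   /* nodes ahead, as on the slides */

struct linked {
    int data;
    struct linked *next;
};

/* Copies list data into out (at most max values) while a second pointer
   runs ahead issuing prefetches. Returns the number of values copied. */
size_t copy_list_prefetch(const struct linked *start, int *out, size_t max)
{
    const struct linked *temp = start;    /* node being processed */
    const struct linked *ahead = start;   /* prefetch pointer */
    size_t index = 0;

    /* Run ahead, stopping early if the list is short. */
    for (int i = 0; i < PREFETCH_DISTANCE && ahead && ahead->next; i++)
        ahead = ahead->next;

    while (temp && index < max) {
        out[index++] = temp->data;
        temp = temp->next;
        if (ahead && ahead->next) {
            ahead = ahead->next;
            __builtin_prefetch(ahead->next);   /* hint only; a null address does not fault */
        }
    }
    return index;
}
```

The copy is correct with or without the prefetch lines, so the hint can be tuned or removed without affecting results.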
Binary Tree
Suppose we are traversing a binary tree where we can’t easily predict which branch we’ll access next. Is prefetch useful?
We can speculatively prefetch all values
  How far down the tree?
  Cache pollution
May be valuable to speculatively fetch the next two possible elements if we can do useful work until the prefetch completes (i.e., if it takes enough cycles to determine which branch to take)
Binary Tree
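struct node {
    int data;
    struct node *left, *right;
};
struct node *top;
struct node *temp = top;
int search_value, found = 0;
do {
    if (temp->left) prefetch(temp->left);
    if (temp->right) prefetch(temp->right);
    /* next_node() is not defined on the slide; it is assumed to pick the
       next child and set found when the search ends */
    temp = next_node(temp, search_value, &found);
} while (!found);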
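For a concrete version of the loop above, the sketch below assumes the tree is a binary search tree ordered by data and replaces the undefined next_node() with a plain comparison. It uses the GCC/Clang __builtin_prefetch builtin; the function name is hypothetical. With a comparison this cheap there is little work to overlap with the prefetch, so the benefit is likely to show only when choosing a branch is expensive.

```c
#include <stddef.h>

struct node {
    int data;
    struct node *left, *right;
};

/* Searches a binary search tree, prefetching both children of the current
   node before deciding which one to follow. Returns the matching node,
   or NULL if search_value is not present. */
const struct node *bst_find_prefetch(const struct node *top, int search_value)
{
    const struct node *temp = top;
    while (temp && temp->data != search_value) {
        /* Speculative: one of these two lines is fetched for nothing. */
        __builtin_prefetch(temp->left);
        __builtin_prefetch(temp->right);
        temp = (search_value < temp->data) ? temp->left : temp->right;
    }
    return temp;
}
```

Prefetching only one level down bounds the cache pollution to one wasted line per step.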
Conclusion
Some methods improve contrived cases, but are they always useful?
  Loop fusion
  Array merge
Prefetch works well for predictable access patterns
  Dynamic memory and pointers?
  Is prefetch worthwhile for large block size and random access of small elements?
Conclusion
Cache miss time measured in clock cycles is increasing
  Requires prefetch farther ahead: larger caches
Software methods are static
  Low cost of implementation
  Potentially pipeline independent