Software Methods to Increase Data Cache Performance
Presented by Philip Marshall

Page 1: Software Methods to Increase Data Cache Performance

Presented by Philip Marshall

Page 2: Outline

Introduction
Example: Multiple Vector Additions
Example: Linked List
Example: Binary Tree
Conclusion

Page 3: Introduction

Cache hit time is critical to system performance
It often determines a processor’s clock period
Cache controllers must be as simple as possible

The miss rate of a cache can be decreased if we know something about the access patterns

If we use software to create better access patterns, or to hint at how the cache can best be used, we can improve performance

Page 4: Introduction

Various methods can be used:
  Loop Fusion – combine multiple loops that access the same elements
  Array Merge – combine multiple arrays to increase spatial locality
  Cache Prefetch – ask for values to be loaded into cache in advance
  Cache Bypass – prevent certain accesses from allocating in the cache

Page 5: Vector Addition – Base Code

#define SIZE_N 1024

int a[SIZE_N], b[SIZE_N], c[SIZE_N];
int s1[SIZE_N], s2[SIZE_N];

for (int i = 0; i < SIZE_N; i++)
    s1[i] = a[i] + b[i];

for (int i = 0; i < SIZE_N; i++)
    s2[i] = a[i] + c[i];

Page 6: Vector Addition – Base Code

Assume a perfect instruction cache
Ignore conflict data misses
Assume a cache line size of 4 words
Assume write miss penalties can be hidden

First loop:
  a, b: 256 misses each (every 4th access misses: 1024 elements / 4 words per line = 256 lines)
Second loop:
  a, c: 256 misses each, unless the cache is large enough to hold the entire a and b arrays

1024 total misses

Page 7: Vector Addition – Loop Fusion

#define SIZE_N 1024

int a[SIZE_N], b[SIZE_N], c[SIZE_N], s1[SIZE_N], s2[SIZE_N];

for (int i = 0; i < SIZE_N; i++)
{
    s1[i] = a[i] + b[i];
    s2[i] = a[i] + c[i];
}

Page 8: Vector Addition – Loop Fusion

a, b, c: 256 misses each
768 total misses
Are there always loops that can be combined?

Page 9: Vector Addition – Array Merge

#define SIZE_N 1024

struct vectors_type
{
    int a;
    int b;
    int c;
};

int s1[SIZE_N], s2[SIZE_N];
vectors_type vectors[SIZE_N];

for (int i = 0; i < SIZE_N; i++)
{
    s1[i] = vectors[i].a + vectors[i].b;
    s2[i] = vectors[i].a + vectors[i].c;
}

Page 10: Vector Addition – Array Merge

3072 accesses, every 4th one misses: 768 misses
May not be a viable optimization method in all cases:
  If we have a large set of vectors and want to be able to add any two
  Dynamic memory allocation
  What if we only want to traverse one vector?

Page 11: Vector Addition – Prefetch

Speculatively load data into cache before we need it

Useful if we know which data we need far enough in advance

Assume prefetch is useful if we know the address 10 iterations in advance

Assume prefetch past end of array is non-faulting

Page 12: Vector Addition – Prefetch

#define SIZE_N 1024

int a[SIZE_N], b[SIZE_N], c[SIZE_N], s1[SIZE_N], s2[SIZE_N];

for (int i = 0; i < SIZE_N; i++)
{
    s1[i] = a[i] + b[i];
    s2[i] = a[i] + c[i];
    /* request the data we will need 10 iterations from now */
    prefetch(a[i+10]);
    prefetch(b[i+10]);
    prefetch(c[i+10]);
}
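The prefetch() call above is slide pseudocode. One way to make this particular example compile, assuming the GCC/Clang __builtin_prefetch intrinsic (an assumption, not part of the original slides), is a thin wrapper macro:

/* Assumed mapping of the slide's prefetch() onto a compiler intrinsic
   (GCC/Clang only). __builtin_prefetch takes the address of the data to
   fetch and does not fault on invalid addresses, matching the
   non-faulting assumption stated earlier. */
#define prefetch(x) __builtin_prefetch(&(x))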

Page 13: Vector Addition – Prefetch

Only 30 misses
3072 prefetch instructions issued
Does the cost outweigh the benefit?
  768 – 30 = 738 fewer misses
  Miss cost only needs to be about 4.2 cycles for prefetch to be worthwhile (3072 prefetches / 738 avoided misses ≈ 4.2)
  Multiple-issue processors can help hide the cost of issuing prefetches
  Improves performance even if we’re only adding 2 vectors

Page 14: Vector Addition – Prefetch

Do we want a special load instruction that prefetches several blocks ahead?
  Reduces instruction count
  Works in the case of sequential access, but what if we want to prefetch from non-contiguous locations? (a rough C sketch follows below)
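As an illustration only (an assumption for this write-up, not a real ISA feature), such a combined "load and prefetch ahead" operation could be sketched as a helper built on the GCC/Clang __builtin_prefetch intrinsic:

/* Hypothetical load_ahead(): performs the ordinary load and also
   prefetches the element a fixed distance ahead. Illustration only. */
static inline int load_ahead(const int *a, int i, int dist)
{
    __builtin_prefetch(&a[i + dist]);   /* touch a future element's line */
    return a[i];                        /* the normal load */
}

One call per element replaces a separate load plus prefetch, but the future address must be a simple offset of the current one, which is exactly the sequential-access limitation noted above.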

Page 15: Vector Addition – Cache Bypass

Assume a fully associative cache with 2 lines and a 4-word line size

for (int i = 0; i < SIZE_N; i++)
{
    s1[i] = a[i] + b[i];
    s2[i] = a[i] + c[i];
}

Assume write non-allocate
Very worst case: the cache always misses (4096 misses)
If we use LRU and write our assembly so that a is always in cache: 2048 misses for b[i] and c[i] + 256 misses for a[i] = 2304 misses
If we use non-caching reads for c[i]: 1024 misses for c[i], plus 256 misses each for a[i] and b[i] = 1536 total
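A minimal sketch of the bypassing variant described above, assuming a hypothetical load_nocache() intrinsic that returns a value without allocating a line in the data cache (real equivalents are ISA-specific non-caching or non-temporal loads):

/* load_nocache() is assumed for illustration only. */
for (int i = 0; i < SIZE_N; i++)
{
    s1[i] = a[i] + b[i];                 /* a and b use the cache as before */
    s2[i] = a[i] + load_nocache(&c[i]);  /* c is read around the cache */
}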

Page 16: Linked List

Suppose we are sequentially traversing a linked list

We can prefetch the next several items
Calculating addresses repeatedly could be expensive (requires multiple memory accesses)

Use 2 pointers: one for prefetch

Page 17: Linked List – Base Code

struct linked
{
    int data;
    linked *next;
};
linked *start;

linked *temp = start;
int a[SIZE], index = 0;

while (temp->next)
{
    a[index++] = temp->data;
    temp = temp->next;
}

Page 18: Linked List – Prefetch

struct linked
{
    int data;
    linked *next;
};
linked *start;

linked *temp = start, *temp2 = start;
int a[SIZE], index = 0;

/* advance the prefetch pointer 10 nodes ahead (assumes the list has
   at least 10 nodes) */
for (int i = 0; i < 10; i++)
    temp2 = temp2->next;

while (temp->next)
{
    a[index++] = temp->data;
    temp = temp->next;
    if (temp2->next)
    {
        temp2 = temp2->next;
        prefetch(temp2->next);   /* prefetch the node 10 steps ahead */
    }
}

Page 19: Linked List – Prefetch

Instead of every element potentially missing the cache, only the first 10 do

If prefetch takes longer to complete, we must prefetch farther ahead, which requires more cache space

Page 20: Binary Tree

Suppose we are traversing a binary tree where we can’t easily predict which branch we’ll access next. Is prefetch useful?

We can speculatively prefetch all values
  How far down the tree?
  Cache pollution

May be valuable to speculatively fetch the next two possible elements if we can do useful work while the prefetch completes (i.e., if it takes enough cycles to determine which branch to take)

Page 21: Binary Tree

struct node
{
    int data;
    node *left, *right;
};
node *top;

node *temp = top;
int search_value, found = 0;

do
{
    /* speculatively prefetch both possible next nodes */
    if (temp->left) prefetch(temp->left);
    if (temp->right) prefetch(temp->right);
    temp = next_node(temp, search_value, &found);
} while (!found);
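next_node() is not defined on the slide. A minimal sketch of what it might look like for a binary search tree (an assumption for illustration; it ignores the case where the value is absent):

/* Assumed helper, not from the original slides: report whether the current
   node holds the value, otherwise step to the child that could contain it. */
node *next_node(node *current, int search_value, int *found)
{
    if (current->data == search_value)
    {
        *found = 1;
        return current;          /* stay on the matching node */
    }
    if (search_value < current->data)
        return current->left;    /* smaller values live in the left subtree */
    return current->right;       /* larger values live in the right subtree */
}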

Page 22: Conclusion

Some methods improve contrived cases, but are they always useful?
  Loop fusion
  Array merge

Prefetch works well for predictable access patterns
  Dynamic memory and pointers?
  Is prefetch worthwhile for large block sizes and random access of small elements?

Page 23: Conclusion

Cache miss time measured in clock cycles is increasing
  Requires prefetching farther ahead – larger caches

Software methods are static
  Low cost of implementation
  Potentially pipeline independent