CS 3214 Computer Systems Godmar Back Lecture 11

Page 1: CS 3214 Computer Systems Godmar Back Lecture 11. Announcements Stay tuned for Exercise 5 Project 2 due Sep 30 Auto-fail rule 2: –Need at least Firecracker

CS 3214 Computer Systems

Godmar Back

Lecture 11

Page 2:

Announcements

• Stay tuned for Exercise 5
• Project 2 due Sep 30
• Auto-fail rule 2:
  – Need at least Firecracker to blow up to pass class.

CS 3214 Fall 2010

Page 3:

CODE OPTIMIZATION, Part 4


Some of the following slides are taken with permission from Complete Powerpoint Lecture Notes for Computer Systems: A Programmer's Perspective (CS:APP)

Randal E. Bryant and David R. O'Hallaron

http://csapp.cs.cmu.edu/public/lectures.html

Page 4:

MEMORY HIERARCHIES

Page 5:

Locality

• Principle of Locality:
  – Programs tend to reuse data and instructions near those they have used recently, or that were recently referenced themselves.
  – Temporal locality: Recently referenced items are likely to be referenced in the near future.
  – Spatial locality: Items with nearby addresses tend to be referenced close together in time.

Locality Example:

sum = 0;
for (i = 0; i < n; i++)
    sum += a[i];
return sum;

• Data
  – Reference array elements in succession (stride-1 reference pattern): spatial locality
  – Reference sum each iteration: temporal locality
• Instructions
  – Reference instructions in sequence: spatial locality
  – Cycle through loop repeatedly: temporal locality

Page 6:

The CPU-Memory Gap

• The increasing gap between DRAM, disk, and CPU speeds.

[Chart: access time in ns (log scale, 1 to 100,000,000) vs. year (1980-2000) for disk seek time, DRAM access time, SRAM access time, and CPU cycle time.]

Page 7:

Memory Hierarchies

• Motivated by some fundamental and enduring properties of hardware and software:
  – Fast storage technologies cost more per byte and have less capacity.
  – The gap between CPU and main memory speed is widening.
  – Well-written programs tend to exhibit good locality.

• These fundamental properties complement each other beautifully.

• They suggest an approach for organizing memory and storage systems known as a memory hierarchy.

Page 8:


An Example Memory Hierarchy

Smaller, faster, and costlier (per byte) storage devices at the top; larger, slower, and cheaper (per byte) storage devices toward the bottom:

L0: registers — CPU registers hold words retrieved from the L1 cache.
L1: on-chip L1 cache (SRAM) — holds cache lines retrieved from the L2 cache.
L2: off-chip L2 cache (SRAM) — holds cache lines retrieved from main memory.
L3: main memory (DRAM) — holds disk blocks retrieved from local disks.
L4: local secondary storage (local disks) — holds files retrieved from disks on remote network servers.
L5: remote secondary storage (distributed file systems, Web servers)

Page 9:

Caches

• Cache: A smaller, faster storage device that acts as a staging area for a subset of the data in a larger, slower device.

• Fundamental idea of a memory hierarchy:
  – For each k, the faster, smaller device at level k serves as a cache for the larger, slower device at level k+1.
• Why do memory hierarchies work?
  – Programs tend to access the data at level k more often than they access the data at level k+1.
  – Thus, the storage at level k+1 can be slower, and thus larger and cheaper per bit.
  – Net effect: A large pool of memory that costs as much as the cheap storage near the bottom, but that serves data to programs at the rate of the fast storage near the top.

Page 10:

Caching in a Memory Hierarchy

[Figure: the larger, slower, cheaper storage device at level k+1 is partitioned into blocks 0-15. The smaller, faster, more expensive device at level k caches a subset of those blocks (here 8, 9, 14, and 3). Data is copied between levels in block-sized transfer units (blocks 4 and 10 are shown in transit).]

Page 11:

General Caching Concepts

• Program needs object d, which is stored in some block b.
• Cache hit
  – Program finds b in the cache at level k. E.g., block 14.
• Cache miss
  – b is not at level k, so the level k cache must fetch it from level k+1. E.g., block 12.
  – If the level k cache is full, then some current block must be replaced (evicted). Which one is the "victim"?
    • Placement policy: where can the new block go? E.g., b mod 4
    • Replacement policy: which block should be evicted? E.g., LRU

[Figure: a request for block 14 hits at level k; a request for block 12 misses, so block 12 is fetched from level k+1 and replaces block 4* at level k.]

Page 12:

Types of cache misses

• Cold (compulsory) miss
  – Cold misses occur because the cache is empty.
• Conflict miss
  – Most caches limit blocks at level k+1 to a small subset (sometimes a singleton) of the block positions at level k.
  – E.g., block i at level k+1 must be placed in block (i mod 4) at level k.
  – Conflict misses occur when the level k cache is large enough, but multiple data objects all map to the same level k block.
  – E.g., referencing blocks 0, 8, 0, 8, 0, 8, ... would miss every time.
• Capacity miss
  – Occurs when the set of active cache blocks (working set) is larger than the cache.

Page 13:

Cache Performance Metrics

• Miss Rate
  – Fraction of memory references not found in cache (misses/references)
  – Typical numbers: 3-10% for L1; can be quite small (e.g., < 1%) for L2, depending on size, etc.
• Hit Time
  – Time to deliver a line in the cache to the processor (includes time to determine whether the line is in the cache)
  – Typical numbers: 1 clock cycle for L1; 3-8 clock cycles for L2
• Miss Penalty
  – Additional time required because of a miss
  – Typically 25-100 cycles for main memory
• Q.: What is the average access time?

Page 14:

Direct-mapped vs. Set Associative Caches

• The more lines available to hold a block of data, the lower the chance of a conflict miss

• Direct-mapped caches– Exactly 1 location

• N-Way associative– Each set has N lines in which to hold block

• Fully associative– Block can be held in any line

Page 15:

Examples of Caching in the Hierarchy

Cache Type            What Cached           Where Cached         Latency (cycles)  Managed By
Registers             4-byte word           CPU registers        0                 Compiler
TLB                   Address translations  On-Chip TLB          0                 Hardware
L1 cache              32-byte block         On-Chip L1           1                 Hardware
L2 cache              32-byte block         Off-Chip L2          10                Hardware
Virtual Memory        4-KB page             Main memory          100               Hardware+OS
Buffer cache          Parts of files        Main memory          100               OS
Network buffer cache  Parts of files        Local disk           10,000,000        AFS/NFS client
Browser cache         Web pages             Local disk           10,000,000        Web browser
Web cache             Web pages             Remote server disks  1,000,000,000     Web proxy server

Page 16:

Locality Example

• Claim: Being able to look at code and get a qualitative sense of its locality is a key skill for a professional programmer.

• Question: Which of these functions has good locality?

int sumarraycols(int a[M][N])
{
    int i, j, sum = 0;
    for (j = 0; j < N; j++)
        for (i = 0; i < M; i++)
            sum += a[i][j];
    return sum;
}

int sumarrayrows(int a[M][N])
{
    int i, j, sum = 0;
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}

Page 17:

Writing Cache Friendly Code

• Repeated references to variables are good (temporal locality)

• Stride-1 reference patterns are good (spatial locality)
• Examples: cold cache, 4-byte words, 4-word cache blocks

int sumarrayrows(int a[M][N])
{
    int i, j, sum = 0;
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}

Miss rate = 1/4 = 25%

int sumarraycols(int a[M][N])
{
    int i, j, sum = 0;
    for (j = 0; j < N; j++)
        for (i = 0; i < M; i++)
            sum += a[i][j];
    return sum;
}

Miss rate = 100%

Page 18:

Locality Example (2)

• Question: Can you permute the loops so that the function scans the 3-d array a[] with a stride-1 reference pattern (and thus has good spatial locality)?

int sumarray3d(int a[M][N][N])
{
    int i, j, k, sum = 0;
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            for (k = 0; k < N; k++)
                sum += a[k][i][j];
    return sum;
}

Page 19:

Locality Example (3)

• Question: Which of these two exhibits better spatial locality?

// struct of arrays
struct soa {
    float *x;
    float *y;
    float *r;
};

compute_r(struct soa s) {
    for (i = 0; …) {
        s.r[i] = s.y[i] * s.y[i] + s.x[i] * s.x[i];
    }
}

// array of structs
struct soa {
    float x;
    float y;
    float r;
};

compute_r(struct soa *s) {
    for (i = 0; …) {
        s[i].r = s[i].x * s[i].x + s[i].y * s[i].y;
    }
}

Page 20:

Locality Example (4)

• Question: Which of these two exhibits better spatial locality?

// struct of arrays
struct soa {
    float *x;
    float *y;
    float *r;
};

sum_r(struct soa s) {
    sum = 0;
    for (i = 0; …) {
        sum += s.r[i];
    }
}

// array of structs
struct soa {
    float x;
    float y;
    float r;
};

sum_r(struct soa *s) {
    sum = 0;
    for (i = 0; …) {
        sum += s[i].r;
    }
}

Page 21:

Locality Example (5)

• Question: Which of these two exhibits better spatial locality?

// struct of arrays
struct soaa {
    float *x;
    float *y;
    float *r;
};

struct soa
get_xyr(struct soaa s, int i) {
    return (struct soa){ .x = s.x[i], .y = s.y[i], .r = s.r[i] };
}

// array of structs
struct soa {
    float x;
    float y;
    float r;
};

struct soa
get_xyr(struct soa *s, int i)
{
    return s[i];
}

Page 22:

The Memory Mountain

• Read throughput (read bandwidth)
  – Number of bytes read from memory per second (MB/s)
• Memory mountain
  – Measured read throughput as a function of spatial and temporal locality.
  – Compact way to characterize memory system performance.

Page 23:

Memory Mountain Test Function

/* The test function */
void test(int elems, int stride)
{
    int i, result = 0;
    volatile int sink;

    for (i = 0; i < elems; i += stride)
        result += data[i];
    sink = result;  /* So compiler doesn't optimize away the loop */
}

/* Run test(elems, stride) and return read throughput (MB/s) */
double run(int size, int stride, double Mhz)
{
    double cycles;
    int elems = size / sizeof(int);

    test(elems, stride);                     /* warm up the cache */
    cycles = fcyc2(test, elems, stride, 0);  /* call test(elems,stride) */
    return (size / stride) / (cycles / Mhz); /* convert cycles to MB/s */
}

Page 24:

Memory Mountain Main Routine

/* mountain.c - Generate the memory mountain. */
#define MINBYTES (1 << 10)  /* Working set size ranges from 1 KB */
#define MAXBYTES (1 << 23)  /* ... up to 8 MB */
#define MAXSTRIDE 16        /* Strides range from 1 to 16 */
#define MAXELEMS MAXBYTES/sizeof(int)

int data[MAXELEMS];  /* The array we'll be traversing */

int main()
{
    int size;    /* Working set size (in bytes) */
    int stride;  /* Stride (in array elements) */
    double Mhz;  /* Clock frequency */

    init_data(data, MAXELEMS);  /* Initialize each element in data to 1 */
    Mhz = mhz(0);               /* Estimate the clock frequency */
    for (size = MAXBYTES; size >= MINBYTES; size >>= 1) {
        for (stride = 1; stride <= MAXSTRIDE; stride++)
            printf("%.1f\t", run(size, stride, Mhz));
        printf("\n");
    }
    exit(0);
}

Page 25:

The Memory Mountain

[Chart: read throughput (MB/s, 0-1200) as a function of stride (s1-s15, in words) and working set size (2k-8m bytes). Machine: 550 MHz Pentium III Xeon, 16 KB on-chip L1 d-cache, 16 KB on-chip L1 i-cache, 512 KB off-chip unified L2 cache. Labeled features: ridges of temporal locality (L1, L2, mem) and slopes of spatial locality.]

Page 26:

Ridges of Temporal Locality

• Slice through the memory mountain with stride=1
  – illuminates read throughputs of different caches and memory

[Chart: read throughput (MB/s, 0-1200) vs. working set size (1k-8m bytes), showing the L1 cache region, L2 cache region, and main memory region.]

Page 27:

A Slope of Spatial Locality

• Slice through the memory mountain with size=256KB
  – shows cache block size.

[Chart: read throughput (MB/s, 0-800) vs. stride (s1-s16, in words); throughput falls with stride until there is one access per cache line.]

Page 28:

Matrix Multiplication Example

• Major Cache Effects to Consider
  – Total cache size
    • Exploit temporal locality and keep the working set small (e.g., by using blocking)
  – Block size
    • Exploit spatial locality
• Description:
  – Multiply N x N matrices
  – O(N^3) total operations
  – Accesses
    • N reads per source element
    • N values summed per destination
      – but may be able to hold in register

/* ijk */
for (i=0; i<n; i++) {
    for (j=0; j<n; j++) {
        sum = 0.0;               /* variable sum held in register */
        for (k=0; k<n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}

Page 29:

Miss Rate Analysis for Matrix Multiply

• Assume:
  – Line size = 32B (big enough for 4 64-bit words)
  – Matrix dimension (N) is very large
    • Approximate 1/N as 0.0
  – Cache is not even big enough to hold multiple rows
• Analysis Method:
  – Look at access pattern of inner loop

[Figure: matrix A indexed by (i, k), matrix B by (k, j), and matrix C by (i, j).]

Page 30:

Layout of C Arrays in Memory (review)

• C arrays allocated in row-major order
  – each row in contiguous memory locations
• Stepping through columns in one row:
  – for (i = 0; i < N; i++)
        sum += a[0][i];
  – accesses successive elements
  – if block size (B) > 4 bytes, exploit spatial locality
    • compulsory miss rate = 4 bytes / B
• Stepping through rows in one column:
  – for (i = 0; i < n; i++)
        sum += a[i][0];
  – accesses distant elements
  – no spatial locality!
    • compulsory miss rate = 1 (i.e. 100%)

Page 31:

Matrix Multiplication (ijk)

/* ijk */
for (i=0; i<n; i++) {
    for (j=0; j<n; j++) {
        sum = 0.0;
        for (k=0; k<n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}

Inner loop: row (i,*) of A accessed row-wise; column (*,j) of B accessed column-wise; element (i,j) of C fixed.

Misses per Inner Loop Iteration:
A       B       C
0.25    1.0     0.0

Page 32:

Matrix Multiplication (jik)

/* jik */
for (j=0; j<n; j++) {
    for (i=0; i<n; i++) {
        sum = 0.0;
        for (k=0; k<n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}

Inner loop: row (i,*) of A accessed row-wise; column (*,j) of B accessed column-wise; element (i,j) of C fixed.

Misses per Inner Loop Iteration:
A       B       C
0.25    1.0     0.0

Page 33:

Matrix Multiplication (kij)

/* kij */
for (k=0; k<n; k++) {
    for (i=0; i<n; i++) {
        r = a[i][k];
        for (j=0; j<n; j++)
            c[i][j] += r * b[k][j];
    }
}

Inner loop: element (i,k) of A fixed; row (k,*) of B accessed row-wise; row (i,*) of C accessed row-wise.

Misses per Inner Loop Iteration:
A       B       C
0.0     0.25    0.25

Page 34:

Matrix Multiplication (ikj)

/* ikj */
for (i=0; i<n; i++) {
    for (k=0; k<n; k++) {
        r = a[i][k];
        for (j=0; j<n; j++)
            c[i][j] += r * b[k][j];
    }
}

Inner loop: element (i,k) of A fixed; row (k,*) of B accessed row-wise; row (i,*) of C accessed row-wise.

Misses per Inner Loop Iteration:
A       B       C
0.0     0.25    0.25

Page 35:

Matrix Multiplication (jki)

/* jki */
for (j=0; j<n; j++) {
    for (k=0; k<n; k++) {
        r = b[k][j];
        for (i=0; i<n; i++)
            c[i][j] += a[i][k] * r;
    }
}

Inner loop: column (*,k) of A accessed column-wise; element (k,j) of B fixed; column (*,j) of C accessed column-wise.

Misses per Inner Loop Iteration:
A       B       C
1.0     0.0     1.0

Page 36:

Matrix Multiplication (kji)

/* kji */
for (k=0; k<n; k++) {
    for (j=0; j<n; j++) {
        r = b[k][j];
        for (i=0; i<n; i++)
            c[i][j] += a[i][k] * r;
    }
}

Inner loop: column (*,k) of A accessed column-wise; element (k,j) of B fixed; column (*,j) of C accessed column-wise.

Misses per Inner Loop Iteration:
A       B       C
1.0     0.0     1.0

Page 37:

Summary of Matrix Multiplication

ijk (& jik):
• 2 loads, 0 stores
• misses/iter = 1.25

for (i=0; i<n; i++) {
    for (j=0; j<n; j++) {
        sum = 0.0;
        for (k=0; k<n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}

kij (& ikj):
• 2 loads, 1 store
• misses/iter = 0.5

for (k=0; k<n; k++) {
    for (i=0; i<n; i++) {
        r = a[i][k];
        for (j=0; j<n; j++)
            c[i][j] += r * b[k][j];
    }
}

jki (& kji):
• 2 loads, 1 store
• misses/iter = 2.0

for (j=0; j<n; j++) {
    for (k=0; k<n; k++) {
        r = b[k][j];
        for (i=0; i<n; i++)
            c[i][j] += a[i][k] * r;
    }
}

Page 38:

Pentium Matrix Multiply Performance

• Miss rates are helpful but not perfect predictors.
• Combination of miss rate & load/stores

[Chart: cycles/iteration (0-60) vs. array size n (25-400) for the six loop orders kji, jki, kij, ikj, jik, and ijk.]

Page 39:

Improving Temporal Locality by Blocking

• Example: Blocked matrix multiplication
  – "block" (in this context) does not mean "cache block".
  – Instead, it means a sub-block within the matrix.
  – Example: N = 8; sub-block size = 4

    [A11 A12]   [B11 B12]   [C11 C12]
    [A21 A22] x [B21 B22] = [C21 C22]

C11 = A11B11 + A12B21    C12 = A11B12 + A12B22
C21 = A21B11 + A22B21    C22 = A21B12 + A22B22

Key idea: Sub-blocks (i.e., Axy) can be treated just like scalars.

C11 = 0; C12 = 0; C21 = 0; C22 = 0
C11 += A11B11
C21 += A21B11
C11 += A12B21
C21 += A22B21
C12 += A11B12
C22 += A21B12
C12 += A12B22
C22 += A22B22

Page 40:

Blocked Matrix Multiply (bijk)

for (jj=0; jj<n; jj+=bsize) {
    for (i=0; i<n; i++)
        for (j=jj; j < min(jj+bsize,n); j++)
            c[i][j] = 0.0;
    for (kk=0; kk<n; kk+=bsize) {
        for (i=0; i<n; i++) {
            for (j=jj; j < min(jj+bsize,n); j++) {
                sum = 0.0;
                for (k=kk; k < min(kk+bsize,n); k++) {
                    sum += a[i][k] * b[k][j];
                }
                c[i][j] += sum;
            }
        }
    }
}

Page 41:

Blocked Matrix Multiply Analysis

– Innermost loop pair multiplies a 1 X bsize sliver of A by a bsize X bsize block of B and accumulates into a 1 X bsize sliver of C
– Loop over i steps through n row slivers of A & C, using same B

Innermost Loop Pair:

for (i=0; i<n; i++) {
    for (j=jj; j < min(jj+bsize,n); j++) {
        sum = 0.0;
        for (k=kk; k < min(kk+bsize,n); k++) {
            sum += a[i][k] * b[k][j];
        }
        c[i][j] += sum;
    }
}

[Figure: the row sliver of A is accessed bsize times; the block of B is reused n times in succession; successive elements of the C sliver are updated.]

Page 42:

Intuition For Blocking

• Matrix multiply takes n^3 multiplications
• Every value of A is used in n multiplications (there are n^2 many)
• Every value of B is used in n multiplications (ditto)
• Flop-to-memory ratio: n^3/(n^2+n^2+n^2)
• Note access sequence for B:
  – b[k][j], b[k+1][j], …, b[k+bsize-1][j],
    b[k][j+1], b[k+1][j+1], …, b[k+bsize-1][j+1],
    …,
    b[k][j+bsize-1], b[k+1][j+bsize-1], …, b[k+bsize-1][j+bsize-1]

Page 43:

Access pattern for arrays A, B, C in C = A * B unblocked (ijk)

Page 44:

Access pattern for arrays A, B, C in C = A * B blocked (bijk)

Page 45:

Pentium Blocked Matrix Multiply Performance

• Blocking (bijk and bikj) improves performance by a factor of two over unblocked versions (ijk and jik)
  – relatively insensitive to array size.

[Chart: cycles/iteration (0-60) vs. array size n for kji, jki, kij, ikj, jik, ijk, bijk (bsize = 25), and bikj (bsize = 25).]

Page 46:

Concluding Observations

• Programmer can optimize for cache performance
  – How data structures are organized
  – How data are accessed
    • Nested loop structure
    • Blocking is a general technique
• All systems favor "cache friendly code"
  – Getting absolute optimum performance is very platform specific
    • Cache sizes, line sizes, associativity, etc.
  – Can get most of the advantage with generic code
    • Keep working set reasonably small (temporal locality)
    • Use small strides (spatial locality)

Page 47:

Sparse Problems

• Dense matrix multiply is suitable for blocking because of its high flop-to-memory ratio
• Some problems aren't. Those problems are much tougher to optimize.
• Optimizing Sparse Matrix Vector Multiply
  – See Belgin, Back, Ribbens ICS '09
