Tuesday, September 19, 2006

"The practical scientist is trying to solve tomorrow's problem on yesterday's computer. Computer scientists often have it the other way around."

- Numerical Recipes, C Edition

Reference Material

Lectures 1 & 2

• “Parallel Computer Architecture” by David Culler et al., Chapter 1.
• “Sourcebook of Parallel Computing” by Jack Dongarra et al., Chapters 1 and 2.
• Introduction to Parallel Computing by Grama et al., Chapter 1 and Chapter 2 §2.4.
• www.top500.org

Lecture 3

• Introduction to Parallel Computing by Grama et al., Chapter 2 §2.3
• Introduction to Parallel Computing, Lawrence Livermore National Laboratory, http://www.llnl.gov/computing/tutorials/parallel_comp/

Lectures 4 & 5

• “Techniques for Optimizing Applications” by Garg et al., Chapter 9
• “Software Optimizations for High Performance Computing” by Wadleigh et al., Chapter 5
• Introduction to Parallel Computing by Grama et al., Chapter 2 §2.1-

Software Optimizations

Optimize serial code before parallelizing it.

Loop Unrolling

Original loop:

    do i=1,n
      A(i)=B(i)
    enddo

Unrolled by 4 (assumption: n is divisible by 4):

    do i=1,n,4
      A(i)=B(i)
      A(i+1)=B(i+1)
      A(i+2)=B(i+2)
      A(i+3)=B(i+3)
    enddo

• Reduces the overhead of the index increment and conditional check.
• Helps pipelining hide instruction latencies.
• Some compilers allow users to specify the unrolling depth.
• Avoid excessive unrolling: register pressure and spills can hurt performance.

Loop Unrolling

Original nested loop:

    do j=1 to N
      do i = 1 to N
        Z[i,j]=Z[i,j]+X[i]*Y[j]
      enddo
    enddo

Unroll outer loop by 2 (assumes N is even):

    do j=1 to N step 2
      do i = 1 to N
        Z[i,j]=Z[i,j]+X[i]*Y[j]
        Z[i,j+1]=Z[i,j+1]+X[i]*Y[j+1]
      enddo
    enddo

The number of load operations can be reduced, e.g. half as many loads of X.

Loop Fusion

• Beneficial in loop-intensive programs.
• Decreases index calculation overhead.
• Can also help instruction-level parallelism.
• Beneficial if the same data structures are used in different loops.

Loop Fusion

Before fusion:

    for (i=0; i<n; i++)
        temp[i] = x[i]*y[i];

    for (i=0; i<n; i++)
        z[i] = w[i]+temp[i];

After fusion:

    for (i=0; i<n; i++)
        z[i] = x[i]*y[i]+w[i];

Check for register pressure before fusing.

Loop Fission

• Conditional statements can hurt pipelining.
• Split the loop into two: one with the conditional statements and one without.
• The compiler can apply optimizations such as unrolling to the condition-free loop.
• Beneficial for fat loops that may lead to register spills.

Loop Fission

Original loop:

    for (i=0; i<nodes; i++) {
        a[i] = a[i]*small;
        dtime = a[i] + b[i];
        dtime = fabs(dtime*ratinpmt);
        temp1[i] = dtime*relaxn;
        if (temp1[i] > hgreat) {
            temp1[i] = 1;
        }
    }

After fission:

    for (i=0; i<nodes; i++) {
        a[i] = a[i]*small;
        dtime = a[i] + b[i];
        dtime = fabs(dtime*ratinpmt);
        temp1[i] = dtime*relaxn;
    }
    for (i=0; i<nodes; i++) {
        if (temp1[i] > hgreat) {
            temp1[i] = 1;
        }
    }

Reductions

    for (i=0; i<n; i++)
        sum += x[i];

• Normally a single register would be used for the reduction variable.
• How can floating-point instruction latency be hidden?

Use several independent partial sums so that consecutive additions do not depend on each other:

    sum1 = sum2 = sum3 = sum4 = 0.0;
    nend = (n>>2)<<2;    /* n rounded down to a multiple of 4 */
    for (i=0; i<nend; i+=4) {
        sum1 += x[i];
        sum2 += x[i+1];
        sum3 += x[i+2];
        sum4 += x[i+3];
    }
    sumx = sum1 + sum2 + sum3 + sum4;
    for (i=nend; i<n; i++)    /* leftover elements */
        sumx += x[i];

a**0.5 vs sqrt(a)

• Appropriate include files can help in generating faster code, e.g. math.h

The time to access memory has not kept pace with CPU clock speeds.

A program can perform suboptimally because the data it operates on has not been delivered from memory to registers by the time the processor is ready to use it.

• Wasted CPU cycles: CPU starvation
• What matters is the ability of the memory system to feed data to the processor:
  - Memory latency
  - Memory bandwidth

Effect of Memory Latency

• 1 GHz processor (1 ns clock)
• Capable of executing 4 instructions in each 1 ns cycle
• DRAM with latency 100 ns (no caches)
• Memory block size: 1 word
• Peak processor rating: 4 GFLOPS
• Dot product of two vectors: peak speed of computation?
  One floating-point operation every 100 ns, i.e. a speed of 10 MFLOPS
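The 10 MFLOPS figure follows from simple arithmetic, assuming (as the slide does) that each floating-point operation has to wait for one 100 ns memory fetch:

```latex
\[
\text{rate} = \frac{1\ \text{FLOP}}{100\ \text{ns}} = 10^{7}\ \text{FLOP/s} = 10\ \text{MFLOPS},
\qquad
\frac{10\ \text{MFLOPS}}{4\ \text{GFLOPS}} = 0.25\%\ \text{of peak}.
\]
```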

Effect of Memory Latency: Introduce Cache

• 1 GHz processor (1 ns clock), capable of executing 4 instructions in each 1 ns cycle
• DRAM with latency 100 ns
• Memory block size: 1 word
• Cache: 32 KB with 1 ns latency
• Multiply two matrices A and B of 32x32 words, with the result in C. (Note: the previous example had no data reuse.)
• Assume ideal cache placement and enough capacity to hold A, B and C.

Total operations and total time taken?

• 32x32 = 1K words per matrix
• Fetching the two input matrices (2K words) requires 2K * 100 ns = 200 µs.
• Multiplying two n x n matrices requires 2n^3 operations = 2 * 32^3 = 64K operations.
• At 4 operations per cycle, we need 64K/4 cycles = 16 µs.
• Total time = 200 + 16 µs
• Computation rate = 64K operations / (200 + 16 µs) = 303 MFLOPS
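Written out step by step, the estimate looks as follows. This is a sketch of the slide's arithmetic under its own rounding: K is taken as 1024 for the operation count, while the two time terms are rounded to 200 µs and 16 µs.

```latex
\begin{align*}
t_{\text{fetch}}   &= 2\text{K words} \times 100\ \text{ns} \approx 200\ \mu\text{s} \\
\text{operations}  &= 2n^{3} = 2 \times 32^{3} = 64\text{K} = 65536 \\
t_{\text{compute}} &= \frac{64\text{K operations}}{4\ \text{operations per cycle}} \times 1\ \text{ns} \approx 16\ \mu\text{s} \\
\text{rate}        &= \frac{65536\ \text{operations}}{(200 + 16)\ \mu\text{s}} \approx 303\ \text{MFLOPS}
\end{align*}
```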

Effect of Memory Bandwidth

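• 1 GHz processor (1 ns clock), capable of executing 4 instructions in each 1 ns cycle
• DRAM with latency 100 ns
• Memory block size: 4 words
• Cache: 32 KB with 1 ns latency
• Dot product example again
• Bandwidth increased 4-fold

Reduce cache misses.

• Spatial locality
• Temporal locality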

Impact of strided access

    for (i=0; i<1000; i++) {
        column_sum[i] = 0.0;
        for (j=0; j<1000; j++)
            column_sum[i] += b[j][i];    /* walks down a column: stride of one row */
    }

Eliminating strided access

    for (i=0; i<1000; i++)
        column_sum[i] = 0.0;

    for (j=0; j<1000; j++)
        for (i=0; i<1000; i++)
            column_sum[i] += b[j][i];    /* walks along a row: stride 1 */

Assumption: the vector column_sum is retained in the cache.

    do i = 1, N
      do j = 1, N
        A[i] = A[i] + B[j]
      enddo
    enddo

• N is large, so B[j] cannot remain in cache until it is used again in the next iteration of the outer loop.
• Little reuse between touches.
• How many cache misses for A and B?
