tuesday, september 19, 2006

Post on 15-Jan-2016


1

Tuesday, September 19, 2006

The practical scientist is trying to solve tomorrow's problem on yesterday's computer. Computer scientists often have it the other way around.

- Numerical Recipes, C Edition

2

Reference Material

Lectures 1 & 2
• “Parallel Computer Architecture” by David Culler et al., Chapter 1.
• “Sourcebook of Parallel Computing” by Jack Dongarra et al., Chapters 1 and 2.
• “Introduction to Parallel Computing” by Grama et al., Chapter 1 and Chapter 2 §2.4.
• www.top500.org

Lecture 3
• “Introduction to Parallel Computing” by Grama et al., Chapter 2 §2.3.
• “Introduction to Parallel Computing”, Lawrence Livermore National Laboratory, http://www.llnl.gov/computing/tutorials/parallel_comp/

Lectures 4 & 5
• “Techniques for Optimizing Applications” by Garg et al., Chapter 9.
• “Software Optimizations for High Performance Computing” by Wadleigh et al., Chapter 5.
• “Introduction to Parallel Computing” by Grama et al., Chapter 2 §2.1-2.2.

3

Software Optimizations

Optimize serial code before parallelizing it.

4

Loop Unrolling

do i=1,n

A(i)=B(i)

enddo

do i=1,n,4

A(i)=B(i)

A(i+1)=B(i+1)

A(i+2)=B(i+2)

A(i+3)=B(i+3)

enddo

• Unrolled by 4.
• Some compilers allow users to specify unrolling depth.
• Avoid excessive unrolling: register pressure / spills can hurt performance.
• Pipelining to hide instruction latencies.
• Reduces overhead of index increment and conditional check.

Assumption: n is divisible by 4.
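When n is not known to be divisible by 4, a cleanup loop can handle the leftover elements. The following is a minimal C sketch of that idea, not the slide author's code; the function name copy_unrolled4 and the use of double arrays are assumptions made for illustration.

```c
#include <stddef.h>

/* Copy b into a with the loop body unrolled by 4.
   The second loop handles the n % 4 leftover elements,
   so n does not have to be divisible by 4.
   (Hypothetical helper, for illustration only.) */
void copy_unrolled4(double *a, const double *b, size_t n)
{
    size_t nend = n - (n % 4);   /* largest multiple of 4 that is <= n */
    size_t i;

    for (i = 0; i < nend; i += 4) {
        a[i]     = b[i];
        a[i + 1] = b[i + 1];
        a[i + 2] = b[i + 2];
        a[i + 3] = b[i + 3];
    }
    for (i = nend; i < n; i++)   /* cleanup loop */
        a[i] = b[i];
}
```

Optimizing compilers often perform this transformation themselves, so a hand-unrolled version is worth timing against the plain loop before it is kept.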

5

Loop Unrolling

do j=1 to N

do i = 1 to N

Z[i,j]=Z[i,j]+X[i]*Y[j]

enddo

enddo

Unroll outer loop by 2

6

Loop Unrolling

do j=1 to N

do i = 1 to N

Z[i,j]=Z[i,j]+X[i]*Y[j]

enddo

enddo

do j=1 to N step 2

do i = 1 to N

Z[i,j]=Z[i,j]+X[i]*Y[j]

Z[i,j+1]=Z[i,j+1]+X[i]*Y[j+1]

enddo

enddo

7

Loop Unrolling

do j=1 to N

do i = 1 to N

Z[i,j]=Z[i,j]+X[i]*Y[j]

enddo

enddo

do j=1 to N step 2

do i = 1 to N

Z[i,j]=Z[i,j]+X[i]*Y[j]

Z[i,j+1]=Z[i,j+1]+X[i]*Y[j+1]

enddo

enddo

The number of load operations can be reduced, e.g. half as many loads of X: each X[i] is loaded once and used for both column j and column j+1.
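The same outer-loop unrolling can be sketched in C. This is an illustrative version under stated assumptions: the function name rank1_update_unroll2 is hypothetical, Z is stored as a flat row-major array, and N is assumed to be even, as in the pseudocode.

```c
#include <stddef.h>

/* Z += x * y^T with the outer (j) loop unrolled by 2.
   Assumes n is even. z is an n-by-n matrix in a flat array;
   element (i, j) is z[i * n + j].
   (Hypothetical helper, for illustration only.) */
void rank1_update_unroll2(double *z, const double *x, const double *y, size_t n)
{
    for (size_t j = 0; j < n; j += 2) {
        for (size_t i = 0; i < n; i++) {
            double xi = x[i];   /* one load of x[i] serves two columns */
            z[i * n + j]     += xi * y[j];
            z[i * n + j + 1] += xi * y[j + 1];
        }
    }
}
```

Holding x[i] in a local variable makes the reuse explicit; a compiler would usually keep it in a register either way.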

8

Loop Fusion

• Beneficial in loop-intensive programs.
• Decreases index calculation overhead.
• Can also help in instruction-level parallelism.
• Beneficial if the same data structures are used in different loops.

9

Loop Fusion

for (i=0; i<n; i++)

temp[i] =x[i]*y[i];

for (i=0; i<n; i++)

z[i] =w[i]+temp[i];

10

Loop Fusion

for (i=0; i<n; i++)

temp[i] =x[i]*y[i];

for (i=0; i<n; i++)

z[i] =w[i]+temp[i];

for (i=0; i<n; i++)

z[i] =x[i]*y[i]+w[i];

Check for register pressure before fusing
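For comparison, here are the two versions as complete C functions. This is a sketch under assumptions: the function names and the double element type are chosen for illustration and do not come from the slides.

```c
#include <stddef.h>

/* Unfused: two passes over the data, with temp[] written by the
   first loop and read back by the second.
   (Hypothetical helper, for illustration only.) */
void multiply_add_two_loops(double *z, double *temp, const double *x,
                            const double *y, const double *w, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        temp[i] = x[i] * y[i];
    for (i = 0; i < n; i++)
        z[i] = w[i] + temp[i];
}

/* Fused: one pass, no temporary array, one set of loop overhead. */
void multiply_add_fused(double *z, const double *x, const double *y,
                        const double *w, size_t n)
{
    for (size_t i = 0; i < n; i++)
        z[i] = x[i] * y[i] + w[i];
}
```

The fused version also removes the store and reload of temp[], which is where much of the saving tends to come from when the arrays do not fit in cache.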

11

Loop Fission

• Conditional statements can hurt pipelining.
• Split the loop into two: one with the conditional statements and the other without.
• The compiler can apply optimizations such as unrolling in the condition-free loop.
• Beneficial for fat loops that may lead to register spills.

12

Loop Fission

for (i=0;i<nodes;i++) {

a[i] = a[i]*small;

dtime = a[i] + b[i];

dtime = fabs(dtime*ratinpmt);

temp1[i] = dtime*relaxn;

if(temp1[i] > hgreat) {

temp1[i]=1;

}

}

13

Loop Fission

Original loop:

for (i=0;i<nodes;i++) {
    a[i] = a[i]*small;
    dtime = a[i] + b[i];
    dtime = fabs(dtime*ratinpmt);
    temp1[i] = dtime*relaxn;
    if(temp1[i] > hgreat) {
        temp1[i]=1;
    }
}

After fission:

for (i=0;i<nodes;i++) {
    a[i] = a[i]*small;
    dtime = a[i] + b[i];
    dtime = fabs(dtime*ratinpmt);
    temp1[i] = dtime*relaxn;
}
for (i=0;i<nodes;i++) {
    if(temp1[i] > hgreat) {
        temp1[i]=1;
    }
}
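The fissioned loop can be wrapped into a self-contained C function as sketched below. The function name relax_fissioned and the choice to pass small, ratinpmt, relaxn and hgreat as parameters are assumptions; the slides do not show how those variables are declared.

```c
#include <math.h>
#include <stddef.h>

/* Fissioned form of the loop: the first loop is free of conditionals,
   the second contains only the conditional clamp.
   (Hypothetical wrapper; variable names follow the slide.) */
void relax_fissioned(double *a, const double *b, double *temp1, size_t nodes,
                     double small, double ratinpmt, double relaxn, double hgreat)
{
    /* Loop 1: straight-line arithmetic, easy to unroll and pipeline. */
    for (size_t i = 0; i < nodes; i++) {
        a[i] = a[i] * small;
        double dtime = fabs((a[i] + b[i]) * ratinpmt);
        temp1[i] = dtime * relaxn;
    }
    /* Loop 2: only the conditional update remains. */
    for (size_t i = 0; i < nodes; i++) {
        if (temp1[i] > hgreat)
            temp1[i] = 1;
    }
}
```

The split is valid here because the conditional reads and writes only temp1[i], which the first loop has already finished computing.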

14

Reductions

for (i=0; i<n; i++)

{

sum +=x[i];

}

Normally a single register would be used for reduction variable.

Hide floating point instruction latency?

15

Reductions

for (i=0; i<n; i++)

{

sum +=x[i];

}

sum1=sum2=sum3=sum4=0.0;

nend = (n>>2)<<2;

for (i=0; i<nend; i+=4){

sum1 +=x[i];

sum2 +=x[i+1];

sum3 +=x[i+2];

sum4 +=x[i+3];

}

sumx = sum1 + sum2+ sum3 + sum4;

for (i=nend; i<n; i++)

sumx += x[i];
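Put together as a complete C function, the four-accumulator reduction might look like the sketch below. The function name sum_four_accumulators is an assumption made for illustration.

```c
#include <stddef.h>

/* Sum x[0..n-1] using four independent partial sums, so that
   consecutive additions do not wait on each other's result.
   (Hypothetical helper, for illustration only.) */
double sum_four_accumulators(const double *x, size_t n)
{
    double sum1 = 0.0, sum2 = 0.0, sum3 = 0.0, sum4 = 0.0;
    size_t nend = (n >> 2) << 2;   /* n rounded down to a multiple of 4 */
    size_t i;

    for (i = 0; i < nend; i += 4) {
        sum1 += x[i];
        sum2 += x[i + 1];
        sum3 += x[i + 2];
        sum4 += x[i + 3];
    }
    double sumx = sum1 + sum2 + sum3 + sum4;
    for (i = nend; i < n; i++)     /* leftover elements */
        sumx += x[i];
    return sumx;
}
```

Because floating point addition is not associative, this version can differ from the single-accumulator sum in the last bits. That is also why compilers generally do not apply this transformation unless relaxed floating point options are enabled.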

16

a**0.5 vs sqrt(a)

17

a**0.5 vs sqrt(a)

Appropriate include files can help in generating faster code, e.g. math.h.

18

The time to access memory has not kept pace with CPU clock speeds.

A program can perform suboptimally because the data it operates on are not delivered from memory to registers by the time the processor is ready to use them.

Wasted CPU cycles: CPU starvation

19

20

Ability of the memory system to feed data to the processor:

• Memory latency
• Memory bandwidth

21

Effect of Memory Latency

• 1 GHz processor (1 ns clock)
• Capable of executing 4 instructions in each cycle of 1 ns
• DRAM with latency 100 ns
• Cache block size: 1 word
• Peak processor rating?

22

Effect of Memory Latency

• 1 GHz processor (1 ns clock)
• Capable of executing 4 instructions in each cycle of 1 ns
• DRAM with latency 100 ns (no caches)
• Memory block: 1 word
• Peak processor rating: 4 GFLOPS

23

Effect of Memory Latency

• 1 GHz processor (1 ns clock)
• Capable of executing 4 instructions in each cycle of 1 ns
• DRAM with latency 100 ns (no caches)
• Memory block: 1 word
• Peak processor rating: 4 GFLOPS
• Dot product of two vectors
• Peak speed of computation?

24

Effect of Memory Latency

• 1 GHz processor (1 ns clock)
• Capable of executing 4 instructions in each cycle of 1 ns
• DRAM with latency 100 ns (no caches)
• Memory block: 1 word
• Peak processor rating: 4 GFLOPS
• Dot product of two vectors
• Peak speed of computation? One floating point operation every 100 ns, i.e. a speed of 10 MFLOPS

25

Effect of Memory Latency: Introduce Cache

• 1 GHz processor (1 ns clock)
• Capable of executing 4 instructions in each cycle of 1 ns
• DRAM with latency 100 ns
• Memory block: 1 word
• Cache: 32 KB with 1 ns latency
• Multiply two matrices A and B of 32x32 words with result in C. (Note: the previous example had no data reuse.)
• Assume ideal cache placement and enough capacity to hold A, B and C

26

Effect of Memory Latency: Introduce Cache

Multiply two matrices A and B of 32x32 words with result in C

• 32x32 = 1K words
• Total operations and total time taken?

27

Effect of Memory Latency: Introduce Cache

Multiply two matrices A and B of 32x32 words with result in C

• 32x32 = 1K words
• Total operations and total time taken?
• Two matrices = 2K words to fetch
• Multiplying two matrices requires 2n^3 operations

28

Effect of Memory Latency: Introduce Cache

• Multiply two matrices A and B of 32x32 words with result in C
• 32x32 = 1K words
• Two matrices = 2K words, which require 2K * 100 ns ≈ 200 µs to fetch
• Multiplying two matrices requires 2n^3 operations = 2 * 32^3 = 64K operations
• At 4 operations per cycle we need 64K/4 cycles = 16K cycles ≈ 16 µs
• Total time ≈ 200 + 16 = 216 µs
• Computation rate = 64K operations / 216 µs ≈ 303 MFLOPS
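The estimate can be reproduced with a few lines of C. This is a sketch of the slide's simplified model only (each word of A and B fetched once at full DRAM latency, all arithmetic then served from cache); the function name cached_matmul_mflops and its parameters are assumptions made for illustration.

```c
/* Estimated MFLOPS for an n-by-n matrix multiply under the slide's model:
   A and B are each fetched once from DRAM, one word per latency_ns,
   then all 2*n^3 operations run at ops_per_cycle per cycle of cycle_ns.
   (Hypothetical helper, for illustration only.) */
double cached_matmul_mflops(int n, double latency_ns, double cycle_ns,
                            double ops_per_cycle)
{
    double words      = 2.0 * n * n;              /* A and B */
    double fetch_ns   = words * latency_ns;
    double ops        = 2.0 * n * n * n;          /* multiplies and adds */
    double compute_ns = ops / ops_per_cycle * cycle_ns;

    /* operations per nanosecond, scaled to millions per second */
    return 1000.0 * ops / (fetch_ns + compute_ns);
}
```

Calling cached_matmul_mflops(32, 100.0, 1.0, 4.0) uses the exact counts (2048 words, 65536 operations) and gives about 296 MFLOPS. The 303 MFLOPS figure follows from rounding the two times to 200 µs and 16 µs.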

29

Effect of Memory Bandwidth

• 1 GHz processor (1 ns clock)
• Capable of executing 4 instructions in each cycle of 1 ns
• DRAM with latency 100 ns
• Memory block: 4 words
• Cache: 32 KB with 1 ns latency
• Dot product example again
• Bandwidth increased 4-fold
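A rough way to see the effect of the wider memory block on the dot product is to compute the rate as a function of block size. The model below is an assumption, not taken from the slides: each fetch of one block costs the full DRAM latency, one block is needed from each vector, every fetched element pair contributes one multiply and one add, and compute time is ignored. The function name dot_product_mflops is hypothetical.

```c
/* Latency-bound MFLOPS for a dot product when each memory access
   returns words_per_block consecutive words after latency_ns.
   (Hypothetical helper based on a simplified model.) */
double dot_product_mflops(double latency_ns, int words_per_block)
{
    double flops    = 2.0 * words_per_block;  /* one multiply and one add per element pair */
    double fetch_ns = 2.0 * latency_ns;       /* one block from each vector */

    /* operations per nanosecond, scaled to millions per second */
    return 1000.0 * flops / fetch_ns;
}
```

Under this model a 1-word block gives 10 MFLOPS and a 4-word block gives 40 MFLOPS, matching the 4-fold bandwidth increase. The gain depends on the vectors being stored contiguously so that every word in a fetched block is actually used.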

30

Reduce cache misses.

• Spatial locality
• Temporal locality

31

Impact of strided access

for (i=0; i<1000; i++) {
    column_sum[i] = 0.0;
    for (j=0; j<1000; j++)
        column_sum[i] += b[j][i];
}

32

Eliminating strided access

for (i=0; i<1000; i++)

column_sum[i] = 0.0;

for(j=0; j<1000; j++)

for (i=0; i<1000; i++)

column_sum[i]+= b[j][i];

Assumption: Vector column_sum is retained in the cache
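Both access orders can be written as C functions that produce identical results, which makes them easy to time against each other. This is a sketch under assumptions: the function names are hypothetical, and b is passed as a flat row-major array so that b[j][i] becomes b[j * n + i].

```c
#include <stddef.h>

/* Strided version: for each column i, walk down the rows.
   Consecutive accesses to b are n elements apart.
   (Hypothetical helper, for illustration only.) */
void column_sums_strided(double *column_sum, const double *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        column_sum[i] = 0.0;
        for (size_t j = 0; j < n; j++)
            column_sum[i] += b[j * n + i];
    }
}

/* Unit-stride version: walk each row of b from left to right and
   accumulate into every column sum along the way. */
void column_sums_unit_stride(double *column_sum, const double *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        column_sum[i] = 0.0;
    for (size_t j = 0; j < n; j++)
        for (size_t i = 0; i < n; i++)
            column_sum[i] += b[j * n + i];
}
```

Each column sum receives its terms in the same order (j = 0, 1, ...) in both versions, so the results match exactly; only the order in which memory is touched differs.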

33

do i = 1, N

do j = 1, N

A[i] =A[i] + B[j]

enddo

enddo

N is large so B[j] cannot remain in cache until it is used again in another iteration of outer loop.

Little reuse between touches

How many cache misses for A and B?
