
Performance analysis


An example of time measurements

[Figure: measured execution times for increasing p. Dark grey: time spent on computation, decreasing with p. White: time spent on communication, increasing with p.]


Objectives of the lecture

How to roughly predict the computing time as a function of p?

How to analyze parallel execution times?

Understand the limits of using more processors

Chapter 7 of Michael J. Quinn, Parallel Programming in C with MPI and OpenMP


Notation

n : problem size
p : number of processors
σ(n) : inherently sequential computation
ϕ(n) : parallelizable computation
κ(n, p) : parallelization overhead

Speedup: $$\Psi(n, p) = \frac{\text{sequential execution time}}{\text{parallel execution time}}$$

Efficiency: $$\varepsilon(n, p) = \frac{\text{sequential execution time}}{\text{processors used} \times \text{parallel execution time}}$$
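A quick aside (not from the slides): both quantities are trivial to compute from measured wall-clock times. A minimal C sketch, with purely illustrative timing values:

```c
#include <stdio.h>

int main(void)
{
    /* Hypothetical measurements: t1 = sequential time,
       tp = parallel time on p processors. */
    double t1 = 10.0, tp = 2.9;
    int p = 4;

    double speedup    = t1 / tp;       /* Psi(n, p) */
    double efficiency = speedup / p;   /* eps(n, p) */

    printf("speedup = %.2f, efficiency = %.2f\n", speedup, efficiency);
    return 0;
}
```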


Simple observations

Sequential execution time $= \sigma(n) + \varphi(n)$

Parallel execution time $\ge \sigma(n) + \varphi(n)/p + \kappa(n, p)$

Speedup: $$\Psi(n, p) \le \frac{\sigma(n) + \varphi(n)}{\sigma(n) + \varphi(n)/p + \kappa(n, p)}$$

Efficiency: $$\varepsilon(n, p) \le \frac{\sigma(n) + \varphi(n)}{p\,\sigma(n) + \varphi(n) + p\,\kappa(n, p)}$$


Amdahl’s Law

If the parallel overhead κ(n, p) is neglected, then

$$\Psi(n, p) \le \frac{\sigma(n) + \varphi(n)}{\sigma(n) + \varphi(n)/p}$$

Suppose we know the inherently sequential portion of the computation,

$$f = \frac{\sigma(n)}{\sigma(n) + \varphi(n)}$$

Can we predict the speedup Ψ(n, p)?


Amdahl’s Law

Note that $f = \frac{\sigma(n)}{\sigma(n) + \varphi(n)}$ means

$$1 - f = \frac{\varphi(n)}{\sigma(n) + \varphi(n)}$$

Therefore

$$\Psi(n, p) \le \frac{\sigma(n) + \varphi(n)}{\sigma(n) + \varphi(n)/p}
= \frac{1}{\frac{\sigma(n)}{\sigma(n) + \varphi(n)} + \frac{\varphi(n)}{\sigma(n) + \varphi(n)} \cdot \frac{1}{p}}
= \frac{1}{f + (1 - f)/p}$$


Amdahl’s Law

Amdahl's Law: if $f = \sigma/(\sigma + \varphi)$ is known, then the best achievable speedup can be estimated as

$$\Psi \le \frac{1}{f + (1 - f)/p}$$

Upper limit (when $p$ goes to infinity): $\Psi \le \frac{1}{f + (1 - f)/p} < \frac{1}{f}$
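To experiment with the bound, here is a minimal C sketch (an illustration, not from the slides; amdahl_bound is a name chosen here):

```c
#include <stdio.h>

/* Amdahl's Law: upper bound on speedup for serial fraction f
   and p processors. */
double amdahl_bound(double f, int p)
{
    return 1.0 / (f + (1.0 - f) / p);
}

int main(void)
{
    /* With f = 0.1 the bound approaches 1/f = 10 as p grows
       (cf. Example 1 below). */
    for (int p = 1; p <= 1024; p *= 2)
        printf("p = %4d  Psi <= %.2f\n", p, amdahl_bound(0.1, p));
    return 0;
}
```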


Amdahl's Law

[Figure: illustration of Amdahl's Law (image not preserved in this transcript).]


Example 1

If we know that 90% of the computation can be parallelized, what is the maximum speedup we can expect from using 8 processors?

Solution: Since f = 10%, Amdahl's Law tells us that for p = 8

$$\Psi \le \frac{1}{0.1 + (1 - 0.1)/8} \approx 4.7$$


Example 2

If 25% of the operations in a parallel program must be performed sequentially, what is the maximum speedup achievable?

Solution: The maximum speedup is

$$\lim_{p \to \infty} \frac{1}{0.25 + (1 - 0.25)/p} = \frac{1}{0.25} = 4$$


Example 3

Suppose

$$\sigma(n) = 18000 + n, \qquad \varphi(n) = \frac{n^2}{100}$$

What is the maximum speedup achievable on a problem of size $n = 10000$?


Example 3 (cont’d)

Solution: Since

$$\Psi(n, p) \le \frac{\sigma(n) + \varphi(n)}{\sigma(n) + \varphi(n)/p}$$

and we know

$$\sigma(10000) = 28{,}000, \qquad \varphi(10000) = 1{,}000{,}000$$

we get

$$\Psi(10000, p) \le \frac{28{,}000 + 1{,}000{,}000}{28{,}000 + 1{,}000{,}000/p}$$
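A short C sketch (illustrative, not from the slides) that evaluates this bound for several p; as p grows, the bound approaches (28,000 + 1,000,000)/28,000 ≈ 36.7:

```c
#include <stdio.h>

int main(void)
{
    const double sigma = 28000.0;    /* sigma(10000) = 18000 + 10000 */
    const double phi   = 1000000.0;  /* phi(10000)   = 10000^2 / 100 */

    for (int p = 1; p <= 1024; p *= 2)
        printf("p = %4d  Psi <= %.2f\n",
               p, (sigma + phi) / (sigma + phi / p));

    /* The limit as p goes to infinity: */
    printf("p -> inf  Psi <= %.2f\n", (sigma + phi) / sigma);
    return 0;
}
```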


Comments about Amdahl’s Law

Parallelization overhead κ(n, p) is ignored by Amdahl's Law, so Amdahl's Law tends to give an overly optimistic estimate of Ψ.

The problem size n is held constant for p = 1 and for increasing values of p; Amdahl's Law does not consider solving larger problems with more processors.

The inherently sequential portion f may decrease greatly when n is increased.

As a result, Amdahl's Law ($\Psi < \frac{1}{f}$) can be unnecessarily pessimistic for large problems.


Gustafson–Barsis’s Law

What if we want to solve larger problems as the number of processors p is increased?

In that case we may not know the computing time needed by a single processor, because the problem size is too big for one processor.

However, suppose we know the fraction of time spent by a parallel program (using p processors) on performing inherently sequential operations. Can we then estimate the speedup Ψ?


Gustafson–Barsis’s Law

Definition: s is the fraction of time spent by a parallel computation using p processors on performing inherently sequential operations. More specifically,

$$s = \frac{\sigma(n)}{\sigma(n) + \varphi(n)/p}$$

and

$$1 - s = \frac{\varphi(n)/p}{\sigma(n) + \varphi(n)/p}$$


Gustafson–Barsis’s Law

We note

$$\sigma(n) = \bigl(\sigma(n) + \varphi(n)/p\bigr)\,s, \qquad \varphi(n) = \bigl(\sigma(n) + \varphi(n)/p\bigr)(1 - s)\,p$$

Therefore

$$\Psi(n, p) \le \frac{\sigma(n) + \varphi(n)}{\sigma(n) + \varphi(n)/p}
= \frac{\bigl(s + (1 - s)p\bigr)\bigl(\sigma(n) + \varphi(n)/p\bigr)}{\sigma(n) + \varphi(n)/p}
= s + (1 - s)p = p + (1 - p)s$$


Gustafson–Barsis’s Law

Given a parallel program solving a problem of size n using p processors, let s denote the fraction of the total execution time spent in serial code. The maximum achievable speedup Ψ is

$$\Psi \le p + (1 - p)s$$
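In code this is a one-liner; a minimal C sketch (illustrative, not from the slides; gustafson_bound is a name chosen here), reproducing the 64-processor example that appears below:

```c
#include <stdio.h>

/* Gustafson-Barsis's Law: scaled-speedup bound for serial
   fraction s measured on p processors. */
double gustafson_bound(double s, int p)
{
    return p + (1 - p) * s;
}

int main(void)
{
    /* s = 0.05 on p = 64: 64 + (1 - 64) * 0.05 = 60.85 */
    printf("Psi <= %.2f\n", gustafson_bound(0.05, 64));
    return 0;
}
```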


Comments about Gustafson–Barsis’s Law

Gustafson–Barsis's Law encourages solving larger problems using more processors. The speedup obtained is therefore also called scaled speedup.

If n is large enough for p processors, n is probably too large (with respect to memory) for a single processor. However, this doesn't prevent Gustafson–Barsis's Law from predicting the best achievable speedup Ψ, when s is known.

Since the parallelization overhead κ(n, p) is ignored, Gustafson–Barsis's Law may also overestimate the speedup.

Since Ψ ≤ p + (1 − p)s = p − (p − 1)s, the best achievable speedup is Ψ ≤ p. The smaller s is, the better Ψ.

If s = 1, there is no speedup at all, because Ψ ≤ p + (1 − p) = 1.


Example 1

An application executing on 64 processors spends 5% of the total time on non-parallelizable computations. What is the scaled speedup?

Solution: Since s = 0.05, the scaled speedup on 64 processors is

$$\Psi \le p + (1 - p)s = 64 + (1 - 64)(0.05) = 64 - 3.15 = 60.85$$


Example 2

If we want to achieve speedup Ψ = 15000 using p = 16384 processors, what is the maximum allowable value of the serial fraction s?

Solution: Since

$$\Psi \le p + (1 - p)s = p - (p - 1)s$$

it follows that

$$s \le \frac{p - \Psi}{p - 1} = \frac{16384 - 15000}{16384 - 1} \approx 0.084$$


Karp–Flatt Metric

Both Amdahl's Law and Gustafson–Barsis's Law ignore the parallelization overhead κ(n, p); they may therefore overestimate the achievable speedup.

We recall:

Parallel execution time: $T(n, p) = \sigma(n) + \varphi(n)/p + \kappa(n, p)$

Sequential execution time: $T(n, 1) = \sigma(n) + \varphi(n)$


Karp–Flatt Metric

If we consider the parallelization overhead as another kind of "inherently sequential work", then we can use Amdahl's Law to experimentally determine a "combined" serial fraction e, which is defined as

$$e(n, p) = \frac{\sigma(n) + \kappa(n, p)}{\sigma(n) + \varphi(n)}$$

This experimentally determined serial fraction e(n, p) may either stay constant with respect to p (meaning that the parallelization overhead is negligible) or increase with respect to p (meaning that the parallelization overhead deteriorates the speedup).


Karp–Flatt Metric

If we know the actually achieved speedup Ψ(n, p) using p processors, how can we determine the serial fraction e(n, p)?

Since

$$T(n, p) = T(n, 1)\,e + \frac{T(n, 1)(1 - e)}{p}$$

and we know the value of Ψ(n, p), which is defined as

$$\Psi(n, p) = \frac{T(n, 1)}{T(n, p)} = \frac{T(n, 1)}{T(n, 1)\,e + T(n, 1)(1 - e)/p} = \frac{1}{e + (1 - e)/p}$$

Therefore

$$\frac{1}{\Psi} = e + \frac{1 - e}{p} \quad\Longrightarrow\quad e = \frac{1/\Psi - 1/p}{1 - 1/p}$$
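The formula is easy to apply to measured speedups; a minimal C sketch (illustrative, not from the slides; karp_flatt is a name chosen here), using the speedup figures from the first benchmark example below:

```c
#include <stdio.h>

/* Karp-Flatt metric: experimentally determined serial fraction
   e from measured speedup psi on p processors (p > 1). */
double karp_flatt(double psi, int p)
{
    return (1.0 / psi - 1.0 / p) / (1.0 - 1.0 / p);
}

int main(void)
{
    /* Measured speedups for p = 2, 3, ..., 8. */
    double psi[] = {1.82, 2.50, 3.08, 3.57, 4.00, 4.38, 4.71};
    for (int p = 2; p <= 8; p++)
        printf("p = %d  Psi = %.2f  e = %.2f\n",
               p, psi[p - 2], karp_flatt(psi[p - 2], p));
    return 0;
}
```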


Example 1

Benchmarking a parallel program on 1, 2, ..., 8 processors produces the following speedup results:

p   2     3     4     5     6     7     8
Ψ   1.82  2.50  3.08  3.57  4.00  4.38  4.71

What is the primary reason for the parallel program achieving a speedup of only 4.71 on eight processors?


Example 1 (cont’d)

Solution: We can use the Karp–Flatt metric to experimentally determine the values of e(n, p):

p   2     3     4     5     6     7     8
Ψ   1.82  2.50  3.08  3.57  4.00  4.38  4.71
e   0.10  0.10  0.10  0.10  0.10  0.10  0.10

Since the experimentally determined serial fraction e is not increasing with p, the primary reason for the poor speedup is the large fraction (10%) of the computation that is inherently sequential. In other words, parallel overhead is not the reason for the poor speedup.


Example 2

Benchmarking a parallel program on 1, 2, ..., 8 processors produces the following speedup results:

p   2     3     4     5     6     7     8
Ψ   1.87  2.61  3.23  3.73  4.14  4.46  4.71

What is the primary reason for the parallel program achieving a speedup of only 4.71 on eight processors?


Example 2 (cont’d)

Solution: We can use the Karp–Flatt metric to experimentally determine the values of e(n, p):

p   2      3      4      5      6      7      8
Ψ   1.87   2.61   3.23   3.73   4.14   4.46   4.71
e   0.070  0.075  0.080  0.085  0.090  0.095  0.100

Since the experimentally determined serial fraction e is steadily increasing with p, parallel overhead is also a contributing factor to the poor speedup.


Exercises

Write an MPI program that first uses a master process to read in a JPEG image. (The existing C code collection http://heim.ifi.uio.no/xingca/inf-verk3830/simple-jpeg.tar.gz can be used.) The master process then divides the image and distributes the pieces to the other MPI processes. The purpose is to involve all the MPI processes in calculating the average pixel value of the entire image, as well as the standard deviation.

Suppose a 2D uniform grid is divided into P × Q rectangular pieces. Each piece is assigned to one MPI process, and each pair of neighboring processes (in both the x- and y-directions) needs to exchange the values on their outermost boundary-layer points. Can you make use of the two MPI functions MPI_Type_vector and MPI_Type_commit to help simplify the communication in the x-direction? (Hint: you should make use of an online MPI manual; see also the sketch below.)
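For the second exercise, here is a hedged sketch of the x-direction exchange. It assumes a row-major local array with one ghost layer on each side and, for simplicity, a 1D row of processes rather than the full P × Q decomposition; all names and sizes are illustrative:

```c
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Local grid: ny x nx interior points plus a ghost layer on each
       side, stored row-major, so a column is strided in memory. */
    const int nx = 8, ny = 8;
    const int ldx = nx + 2;   /* row length including ghost columns */
    double *u = calloc((size_t)(ny + 2) * ldx, sizeof(double));

    /* Datatype describing one column: ny values, stride ldx apart. */
    MPI_Datatype column;
    MPI_Type_vector(ny, 1, ldx, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    /* Send the last interior column right, receive into the left
       ghost column; then the mirror-image exchange. */
    MPI_Sendrecv(&u[1 * ldx + nx], 1, column, right, 0,
                 &u[1 * ldx + 0],  1, column, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&u[1 * ldx + 1],      1, column, left,  1,
                 &u[1 * ldx + nx + 1], 1, column, right, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    MPI_Type_free(&column);
    free(u);
    MPI_Finalize();
    return 0;
}
```

Because MPI_Type_vector captures the stride once, each exchange becomes a single MPI_Sendrecv instead of a hand-written pack/unpack loop.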
