Performance Evaluation of Parallel Processing
Why Performance?
Models of Speedup
Speedup
Scaled Speedup
◦ Parallel processing gain over sequential processing, where problem size scales up with computing power (having sufficient workload/parallelism)

$S_p = \dfrac{\text{Uniprocessor Execution Time}}{\text{Parallel Execution Time}}$
Speedup
Ts = time for the best serial algorithm
Tp = time for parallel algorithm using p processors

$S_p = \dfrac{T_s}{T_p}$
Example
[Figure: (a) one processor runs the program in 100 time units; (b) four processors each take 25 time units; (c) four processors each take 35 time units.]

$S_p = \dfrac{100}{25} = 4.0$ (perfect parallelization)

$S_p = \dfrac{100}{35} \approx 2.85$ (perfect load balancing, but synch cost is 10)
Example (cont.)
[Figure: (d) four processors take 30, 20, 40, and 10 time units; (e) four processors each take 50 time units.]

$S_p = \dfrac{100}{40} = 2.5$ (no synch, but load imbalance)

$S_p = \dfrac{100}{50} = 2.0$ (load imbalance and synch cost)
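A minimal Python sketch of these calculations (the helper name and structure are mine, not from the lecture): the parallel time is the slowest processor's time plus any synchronization cost.

def speedup(serial_time, proc_times, synch_cost=0):
    # Parallel execution ends when the slowest processor finishes,
    # plus whatever time is spent synchronizing.
    parallel_time = max(proc_times) + synch_cost
    return serial_time / parallel_time

print(speedup(100, [25, 25, 25, 25]))                 # (b) 4.0, perfect parallelization
print(speedup(100, [25, 25, 25, 25], synch_cost=10))  # (c) ~2.86, synch cost of 10
print(speedup(100, [30, 20, 40, 10]))                 # (d) 2.5, load imbalance
print(speedup(100, [30, 20, 40, 10], synch_cost=10))  # (e) 2.0, imbalance plus synch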
What Is “Good” Speedup?
Linear speedup: $S_p = p$
Superlinear speedup: $S_p > p$
Sub-linear speedup: $S_p < p$
[Figure: speedup plotted against the number of processors p]
Ideal Speedup in Multiprocessor System
• Linear speedup ─ the execution time of a program on an n-processor system would be 1/nth of the execution time on a one-processor system
Limitations
• Interprocessor communication
• Synchronization
• Load Balancing
Limitations of Interprocessor Communication
Whenever one processor generates (computes) a value that is needed by the fraction of the program running on another processor, that value must be communicated to the processors that need it, which takes time. On a uniprocessor system, the entire program runs on one processor, so there is no time lost to interprocessor communication.
Limitations of Synchronization
It is often necessary to synchronize the processors to ensure that they have all completed some phase of the program before any processor begins working on the next phase of the program.
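A small sketch of this phase synchronization using Python's threading.Barrier (illustrative only; a real parallel program would more likely use MPI or OpenMP):

import threading

NUM_WORKERS = 4
barrier = threading.Barrier(NUM_WORKERS)

def worker(rank):
    # Phase 1: each worker completes its share of the work.
    print(f"worker {rank}: phase 1 done")
    barrier.wait()  # no worker proceeds until all have arrived
    # Phase 2 only begins once every worker has finished phase 1.
    print(f"worker {rank}: phase 2 started")

threads = [threading.Thread(target=worker, args=(r,)) for r in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()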
Load balancing
In many parallel applications, it is difficult to divide the program evenly across the processors.
• When it is not possible for each processor to work for the same amount of time, some of the processors complete their tasks early and are then idle, waiting for the others to finish.
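One way to quantify this waste, reusing the made-up task times from panel (d) above:

task_times = [30, 20, 40, 10]            # per-processor work, as in panel (d)
finish = max(task_times)                 # everyone waits for the slowest (40)
idle = [finish - t for t in task_times]  # time each processor sits idle
print(idle, "->", sum(idle), "units of idle time")  # [10, 20, 0, 30] -> 60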
Superlinear speedups
Achieving a speedup greater than n on an n-processor system requires each of the processors in the multiprocessor to complete its fraction of the program in less than 1/nth of the program's execution time on a uniprocessor.
Factors That Limit Speedup
● Software Overhead: even with a completely equivalent algorithm, software overhead arises in the concurrent implementation.
● Load Balancing: speedup is generally limited by the speed of the slowest node, so an important consideration is to ensure that each node performs the same amount of work.
● Communication Overhead: assuming that communication and calculation cannot be overlapped, any time spent communicating data between processors directly degrades the speedup.
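A rough model of the communication point (my simplification, not the lecture's): even if the computation divides perfectly among p processors, un-overlapped communication adds straight on top of the parallel time.

def speedup_with_comm(t_serial, p, t_comm):
    # Computation divides evenly; communication cannot be overlapped,
    # so it adds directly to the parallel execution time.
    t_parallel = t_serial / p + t_comm
    return t_serial / t_parallel

print(speedup_with_comm(100, 4, 0))  # 4.0  -> ideal, no communication
print(speedup_with_comm(100, 4, 5))  # ~3.3 -> communication degrades speedup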
Degradations of Parallel Processing
Unbalanced Workload
Communication Delay
Overhead Increases with the Ensemble Size
Degradations of Distributed Computing
Unbalanced Computing Power and Workload
Shared Computing and Communication Resource
Uncertainty, Heterogeneity, and Overhead Increases with the Ensemble Size
Causes of Superlinear Speedup
• Cache size increased
• Overhead reduced
• Latency hidden
• Randomized algorithms
• Mathematical inefficiency of the serial algorithm
• Higher memory access cost in sequential processing

X.H. Sun and J. Zhu, "Performance Considerations of Shared Virtual Memory Machines," IEEE Trans. on Parallel and Distributed Systems, Nov. 1995.
Efficiency
● Speedup does not measure how efficiently the processors are being used.
● Is it worth using 100 processors to get a speedup of 2?
● Efficiency is defined as the ratio of the speedup and the number of processors required to achieve it.
● Efficiency is given by E(P, N) = S(P, N) / P
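In code this is a one-line helper (names are mine):

def efficiency(speedup, p):
    # E = S / P: the fraction of the machine's capacity actually being used
    return speedup / p

print(efficiency(2, 100))  # 0.02 -> 100 processors for a speedup of 2 is wasteful
print(efficiency(4, 5))    # 0.80 -> the 5-processor example that follows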
If the best known serial algorithm takes 8 seconds, i.e. Ts = 8, while a parallel algorithm takes 2 seconds using 5 processors, then the speedup is S = Ts / Tp = 8 / 2 = 4 and the efficiency is E = 4 / 5 = 0.8.
Say we have a program containing 100 operations, each of which takes 1 time unit. If 80 operations can be done in parallel, i.e. P = 80, and 20 operations must be done sequentially, i.e. S = 20, then using 80 processors the parallel time is 20 + 80/80 = 21 time units, giving a speedup of 100/21 ≈ 4.76.
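The same arithmetic as a sketch (variable names mine):

S, P = 20, 80                # sequential and parallelizable operations
procs = 80
t_parallel = S + P / procs   # 20 + 80/80 = 21 time units
print((S + P) / t_parallel)  # speedup = 100 / 21 ≈ 4.76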
Speedup Metrics
Three performance models based on three speedup metrics are commonly used:
• Amdahl's law -- fixed problem size
• Gustafson's law -- fixed-time speedup
• Sun-Ni's law -- memory-bounded speedup
Three approaches to scalability analysis are based on:
• Maintaining a constant efficiency,
• A constant speed, and
• A constant utilization.
Amdahl's Law
The performance improvement that can be gained by a parallel implementation is limited by the fraction of time parallelism can actually be used in an application.
Let α = fraction of the program (algorithm) that is serial and cannot be parallelized. For instance:
◦ Loop initialization
◦ Reading/writing to a single disk
◦ Procedure call overhead
Parallel run time is given by
$T_p = \left(\alpha + \dfrac{1 - \alpha}{p}\right) T_s$
Amdahl's Law
Amdahl's law gives a limit on speedup in terms of the serial fraction α:
$S_p = \dfrac{T_s}{T_p} = \dfrac{T_s}{\alpha T_s + (1 - \alpha) T_s / p} = \dfrac{1}{\alpha + (1 - \alpha)/p} \le \dfrac{1}{\alpha}$
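A direct transcription of this formula into Python (the function name is mine):

def amdahl_speedup(alpha, p):
    # alpha = serial fraction of the program; p = number of processors
    return 1.0 / (alpha + (1.0 - alpha) / p)

print(amdahl_speedup(0.05, 10))    # ≈ 6.9
print(amdahl_speedup(0.05, 1024))  # ≈ 19.6, approaching the 1/alpha = 20 limit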
• Fixed-Size Speedup (Amdahl's Law, 1967)
[Figure: with fixed-size speedup, the total amount of work (serial part W1 plus parallel part Wp) stays constant as the number of processors p grows from 1 to 5, while the elapsed time shrinks as the parallel portion Tp is divided among more processors and the serial portion T1 remains fixed.]
Consider the effect of the serial fraction F on the speedup produced for N = 10 and N = 1024.
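Using the amdahl_speedup sketch above (the F values are mine, chosen for illustration):

for F in (0.0, 0.01, 0.1, 0.5):
    s10   = 1 / (F + (1 - F) / 10)
    s1024 = 1 / (F + (1 - F) / 1024)
    print(f"F = {F}:  N=10 -> {s10:.2f}   N=1024 -> {s1024:.2f}")
# Even F = 0.01 caps the N=1024 machine at about 91x; F = 0.1 caps it near 10x.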
Comments on Amdahl's Law
The Amdahl fraction in practice depends on the problem size n and the number of processors p. An effective parallel algorithm has

$\alpha(n, p) \to 0 \quad \text{as} \quad n \to \infty$

For such a case, even if one fixes p, we can get linear speedup by choosing a suitably large problem size:

$S_p = \dfrac{T_s}{T_p} = \dfrac{p}{1 + (p - 1)\,\alpha(n, p)} \to p \quad \text{as} \quad n \to \infty$

Scalable speedup: practically, the problem size that we can run for a particular problem is limited by the time and memory of the parallel computer.
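A sketch of this effect with a made-up serial fraction α(n) = 10/n that shrinks as the problem grows (the constant is purely illustrative):

p = 64
for n in (10**2, 10**4, 10**6):
    alpha = 10.0 / n               # hypothetical: serial work stays O(1) of O(n) total
    s = p / (1 + (p - 1) * alpha)  # the formula above
    print(n, round(s, 1))          # speedup climbs toward p = 64 as n grows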
Gustafson's Law
Gustafson defined two "more relevant" notions of speedup:
» Scaled speedup
» Fixed-time speedup
» And renamed Amdahl's version as "fixed-size" speedup
Gustafson's Law: Scaling for Higher Accuracy?
In Amdahl's law, the problem size (workload) is fixed and cannot scale to match the available computing power as the machine size increases. Thus, Amdahl's law leads to diminishing returns when a larger system is employed to solve a small problem. The sequential bottleneck in Amdahl's law can be alleviated by removing the restriction of a fixed problem size. Gustafson proposed a fixed-time concept that achieves an improved speedup by scaling the problem size with the increase in machine size.
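Gustafson's fixed-time speedup is conventionally written S(p) = α + p(1 − α), where α is the serial fraction of the time measured on the parallel machine; a sketch (function name mine):

def gustafson_speedup(alpha, p):
    # alpha = serial fraction of the parallel run; the workload scales with p
    return alpha + p * (1.0 - alpha)

print(gustafson_speedup(0.05, 1024))  # ≈ 973: near-linear, unlike Amdahl's ~20x cap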