Performance Evaluation of Parallel Processing
Why Performance?
Models of Speedup
Speedup
Scaled Speedup
◦ Parallel processing gain over sequential processing, where problem size scales up with computing power (having sufficient workload/parallelism)

$S_p = \dfrac{\text{Uniprocessor Execution Time}}{\text{Parallel Execution Time}}$
Speedup
Ts = time for the best serial algorithm
Tp = time for parallel algorithm using p processors

$S_p = \dfrac{T_s}{T_p}$
Example
[Figure: (a) one processor runs the program in 100 time units; (b) four processors each take 25 time units; (c) four processors each take 35 time units.]

$S_p = \dfrac{100}{25} = 4.0$ (perfect parallelization)

$S_p = \dfrac{100}{35} \approx 2.85$ (perfect load balancing, but synch cost is 10)
Example (cont.)
[Figure: (d) four processors take 30, 20, 40, and 10 time units; (e) four processors each take 50 time units.]

$S_p = \dfrac{100}{40} = 2.5$ (no synch, but load imbalance)

$S_p = \dfrac{100}{50} = 2.0$ (load imbalance and synch cost)
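A minimal Python sketch of these calculations (the helper name and structure are mine, not from the lecture): the parallel time is the slowest processor's time plus any synchronization cost.

def speedup(serial_time, proc_times, synch_cost=0):
    # Parallel execution ends when the slowest processor finishes,
    # plus whatever time is spent synchronizing.
    parallel_time = max(proc_times) + synch_cost
    return serial_time / parallel_time

print(speedup(100, [25, 25, 25, 25]))                 # (b) 4.0, perfect parallelization
print(speedup(100, [25, 25, 25, 25], synch_cost=10))  # (c) ~2.86, synch cost of 10
print(speedup(100, [30, 20, 40, 10]))                 # (d) 2.5, load imbalance
print(speedup(100, [30, 20, 40, 10], synch_cost=10))  # (e) 2.0, imbalance plus synch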
What Is “Good” Speedup?
Linear speedup: $S_p = p$
Superlinear speedup: $S_p > p$
Sub-linear speedup: $S_p < p$
[Figure: speedup plotted against the number of processors p]
Ideal Speedup in Multiprocessor System
• Linear speedup ─ the execution time of a program on an n-processor system would be 1/nth of the execution time on a one-processor system
Limitations
• Interprocessor communication
• Synchronization
• Load Balancing
Limitations of Interprocessor Communication
Whenever one processor generates (computes) a value that is needed by the fraction of the program running on another processor, that value must be communicated to the processors that need it, which takes time. On a uniprocessor system, the entire program runs on one processor, so there is no time lost to interprocessor communication.
Limitations of Synchronization
It is often necessary to synchronize the processors to ensure that they have all completed some phase of the program before any processor begins working on the next phase of the program.
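A small sketch of this phase synchronization using Python's threading.Barrier (illustrative only; a real parallel program would more likely use MPI or OpenMP):

import threading

NUM_WORKERS = 4
barrier = threading.Barrier(NUM_WORKERS)

def worker(rank):
    # Phase 1: each worker completes its share of the work.
    print(f"worker {rank}: phase 1 done")
    barrier.wait()  # no worker proceeds until all have arrived
    # Phase 2 only begins once every worker has finished phase 1.
    print(f"worker {rank}: phase 2 started")

threads = [threading.Thread(target=worker, args=(r,)) for r in range(NUM_WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()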
Load balancing
In many parallel applications, it is difficult to divide the program evenly across the processors.
• When it is not possible for each processor to work for the same amount of time, some of the processors complete their tasks early and are then idle, waiting for the others to finish.
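One way to quantify this waste, reusing the made-up task times from panel (d) above:

task_times = [30, 20, 40, 10]            # per-processor work, as in panel (d)
finish = max(task_times)                 # everyone waits for the slowest (40)
idle = [finish - t for t in task_times]  # time each processor sits idle
print(idle, "->", sum(idle), "units of idle time")  # [10, 20, 0, 30] -> 60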
Superlinear speedups
Achieving a speedup greater than n on an n-processor system requires each of the processors in the multiprocessor to complete its fraction of the program in less than 1/nth of the program's execution time on a uniprocessor.
Factors That Limit Speedup
● Software Overhead: even with a completely equivalent algorithm, software overhead arises in the concurrent implementation.
● Load Balancing: speedup is generally limited by the speed of the slowest node, so an important consideration is to ensure that each node performs the same amount of work.
● Communication Overhead: assuming that communication and calculation cannot be overlapped, any time spent communicating data between processors directly degrades the speedup.
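A rough model of the communication point (my simplification, not the lecture's): even if the computation divides perfectly among p processors, un-overlapped communication adds straight on top of the parallel time.

def speedup_with_comm(t_serial, p, t_comm):
    # Computation divides evenly; communication cannot be overlapped,
    # so it adds directly to the parallel execution time.
    t_parallel = t_serial / p + t_comm
    return t_serial / t_parallel

print(speedup_with_comm(100, 4, 0))  # 4.0  -> ideal, no communication
print(speedup_with_comm(100, 4, 5))  # ~3.3 -> communication degrades speedup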
Degradations of Parallel Processing
Unbalanced Workload
Communication Delay
Overhead Increases with the Ensemble Size
Degradations of Distributed Computing
Unbalanced Computing Power and Workload
Shared Computing and Communication Resource
Uncertainty, Heterogeneity, and Overhead Increases with the Ensemble Size
Causes of Superlinear Speedup
• Cache size increased
• Overhead reduced
• Latency hidden
• Randomized algorithms
• Mathematical inefficiency of the serial algorithm
• Higher memory access cost in sequential processing

X.H. Sun and J. Zhu, "Performance Considerations of Shared Virtual Memory Machines," IEEE Trans. on Parallel and Distributed Systems, Nov. 1995.
Efficiency
● Speedup does not measure how efficiently the processors are being used.
● Is it worth using 100 processors to get a speedup of 2?
● Efficiency is defined as the ratio of the speedup and the number of processors required to achieve it.
● Efficiency is given by E(P, N) = S(P, N) / P
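In code this is a one-line helper (names are mine):

def efficiency(speedup, p):
    # E = S / P: the fraction of the machine's capacity actually being used
    return speedup / p

print(efficiency(2, 100))  # 0.02 -> 100 processors for a speedup of 2 is wasteful
print(efficiency(4, 5))    # 0.80 -> the 5-processor example that follows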
If the best known serial algorithm takes 8 seconds, i.e. Ts = 8, while a parallel algorithm takes 2 seconds using 5 processors, then the speedup is S = Ts / Tp = 8 / 2 = 4 and the efficiency is E = 4 / 5 = 0.8.
Say we have a program containing 100 operations, each of which takes 1 time unit. If 80 operations can be done in parallel, i.e. P = 80, and 20 operations must be done sequentially, i.e. S = 20, then using 80 processors the parallel time is 20 + 80/80 = 21 time units, giving a speedup of 100/21 ≈ 4.76.
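The same arithmetic as a sketch (variable names mine):

S, P = 20, 80                # sequential and parallelizable operations
procs = 80
t_parallel = S + P / procs   # 20 + 80/80 = 21 time units
print((S + P) / t_parallel)  # speedup = 100 / 21 ≈ 4.76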
Speedup Metrics
Three performance models based on three speedup metrics are commonly used:
• Amdahl's law -- fixed problem size
• Gustafson's law -- fixed-time speedup
• Sun-Ni's law -- memory-bounded speedup
Three approaches to scalability analysis are based on:
• Maintaining a constant efficiency,
• A constant speed, and
• A constant utilization.
Amdahl's Law
The performance improvement that can be gained by a parallel implementation is limited by the fraction of time parallelism can actually be used in an application.
Let α = fraction of the program (algorithm) that is serial and cannot be parallelized. For instance:
◦ Loop initialization
◦ Reading/writing to a single disk
◦ Procedure call overhead
Parallel run time is given by
$T_p = \left(\alpha + \dfrac{1 - \alpha}{p}\right) T_s$
Amdahl's Law
Amdahl's law gives a limit on speedup in terms of the serial fraction α:
$S_p = \dfrac{T_s}{T_p} = \dfrac{T_s}{\alpha T_s + (1 - \alpha) T_s / p} = \dfrac{1}{\alpha + (1 - \alpha)/p} \le \dfrac{1}{\alpha}$
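A direct transcription of this formula into Python (the function name is mine):

def amdahl_speedup(alpha, p):
    # alpha = serial fraction of the program; p = number of processors
    return 1.0 / (alpha + (1.0 - alpha) / p)

print(amdahl_speedup(0.05, 10))    # ≈ 6.9
print(amdahl_speedup(0.05, 1024))  # ≈ 19.6, approaching the 1/alpha = 20 limit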
• Fixed-Size Speedup (Amdahl's Law, 1967)
[Figure: with fixed-size speedup, the total amount of work (serial part W1 plus parallel part Wp) stays constant as the number of processors p grows from 1 to 5, while the elapsed time shrinks as the parallel portion Tp is divided among more processors and the serial portion T1 remains fixed.]
Consider the effect of the serial fraction F on the speedup produced for N = 10 and N = 1024.
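Using the amdahl_speedup sketch above (the F values are mine, chosen for illustration):

for F in (0.0, 0.01, 0.1, 0.5):
    s10   = 1 / (F + (1 - F) / 10)
    s1024 = 1 / (F + (1 - F) / 1024)
    print(f"F = {F}:  N=10 -> {s10:.2f}   N=1024 -> {s1024:.2f}")
# Even F = 0.01 caps the N=1024 machine at about 91x; F = 0.1 caps it near 10x.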
Comments on Amdahl's Law
The Amdahl fraction in practice depends on the problem size n and the number of processors p. An effective parallel algorithm has

$\alpha(n, p) \to 0 \quad \text{as} \quad n \to \infty$

For such a case, even if one fixes p, we can get linear speedup by choosing a suitably large problem size:

$S_p = \dfrac{T_s}{T_p} = \dfrac{p}{1 + (p - 1)\,\alpha(n, p)} \to p \quad \text{as} \quad n \to \infty$

Scalable speedup: practically, the problem size that we can run for a particular problem is limited by the time and memory of the parallel computer.
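A sketch of this effect with a made-up serial fraction α(n) = 10/n that shrinks as the problem grows (the constant is purely illustrative):

p = 64
for n in (10**2, 10**4, 10**6):
    alpha = 10.0 / n               # hypothetical: serial work stays O(1) of O(n) total
    s = p / (1 + (p - 1) * alpha)  # the formula above
    print(n, round(s, 1))          # speedup climbs toward p = 64 as n grows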
Gustafson's Law
Gustafson defined two "more relevant" notions of speedup:
» Scaled speedup
» Fixed-time speedup
» And renamed Amdahl's version as "fixed-size" speedup
Gustafson's Law: Scaling for Higher Accuracy?
In Amdahl's law, the problem size (workload) is fixed and cannot scale to match the available computing power as the machine size increases. Thus, Amdahl's law leads to diminishing returns when a larger system is employed to solve a small problem. The sequential bottleneck in Amdahl's law can be alleviated by removing the restriction of a fixed problem size. Gustafson proposed a fixed-time concept that achieves an improved speedup by scaling the problem size with the increase in machine size.
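Gustafson's fixed-time speedup is conventionally written S(p) = α + p(1 − α), where α is the serial fraction of the time measured on the parallel machine; a sketch (function name mine):

def gustafson_speedup(alpha, p):
    # alpha = serial fraction of the parallel run; the workload scales with p
    return alpha + p * (1.0 - alpha)

print(gustafson_speedup(0.05, 1024))  # ≈ 973: near-linear, unlike Amdahl's ~20x cap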