EECC756 - Shaaban — lec # 10, Spring 2000, 4-13-2000
Parallel System Performance: Evaluation & Scalability
• Factors affecting parallel system performance:
  – Algorithm-related, parallel program-related, architecture/hardware-related.
• Workload-Driven Quantitative Architectural Evaluation:
  – Select applications or a suite of benchmarks to evaluate the architecture on a real or simulated machine.
  – From measured performance results compute performance metrics:
    • Speedup, System Efficiency, Redundancy, Utilization, Quality of Parallelism.
  – Application Models of Parallel Computers: how the speedup of an application is affected subject to specific constraints:
    • Fixed-load Model.
    • Fixed-time Model.
    • Fixed-memory Model.
• Performance Scalability:
  – Definition.
  – Conditions of scalability.
  – Factors affecting scalability.
Factors Affecting Parallel System Performance
• Parallel Algorithm-related:
  – Available concurrency and profile, grain, uniformity, patterns.
  – Required communication/synchronization, uniformity and patterns.
  – Data size requirements.
  – Communication-to-computation ratio.
• Parallel program-related:
  – Programming model used.
  – Resulting data/code memory requirements, locality, and working set characteristics.
  – Parallel task grain size.
  – Assignment: dynamic or static.
  – Cost of communication/synchronization.
• Hardware/Architecture-related:
  – Total CPU computational power available.
  – Shared address space vs. message passing.
  – Communication network characteristics.
  – Memory hierarchy properties.
Performance Evaluation of Uniprocessors
• Design modification decisions are made only after quantitative performance evaluation.
• For existing systems: comparison of results.
• For future systems: careful extrapolation from known quantities.
• A wide base of programs leads to standard benchmarks:
  – Measured on a wide range of machines and successive generations.
• Measurements and technology assessment lead to proposed features.
• Simulation of new feature designs:
  – A simulator is developed that can run with and without a feature.
  – Benchmarks are run through the simulator to obtain results.
  – Together with cost and complexity, decisions are made.
Performance Isolation: Microbenchmarks
• Microbenchmarks: small, specially written programs to isolate performance characteristics:
  – Processing.
– Local memory.
– Input/output.
– Communication and remote access (read/write, send/receive)
– Synchronization (locks, barriers).
– Contention.
Types of Workloads/Benchmarks
– Kernels: matrix factorization, FFT, depth-first tree search.
– Complete Applications: ocean simulation, crew scheduling, database.
– Multiprogrammed Workloads.

• Spectrum: Multiprogrammed workloads — Applications — Kernels — Microbenchmarks
  – Toward multiprogrammed workloads and applications: realistic and complex; the higher-level interactions are what really matters.
  – Toward kernels and microbenchmarks: easier to understand, controlled, repeatable; reveal basic machine characteristics.

Each has its place:
Use kernels and microbenchmarks to gain understanding, but applications to evaluate effectiveness and performance.
Sample Workload/Benchmark Suites
• Numerical Aerodynamic Simulation (NAS):
  – Originally pencil-and-paper benchmarks.
• SPLASH/SPLASH-2:
  – Shared-address-space parallel programs.
• ParkBench:
  – Message-passing parallel programs.
• ScaLapack:
  – Message-passing kernels.
• TPC:
  – Transaction processing.
• SPEC-HPC
• . . .
Multiprocessor Simulation
• Simulation runs on a uniprocessor (can be parallelized too):
  – Simulated processes are interleaved on the processor.
• Two parts to a simulator:
  – Reference generator: plays the role of the simulated processors and schedules simulated processes based on simulated time.
  – Simulator of the extended memory hierarchy: simulates operations (references, commands) issued by the reference generator.
• Coupling or information flow between the two parts varies:
  – Trace-driven simulation: from generator to simulator.
  – Execution-driven simulation: in both directions (more accurate).
• The simulator keeps track of simulated time and detailed statistics.
Execution-Driven Simulation
• The memory hierarchy simulator returns simulated time information to the reference generator, which is used to schedule simulated processes.

[Figure: reference generator (processes P1 … Pp) coupled to a memory and interconnect simulator (caches $1 … $p and memories Mem1 … Memp joined by a network).]
Parallel Performance Metrics Revisited
• Degree of Parallelism (DOP): For a given time period, reflects the number of processors in a specific parallel computer actually executing a particular parallel program.
• Average Parallelism:
  – Given maximum parallelism = m
  – n homogeneous processors
  – Computing capacity of a single processor: Δ
  – Total amount of work W (instructions, computations):

        W = Δ ∫_{t1}^{t2} DOP(t) dt

    or, as a discrete summation:

        W = Δ Σ_{i=1}^{m} i · t_i        where Σ_{i=1}^{m} t_i = t2 − t1

    and t_i is the total time that DOP = i.
  – The average parallelism A:

        A = (1 / (t2 − t1)) ∫_{t1}^{t2} DOP(t) dt

    In discrete form:

        A = ( Σ_{i=1}^{m} i · t_i ) / ( Σ_{i=1}^{m} t_i )
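As a quick numeric sketch of the discrete formulas above — the DOP profile and the per-processor capacity Δ below are made-up illustration values, not from the lecture:

```python
# Sketch: total work W and average parallelism A from a discrete DOP
# profile. t[i] = total time (seconds) during which DOP = i; delta is
# the computing capacity of one processor (operations/second).
t = {1: 4.0, 2: 3.0, 4: 2.0, 8: 1.0}    # DOP -> time spent at that DOP
delta = 1e9                              # assumed: 1 Gop/s per processor

W = delta * sum(i * ti for i, ti in t.items())            # total work
A = sum(i * ti for i, ti in t.items()) / sum(t.values())  # average parallelism
```

Here the program spends most of its time at low DOP, so A comes out well below the maximum parallelism of 8.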
Parallel Performance Metrics Revisited
Asymptotic Speedup:

  Execution time with one processor:

      T(1) = Σ_{i=1}^{m} t_i(1) = Σ_{i=1}^{m} W_i / Δ

  Execution time with an infinite number of available processors:

      T(∞) = Σ_{i=1}^{m} t_i(∞) = Σ_{i=1}^{m} W_i / (i Δ)

  Asymptotic speedup S∞:

      S∞ = T(1) / T(∞) = ( Σ_{i=1}^{m} W_i ) / ( Σ_{i=1}^{m} W_i / i )
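A minimal numeric check of the asymptotic speedup ratio — the work profile below is made up, and Δ cancels out of the ratio so it is omitted:

```python
# Sketch: asymptotic speedup from a work profile. W[i] = amount of work
# executed at DOP = i (arbitrary units).
W = {1: 10.0, 2: 20.0, 4: 40.0}

T1 = sum(Wi for Wi in W.values())          # proportional to T(1)
Tinf = sum(Wi / i for i, Wi in W.items())  # proportional to T(inf)
S_inf = T1 / Tinf                          # asymptotic speedup
```

With this profile most of the work is highly parallel, and S∞ works out to 70/30 ≈ 2.33.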
Parallel Performance Metrics Revisited
• Arithmetic Mean Performance:
  Let R_i be the execution rates (instructions per second) of m programs or benchmarks, i = 1, 2, …, m.
  – With equal weighting:

        R_a = ( Σ_{i=1}^{m} R_i ) / m

  – With weight distribution π = {f_i | i = 1, 2, …, m}:

        R_a* = Σ_{i=1}^{m} f_i R_i

• Geometric Mean Performance:
  – With equal weighting:

        R_g = ( Π_{i=1}^{m} R_i )^{1/m}

  – With weight distribution π = {f_i | i = 1, 2, …, m}:

        R_g* = Π_{i=1}^{m} R_i^{f_i}
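A short sketch of the four means above on made-up benchmark rates (the rates and weights are illustration values only):

```python
import math

# Sketch: arithmetic and geometric mean execution rates, equal-weighted
# and weighted. R[i] are execution rates (e.g., MIPS); f sums to 1.
R = [100.0, 200.0, 400.0]
f = [0.5, 0.3, 0.2]

Ra = sum(R) / len(R)                                # arithmetic mean
Ra_w = sum(fi * Ri for fi, Ri in zip(f, R))         # weighted arithmetic
Rg = math.prod(R) ** (1.0 / len(R))                 # geometric mean
Rg_w = math.prod(Ri ** fi for fi, Ri in zip(f, R))  # weighted geometric
```

Note how the geometric mean (200) sits below the arithmetic mean (≈233) for the same rates, as it always does for unequal values.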
Harmonic Mean Performance
• Arithmetic mean execution time per instruction:

      T_a = (1/m) Σ_{i=1}^{m} T_i = (1/m) Σ_{i=1}^{m} 1/R_i

• The harmonic mean execution rate across m benchmark programs:

      R_h = 1 / T_a = m / ( Σ_{i=1}^{m} 1/R_i )

• Weighted harmonic mean execution rate with weight distribution π = {f_i | i = 1, 2, …, m}:

      R_h* = 1 / ( Σ_{i=1}^{m} f_i / R_i )

• Harmonic Mean Speedup for a program with n parallel execution modes:

      S = T_1 / T* = 1 / ( Σ_{i=1}^{n} f_i / R_i )
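A small sketch of the harmonic-mean formulas; the two-mode rate profile below is made up (mode rates normalized so the sequential mode has rate 1):

```python
# Sketch: harmonic mean rate and harmonic-mean speedup for a program
# with two execution modes. f[i] = fraction of work in mode i.
R = [1.0, 4.0]    # mode rates: sequential, and 4-processor mode
f = [0.2, 0.8]    # fraction of work executed in each mode

Rh = len(R) / sum(1.0 / Ri for Ri in R)           # harmonic mean
Rh_w = 1.0 / sum(fi / Ri for fi, Ri in zip(f, R)) # weighted harmonic mean
S = 1.0 / sum(fi / Ri for fi, Ri in zip(f, R))    # harmonic-mean speedup
```

Because mode rates are normalized to R_1 = 1, the weighted harmonic mean rate and the speedup coincide here (both 2.5).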
Harmonic Mean Speedup for n Execution Mode Multiprocessor System
Fig 3.2 page 111: see handout
Parallel Performance Metrics Revisited
Efficiency, Utilization, Redundancy, Quality of Parallelism
• System Efficiency: Let O(n) be the total number of unit operations performed by an n-processor system and T(n) be the execution time in unit time steps:
  – Speedup factor: S(n) = T(1) / T(n)
  – System efficiency for an n-processor system: E(n) = S(n)/n = T(1) / [n T(n)]
• Redundancy: R(n) = O(n) / O(1)
• Utilization: U(n) = R(n) E(n) = O(n) / [n T(n)]
• Quality of Parallelism: Q(n) = S(n) E(n) / R(n) = T³(1) / [n T²(n) O(n)]
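The definitions above chain together directly; a sketch with made-up measurements (the operation and time counts are illustration values):

```python
# Sketch: speedup, efficiency, redundancy, utilization, and quality of
# parallelism from measured unit operations O and unit time steps T.
T1, Tn = 1000.0, 150.0    # T(1), T(n)
O1, On = 1000.0, 1200.0   # O(1), O(n): parallel run does extra work
n = 8

S = T1 / Tn          # speedup S(n)
E = S / n            # efficiency E(n)
Rd = On / O1         # redundancy R(n)
U = Rd * E           # utilization U(n) = O(n) / (n T(n))
Q = S * E / Rd       # quality of parallelism Q(n)
```

In this example the 8 processors are fully busy (U = 1.0), but 20% redundant work drags the quality of parallelism below the raw speedup.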
Parallel Performance Example
• The execution time T for three parallel programs is given in terms of processor count P and problem size N.
• In each case, we assume that the total computation work performed by an optimal sequential algorithm scales as N + N².
1. For the first parallel algorithm: T = N + N²/P
   This algorithm partitions the computationally demanding O(N²) component of the algorithm but replicates the O(N) component on every processor. There are no other sources of overhead.
2. For the second parallel algorithm: T = (N + N²)/P + 100
   This algorithm optimally divides all the computation among all processors but introduces an additional cost of 100.
3. For the third parallel algorithm: T = (N + N²)/P + 0.6P²
   This algorithm also partitions all the computation optimally but introduces an additional cost of 0.6P².
• All three algorithms achieve a speedup of about 10.8 when P = 12 and N = 100. However, they behave differently in other situations, as shown next.
• With N = 100, all three algorithms perform poorly for larger P, although Algorithm (3) does noticeably worse than the other two.
• When N = 1000, Algorithm (2) is much better than Algorithm (1) for larger P.
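The three execution-time models can be checked numerically; this sketch reproduces the speedup-of-about-10.8 claim at P = 12, N = 100:

```python
# Sketch: speedups of the three parallel algorithms above, relative to
# a sequential time of N + N**2.
def speedups(N: float, P: float) -> tuple[float, float, float]:
    seq = N + N**2
    t1 = N + N**2 / P                  # Algorithm 1: replicated O(N) part
    t2 = (N + N**2) / P + 100.0        # Algorithm 2: fixed cost 100
    t3 = (N + N**2) / P + 0.6 * P**2   # Algorithm 3: cost growing as P**2
    return seq / t1, seq / t2, seq / t3
```

For example, `speedups(100, 12)` gives roughly (10.8, 10.7, 10.9), while at N = 1000 and P = 100 Algorithm 2's fixed cost is amortized and it clearly beats Algorithm 1.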
Parallel Performance Example (continued)
All algorithms achieve Speedup ≈ 10.8 when P = 12 and N = 100.
With N = 1000, Algorithm (2) performs much better than Algorithm (1) for larger P.
  Algorithm 1: T = N + N²/P
  Algorithm 2: T = (N + N²)/P + 100
  Algorithm 3: T = (N + N²)/P + 0.6P²
A Parallel Performance Measures Example
• O(1) = T(1) = n³
• O(n) = n³ + n²log₂n,  T(n) = 4n³/(n + 3)
Fig 3.4 page 114, Table 3.1 page 115, Table 3.2 page 117: see handout
Parallel Performance Metrics Revisited: Amdahl’s Law
• Harmonic Mean Speedup (i = number of processors used, R_i = i):

      S = T_1 / T* = 1 / ( Σ_{i=1}^{n} f_i / R_i )

• In the case π = {f_i for i = 1, 2, …, n} = (α, 0, 0, …, 1−α), the system is running sequential code with probability α and utilizing n processors with probability (1−α), with other processor modes not utilized.

  Amdahl’s Law:

      S_n = n / (1 + (n − 1) α)

      S → 1/α as n → ∞

  Under these conditions the best speedup is upper-bounded by 1/α.
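A minimal sketch of the closed form above (α is the sequential fraction; overhead is assumed zero):

```python
# Sketch of Amdahl's Law for the fixed-load case.
def amdahl_speedup(alpha: float, n: int) -> float:
    """S_n = n / (1 + (n - 1) * alpha); bounded above by 1/alpha."""
    return n / (1.0 + (n - 1) * alpha)
```

For α = 0.1, twelve processors yield only `amdahl_speedup(0.1, 12)` ≈ 5.7, and no processor count pushes the speedup past 1/α = 10.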
Application Models of Parallel Computers
• If the workload W or problem size s is unchanged, then:
  – The efficiency E decreases rapidly as the machine size n increases, because the overhead h increases faster than the machine size.
• The condition of a scalable parallel computer solving a scalable parallel problem exists when:
  – A desired level of efficiency is maintained by increasing the machine size and problem size proportionally.
  – In the ideal case the workload curve is a linear function of n (linear scalability in problem size).
• Application Workload Models for Parallel Computers — bounded by limited memory, limited tolerance to interprocess communication (IPC) latency, or limited I/O bandwidth:
  – Fixed-load Model: corresponds to a constant workload.
  – Fixed-time Model: constant execution time.
  – Fixed-memory Model: limited by the memory bound.
The Isoefficiency Concept
• Workload w as a function of problem size s: w = w(s)
• h: total communication and other overhead, as a function of problem size s and machine size n: h = h(s, n)
• The efficiency of a parallel algorithm implemented on a given parallel computer can be defined as:

      E = w(s) / ( w(s) + h(s, n) )

• Isoefficiency Function: E can be rewritten as E = 1 / (1 + h(s, n)/w(s)). To maintain a constant E, w(s) should grow in proportion to h(s, n), or:

      w(s) = ( E / (1 − E) ) · h(s, n)

  C = E/(1 − E) is a constant for a fixed efficiency E. The isoefficiency function is defined as follows:

      f_E(n) = C · h(s, n)

  If the workload w(s) grows as fast as f_E(n), then a constant efficiency can be maintained for the algorithm-architecture combination.
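A sketch of the constant-efficiency condition, using an assumed (hypothetical) overhead growth h(s, n) = n·log₂n and a target efficiency of 0.8:

```python
import math

# Sketch: growing the workload along the isoefficiency function keeps
# efficiency constant. The overhead model below is an assumption for
# illustration, not a property of any particular machine.
def efficiency(w: float, h: float) -> float:
    return w / (w + h)

E_target = 0.8
C = E_target / (1.0 - E_target)    # C = E/(1-E) = 4
for n in (2, 8, 64, 1024):
    h = n * math.log2(n)           # assumed overhead h(s, n)
    w = C * h                      # workload grown per f_E(n) = C*h(s,n)
    assert abs(efficiency(w, h) - E_target) < 1e-12
```

If the workload were held fixed instead, the same loop would show efficiency falling as n grows, which is the motivation for the isoefficiency function.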
Speedup Performance Laws: Fixed-Workload Speedup
When DOP = i > n (n = number of processors), the execution time of W_i is:

      t_i(n) = (W_i / (i Δ)) ⌈i/n⌉

Total execution time:

      T(n) = Σ_{i=1}^{m} (W_i / (i Δ)) ⌈i/n⌉

If DOP = i ≤ n, then t_i(n) = t_i(∞) = W_i / (i Δ).

The fixed-load speedup factor is defined as the ratio of T(1) to T(n):

      S_n = T(1) / T(n) = ( Σ_{i=1}^{m} W_i ) / ( Σ_{i=1}^{m} (W_i / i) ⌈i/n⌉ )

Let Q(n) be the total system overheads on an n-processor system:

      S_n = T(1) / ( T(n) + Q(n) ) = ( Σ_{i=1}^{m} W_i ) / ( Σ_{i=1}^{m} (W_i / i) ⌈i/n⌉ + Q(n) )

The overhead delay Q(n) is both application- and machine-dependent and difficult to obtain in closed form.
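The fixed-load speedup is easy to evaluate once the work profile is known; a sketch (the profile and overhead Q are made-up values, and Δ cancels in the ratio):

```python
import math

# Sketch: fixed-load (Amdahl-style) speedup from a work profile.
# W maps DOP i -> work W_i; Q is the total overhead in the same units.
def fixed_load_speedup(W: dict, n: int, Q: float = 0.0) -> float:
    T1 = sum(W.values())
    Tn = sum((Wi / i) * math.ceil(i / n) for i, Wi in W.items())
    return T1 / (Tn + Q)
```

For W = {1: 10, 8: 80}, four processors must execute the DOP-8 work in two rounds (⌈8/4⌉ = 2), giving speedup 3.0, while eight processors give 4.5.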
Amdahl’s Law for Fixed-Load Speedup
• For the special case where the system either operates in sequential mode (DOP = 1) or a perfect parallel mode (DOP = n), the fixed-load speedup is simplified to:

      S_n = (W_1 + W_n) / (W_1 + W_n / n)

  We assume here that the overhead factor Q(n) = 0.
  For the normalized case where W_1 + W_n = 1, with W_1 = α and W_n = 1 − α, the equation is reduced to the previously seen form of Amdahl’s Law:

      S_n = n / (1 + (n − 1) α)
Fixed-Time Speedup
• Goal: run the largest problem size possible on a larger machine with about the same execution time.

  Let m' be the maximum DOP for the scaled-up problem, and W'_i be the scaled workload with DOP = i.
  In general, W'_i > W_i for 2 ≤ i ≤ m' and W'_1 = W_1.

  The fixed-time condition T(1) = T'(n) requires:

      Σ_{i=1}^{m} W_i = Σ_{i=1}^{m'} (W'_i / i) ⌈i/n⌉ + Q(n)

  The speedup is given by S'_n = T'(1) / T'(n):

      S'_n = ( Σ_{i=1}^{m'} W'_i ) / ( Σ_{i=1}^{m'} (W'_i / i) ⌈i/n⌉ + Q(n) ) = ( Σ_{i=1}^{m'} W'_i ) / ( Σ_{i=1}^{m} W_i )
Gustafson’s Fixed-Time Speedup
• For the special fixed-time speedup case where DOP can either be 1 or n, and assuming Q(n) = 0:

      S'_n = ( Σ_{i=1}^{m'} W'_i ) / ( Σ_{i=1}^{m} W_i ) = (W'_1 + W'_n) / (W_1 + W_n) = (W_1 + n W_n) / (W_1 + W_n)

  where W'_1 = W_1 and W'_n = n W_n.

  Assuming W_1 + W_n = 1, with α = W_1 and 1 − α = W_n:

      S'_n = T(1) / T'(n) = α + n (1 − α) = n − (n − 1) α
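The closed form above is a one-liner; the sketch below contrasts it with the fixed-load case:

```python
# Sketch of Gustafson's fixed-time speedup: the parallel portion of the
# workload is scaled by n, so speedup grows nearly linearly in n.
def gustafson_speedup(alpha: float, n: int) -> float:
    """S'_n = alpha + n * (1 - alpha) = n - (n - 1) * alpha."""
    return n - (n - 1) * alpha
```

For α = 0.1 and n = 12, the fixed-time speedup is 10.9, versus only about 5.7 under the fixed-load Amdahl model with the same α: scaling the problem keeps the processors productive.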
Fixed-Memory Speedup
• Let M be the memory requirement of a given problem.
• Let W = g(M), or M = g⁻¹(W), relate workload to memory, where:

      W = Σ_{i=1}^{m} W_i          workload for sequential execution

      W* = Σ_{i=1}^{m*} W*_i       scaled workload on n nodes

  The memory bound for an active node is g⁻¹( Σ_{i=1}^{m*} W*_i ).

  The fixed-memory speedup is defined by:

      S*_n = ( Σ_{i=1}^{m*} W*_i ) / ( Σ_{i=1}^{m*} (W*_i / i) ⌈i/n⌉ + Q(n) )

  Assuming W*_n = g*(nM) = G(n) g(M) = G(n) W_n, either sequential or perfect parallelism, and Q(n) = 0:

      S*_n = (W*_1 + W*_n) / (W*_1 + W*_n / n) = (W_1 + G(n) W_n) / (W_1 + G(n) W_n / n)
Scalability Metrics
• The study of scalability is concerned with determining the degree of matching between a computer architecture and an application algorithm, and whether this degree of matching continues to hold as problem and machine sizes are scaled up.
• Basic scalability metrics affecting the scalability of the system for a given problem:
  – Machine size n          – Clock rate f
  – Problem size s          – CPU time T
  – I/O demand d            – Memory capacity m
  – Communication overhead h(s, n), where h(s, 1) = 0
  – Computer cost c
  – Programming overhead p
Parallel Scalability Metrics

[Figure: the scalability of an architecture/algorithm combination, influenced by machine size, hardware cost, CPU time, I/O demand, memory demand, programming cost, problem size, and communication overhead.]
Revised Asymptotic Speedup, Efficiency
• Revised Asymptotic Speedup:
  – s: problem size.
  – T(s, 1): minimal sequential execution time on a uniprocessor.
  – T(s, n): minimal parallel execution time on an n-processor system.
  – h(s, n): lump sum of all communication and other overheads.

      S(s, n) = T(s, 1) / ( T(s, n) + h(s, n) )

• Revised Asymptotic Efficiency:

      E(s, n) = S(s, n) / n
Parallel System Scalability
• Scalability (informal, restrictive definition):
  A system architecture is scalable if the system efficiency E(s, n) = 1 for all algorithms with any number of processors n and any problem size s.
• Scalability Definition (more formal):
  The scalability Φ(s, n) of a machine for a given algorithm is defined as the ratio of the asymptotic speedup S(s, n) on the real machine to the asymptotic speedup S_I(s, n) on the ideal realization of an EREW PRAM:

      S_I(s, n) = T(s, 1) / T_I(s, n)

      Φ(s, n) = S(s, n) / S_I(s, n) = T_I(s, n) / T(s, n)
Example: Scalability of Network Architectures for Parity Calculation
Table 3.7 page 142: see handout
Programmability vs. Scalability

[Figure: increased scalability (vertical axis) vs. increased programmability (horizontal axis); message-passing multicomputers with distributed memory rank higher in scalability, multiprocessors with shared memory rank higher in programmability, and ideal parallel computers combine both.]
MPPs Scalability Issues
• Problems:
– Memory-access latency.
– Interprocess communication complexity or synchronization overhead.
– Multi-cache inconsistency.
– Message-passing overheads.
– Low processor utilization and poor system performance for very large system sizes.
• Possible Solutions:– Low-latency fast synchronization techniques.
– Weaker memory consistency models.
– Scalable cache coherence protocols.
– Realize shared virtual memory.
– Improved software portability; standard parallel and distributed operating system support.
Scalable Machines
• Design Choices:
– Custom-designed or commodity nodes?
– Capability of node-to-network interface
– Supporting programming models?
• What does hardware scalability mean?
  – Avoids inherent design limits on resources.
– Bandwidth increases with machine size P
– Latency does not.
– Cost increases slowly with P.
Limited Scaling of a Bus
• Bus: each level of the system design is grounded in the scaling limits at the layers below and in assumptions of close coupling between components.

  Characteristic               Bus
  Physical length              ~ 1 ft
  Number of connections        fixed
  Maximum bandwidth            fixed
  Interface to comm. medium    memory interface
  Global order                 arbitration
  Protection                   virtual -> physical
  Trust                        total
  OS                           single
  Comm. abstraction            HW
Scaling of Workstations in a LAN?
• No clear limit to physical scaling; no global order; consensus difficult to achieve.
• Independent failure and restart.

  Characteristic               Bus                  LAN
  Physical length              ~ 1 ft               km
  Number of connections        fixed                many
  Maximum bandwidth            fixed                ???
  Interface to comm. medium    memory interface     peripheral
  Global order                 arbitration          ???
  Protection                   virtual -> physical  OS
  Trust                        total                none
  OS                           single               independent
  Comm. abstraction            HW                   SW
Bandwidth Scalability
• What fundamentally limits bandwidth?
  – A single set of wires.
• Must have many independent wires.
• Connect modules through switches.
• Bus vs. network switch.

[Figure: typical switch organizations — bus, multiplexers, crossbar — connecting processor/memory (P, M) modules through switches (S).]
Dancehall MP Organization
• Network bandwidth?
• Bandwidth demand?
  – Independent processes?
  – Communicating processes?
• Latency?

[Figure: processors with caches (P, $) on one side of a scalable switched network, memory modules (M) on the other.]
Generic Distributed Memory Organization
• Network bandwidth?
• Bandwidth demand?
  – Independent processes?
  – Communicating processes?
• Latency?

[Figure: nodes, each with processor, cache, memory, and communication assist (CA), connected by a scalable switched network.]
Key System Scaling Property
• Large number of independent communication paths between nodes:
  => allows a large number of concurrent transactions using different wires, initiated independently.
• No global arbitration.
• The effect of a transaction is only visible to the nodes involved:
  – Effects are propagated through additional transactions.
Network Latency Scaling Example
• Max distance: log n hops
• Number of switches: ~ n log n
• Overhead = 1 µs, BW = 64 MB/s, 200 ns per hop
• Pipelined:
  T64(128)   = 1.0 µs + 2.0 µs + 6 hops × 0.2 µs/hop  = 4.2 µs
  T1024(128) = 1.0 µs + 2.0 µs + 10 hops × 0.2 µs/hop = 5.0 µs
• Store and Forward:
  T64sf(128)   = 1.0 µs + 6 hops × (2.0 + 0.2) µs/hop  = 14.2 µs
  T1024sf(128) = 1.0 µs + 10 hops × (2.0 + 0.2) µs/hop = 23 µs
(The 2.0 µs term is the transfer time of a 128-byte message at 64 MB/s.)
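The latency model above can be sketched as a small function (the network parameters are the ones stated on the slide: 1 µs overhead, 64 MB/s links, 0.2 µs per hop, log₂n hops):

```python
import math

# Sketch: one-way message latency (microseconds) in an n-node network
# with log2(n) hops, under cut-through (pipelined) vs. store-and-forward.
def latency_us(n_nodes: int, msg_bytes: int, cut_through: bool) -> float:
    overhead = 1.0                 # fixed software overhead, us
    transfer = msg_bytes / 64.0    # us at 64 MB/s (64 bytes per us)
    hops = int(math.log2(n_nodes))
    if cut_through:                # pay the transfer time only once
        return overhead + transfer + hops * 0.2
    return overhead + hops * (transfer + 0.2)   # pay it at every hop
```

This reproduces the slide's numbers: 4.2 µs and 5.0 µs pipelined, versus 14.2 µs and 23 µs store-and-forward, showing why cut-through routing scales so much better with hop count.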
Cost Scaling
• cost(p, m) = fixed cost + incremental cost(p, m)
• Bus-based SMP?
• Ratio of processors : memory : network : I/O?
• Parallel efficiency(p) = Speedup(p) / p
• Costup(p) = Cost(p) / Cost(1)
• Cost-effective: speedup(p) > costup(p)
• Is super-linear speedup needed for cost-effectiveness? No: because of the fixed cost, costup(p) grows more slowly than p, so speedup(p) can exceed costup(p) without being super-linear.
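A sketch of the cost-effectiveness test; the cost model (a large fixed component plus a per-processor increment) is a hypothetical illustration, not data from the lecture:

```python
# Sketch: cost-effectiveness check, speedup(p) > costup(p).
def costup(p: int, fixed: float = 100.0, per_proc: float = 10.0) -> float:
    """Cost(p)/Cost(1) under an assumed fixed-plus-incremental model."""
    return (fixed + per_proc * p) / (fixed + per_proc)

def amdahl(alpha: float, p: int) -> float:
    return p / (1.0 + (p - 1) * alpha)
```

With α = 0.05 and p = 16, the speedup is about 9.1 while the costup is only about 2.4: far from linear speedup, yet clearly cost-effective.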
nCUBE/2 Machine Organization
• Entire machine synchronous at 40 MHz.

[Figure: single-chip node (MMU, instruction fetch and decode, 64-bit integer and IEEE floating-point units, operand cache, DRAM interface, DMA channels, router), the basic module, and the hypercube network configuration.]
CM-5 Machine Organization

[Figure: diagnostics, control, and data networks connecting control processors, processing partitions, and an I/O partition; each node has a SPARC processor on an MBUS, an FPU, cache controller with SRAM, a network interface (NI), and DRAM banks with controllers and vector units.]