
Programming for Performance

Prof. Dr. Michael Gerndt

Lehrstuhl für Rechnertechnik und Rechnerorganisation / Parallelrechnerarchitektur

2

Speedup Limited by Overheads

Speedup < Sequential Work / Max (Work + Synch Time + Comm Cost + Extra Work)

3

Load Balance

• Limit on speedup:

• Work includes data access and other costs

• Not just equal work, but must be busy at same time

• Four parts to load balance

1. Identify enough concurrency

2. Decide how to manage it

3. Determine the granularity at which to exploit it

4. Reduce serialization

Speedup(p) ≤ Sequential Work / Max (Work on any processor)

4

Reducing Synch Time

• Reduce wait time due to load imbalance

• Reduce synchronization overhead

5

Reducing Synchronization Overhead

• Event synchronization

• Reduce use of conservative synchronization

– e.g. point-to-point instead of barriers, or granularity of pt-to-pt

• But fine-grained synch more difficult to program, more synch ops.

• Mutual exclusion

• Separate locks for separate data

– e.g. locking records in a database: lock per process, record, or field

– lock per task in task queue, not per queue

– finer grain => less contention/serialization, more space, less reuse

• Smaller, less frequent critical sections

– don't do reading/testing in critical section, only modification

– e.g. searching for task to dequeue in task queue, building tree (see the sketch below)
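
A minimal sketch of the last two points in C with pthreads (the queue, lock, and task names are illustrative, not from the slides): only the modification of the queue, i.e. the dequeue itself, happens inside the critical section, while the expensive work on the task is done outside it.

#include <pthread.h>
#include <stdlib.h>

typedef struct task { struct task *next; int payload; } task_t;

static task_t *queue = NULL;                            /* shared task queue */
static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;

static void process(task_t *t) { (void)t; /* expensive work, done outside the lock */ }

static void *worker(void *arg) {
    (void)arg;
    for (;;) {
        /* Critical section contains only the modification (the dequeue). */
        pthread_mutex_lock(&queue_lock);
        task_t *t = queue;
        if (t) queue = t->next;
        pthread_mutex_unlock(&queue_lock);

        if (!t) break;      /* queue empty: done */
        process(t);         /* reading/computation outside the critical section */
        free(t);
    }
    return NULL;
}

int main(void) {
    for (int i = 0; i < 8; i++) {           /* single-threaded setup, no lock needed */
        task_t *t = malloc(sizeof *t);
        t->payload = i; t->next = queue; queue = t;
    }
    pthread_t w1, w2;
    pthread_create(&w1, NULL, worker, NULL);
    pthread_create(&w2, NULL, worker, NULL);
    pthread_join(w1, NULL);
    pthread_join(w2, NULL);
    return 0;
}

Keeping the critical section this small reduces contention and serialization, in the same spirit as the finer-grained locking discussed above.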

6

Implications of Load Balance/Synchronization

• Extends speedup limit expression to:

• Generally, responsibility of software

• Architecture can support task stealing and synchronization efficiently

• Fine-grained communication, low-overhead access to queues

– efficient support allows smaller tasks, better load balance

• Accessing shared data in the presence of task stealing

– need to access data of stolen tasks

– Hardware shared address space advantageous

Speedup(p) ≤ Sequential Work / Max (Work + Synch Time)

7

Reducing Inherent Communication

• Communication is expensive!

• Measure: communication to computation ratio

• Focus here on inherent communication

• Determined by assignment of tasks to processes

• Actual communication can be greater

• Assign tasks that access same data to same process

• Solving communication and load balance NP-hard in general case

• But simple heuristic solutions work well in practice

• Applications have structure!

Speedup < Sequential Work / Max (Work + Synch Time + Comm Cost)

8

Reducing Extra Work

• Common sources of extra work:

• Computing a good partition

• Using redundant computation to avoid communication

• Task, data and process management overhead

– applications, languages, runtime systems, OS

• Imposing structure on communication

– coalescing messages, allowing effective naming

• Architectural implications:

• Reduce need by making communication and orchestration efficient

Speedup < Sequential Work / Max (Work + Synch Wait Time + Comm Cost + Extra Work)

9

A Lot Depends on Sizes

• Application parameters and no. of procs affect inherent properties

• Load balance, communication, extra work, temporal and spatial locality

• Memory hierarchy

• Interactions with organization parameters of extended memory hierarchy affect artifactual communication and performance

• Effects often dramatic, sometimes small: application-dependent

10

A Lot Depends on Sizes

[Figure: two panels of speedup vs. number of processors (1-31); left: Ocean with problem sizes N = 130, 258, 514, 1,026; right: Barnes-Hut with curves Origin 16 K / 64 K / 512 K and Challenge 16 K / 512 K]

11

Measuring Performance

• Absolute performance

• Performance = Work / Time

• Most important to end user

• Performance improvement due to parallelism

• Speedup(p) = Performance(p) / Performance(1)

• Both should be measured

• Work is determined by input configuration of the problem

• If work is fixed, can measure performance as 1/Time

– Or retain explicit work measure (e.g. transactions/sec, bonds/sec)

– Still w.r.t. particular configuration, and still what's measured is time

• Speedup(p) = Time(1) / Time(p) or Operations Per Second (p) / Operations Per Second (1)
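
A minimal sketch of such a measurement in C with OpenMP (the vector-sum kernel, the problem size, and all names are illustrative only): the same fixed problem is timed once with 1 thread and once with p threads, and Speedup(p) = Time(1) / Time(p) is reported.

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

#define N 20000000L          /* fixed work: the input configuration does not change */

/* Runs the kernel with the given number of threads and returns the wall-clock time. */
static double run(const double *x, int threads, double *sum_out) {
    double sum = 0.0;
    double t0 = omp_get_wtime();
    #pragma omp parallel for num_threads(threads) reduction(+ : sum)
    for (long i = 0; i < N; i++)
        sum += x[i];
    *sum_out = sum;
    return omp_get_wtime() - t0;
}

int main(int argc, char **argv) {
    int p = (argc > 1) ? atoi(argv[1]) : 4;
    double *x = malloc(N * sizeof *x);
    for (long i = 0; i < N; i++) x[i] = 1.0;

    double s1, sp;
    double t1 = run(x, 1, &s1);     /* Time(1): same problem, one thread */
    double tp = run(x, p, &sp);     /* Time(p): same problem, p threads  */

    printf("sum=%.0f  Time(1)=%.3fs  Time(%d)=%.3fs  Speedup(%d)=%.2f\n",
           s1 + sp, t1, p, tp, p, t1 / tp);
    free(x);
    return 0;
}

Compile with OpenMP enabled (e.g. gcc -O2 -fopenmp). The memory-bound kernel also illustrates that measured speedup depends on the memory hierarchy, not only on the division of work.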

12

Scaling: Why Worry?

• Fixed problem size is of limited usefulness

• Too small a problem:

• May be appropriate for small machine

• Parallelism overheads begin to dominate benefits for larger machines

– Load imbalance

– Communication to computation ratio

• May even achieve slowdowns

• Doesn't reflect real usage, and inappropriate for large machines

– Can exaggerate benefits of architectural improvements, especially when measured as percentage improvement in performance

• Too large a problem

• Difficult to measure improvement (next)

13

Too Large a Problem

• Suppose problem realistically large for big machine

• May not “fit” in small machine

• Can’t run

• Thrashing to disk

• Working set doesn’t fit in cache

• Fits at some p, leading to superlinear speedup

• Finally, users want to scale problems as machines grow

• Scaling can help avoid these problems

14

Demonstrating Scaling Problems

• Small Ocean and big equation solver problems on SGI Origin2000

[Figure: speedup vs. number of processors (1-31) on the SGI Origin2000; left: Ocean, 258 x 258, compared with ideal; right: grid solver, 12 K x 12 K, compared with ideal]

15

Questions in Scaling

• Under what constraints to scale the application?

• What are the appropriate metrics for performance improvement?

– work is not fixed any more, so time not enough

• How should the application be scaled?

• Definitions:

• Scaling a machine: Can scale power in many ways

– Assume adding identical nodes, each bringing memory

• Problem size: Vector of input parameters, e.g. N = (n, q, Δt)

– Determines work done

– Distinct from data set size and memory usage

– Start by assuming it’s only one parameter n, for simplicity

16

Scaling Models

• Problem constrained (PC)

• Memory constrained (MC)

• Time constrained (TC)

17

Problem Constrained Scaling

• User wants to solve same problem, only faster

• Video compression

• Computer graphics

• VLSI routing

• But limited when evaluating larger machines

SpeedupPC(p) = Time(1) / Time(p)

18

Time Constrained Scaling

• Execution time is kept fixed as system scales

• Example: User has fixed time to use machine or wait for result

• Performance = Work/Time as usual, and time is fixed, so speedup reduces to the increase in work (see the formula below)

• How to measure work(p)?

• Execution time on a single processor? (thrashing problems)

• The work metric should be easy to measure, ideally analytical.

• Should scale linearly with sequential complexity

– Or ideal speedup will not be linear in p (e.g. no. of rows, no. of points, no. of operations in matrix program)

• If we cannot find an intuitive application measure, as often true, measure execution time with ideal memory system on a uniprocessor.

SpeedupTC(p) = Work(p) / Work(1)

19

Memory Constrained Scaling (1)

• Scale so memory usage per processor stays fixed

• Speedup cannot be defined as Time(1) / Time(p) for the scaled-up problem, since Time(1) is hard to measure and inappropriate

• Inserting performance = work/time into the speedup formula gives:

SpeedupMC(p) = [Work(p) / Time(p)] / [Work(1) / Time(1)] = Increase in Work / Increase in Time

20

Memory Constrained Scaling (2)

• MC scaling can lead to large increases in execution time

• If work grows faster than linearly in memory usage

• e.g. matrix factorization with complexity n³

– 10,000-by-10,000 matrix takes 800 MB and 1 hour on a uniprocessor

– With 1,000 processors, can run a 320K-by-320K matrix, but ideal parallel time grows to 32 hours!

– With 10,000 processors, 100 hours ...
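
A short worked check of these numbers, assuming memory grows with n² and work with n³ (so memory-constrained scaling lets n grow with the square root of p):

\[
n(p) = \sqrt{p}\,n(1), \qquad
\mathrm{Work}(p) = p^{3/2}\,\mathrm{Work}(1), \qquad
\mathrm{Time}_{\mathrm{ideal}}(p) = \frac{\mathrm{Work}(p)}{p} = \sqrt{p}\,\mathrm{Time}(1)
\]

For p = 1,000: n ≈ 31.6 · 10,000 ≈ 320K and ideal time ≈ 31.6 · 1 h ≈ 32 h; for p = 10,000: ideal time = 100 · 1 h = 100 h, matching the figures above.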

21

Scaling Down Problem Parameters

• Some parameters don't affect parallel performance much, but do affect runtime, and can be scaled down

• Common example is no. of time-steps in many scientific applications

– need a few to allow settling down, but don’t need more

– may need to omit cold-start when recording time and statistics

• First look for such parameters

• But many application parameters affect key characteristics

• Scaling them down requires scaling down no. of processors too

• Otherwise can obtain highly unrepresentative behavior

22

Difficulties in Scaling N, p Representatively

• Want to preserve many aspects of full-scale scenario

• Distribution of time in different phases

• Key behavioral characteristics

• Scaling relationships among application parameters

• Contention and communication patterns

• Can’t really hope for full representativeness, but can

• Cover range of realistic operating points

• Avoid unrealistic scenarios

• Gain insights and estimates of performance

23

Performance Analysis Process

[Diagram: program tuning cycle — after coding, performance analysis iterates through measurement, analysis, ranking, and refinement until the program goes into production]

24

Performance Prediction and Benchmarking

• Performance analysis determines the performance on a given machine.

• Performance prediction allows programs to be evaluated for a hypothetical machine. It is based on:

• runtime data of an actual execution

• machine model of the target machine

• analytical techniques

• simulation techniques

• Benchmarking determines the performance of a computer system on the basis of a set of typical applications.

25

Overhead Analysis

• How to decide whether a code performs well:

• Comparison of measured MFLOPS with peak performance

• Comparison with a sequential version

• Estimate distance to ideal time via overhead classes

– tmem

– tcomm

– tsync

– tred

– ...

[Figure: measured vs. ideal speedup over #processors; the gap is attributed to overhead classes such as t_mem, t_comm, t_red]

speedup(p) = t_s / t_p (sequential time over parallel time)

26

The Basics

• Successful tuning is a combination of

• right algorithms and libraries

• compiler flags and directives

• thinking!

• Measurement is better than guessing:

• to determine performance problems

• to validate tuning decisions and optimization

• Measurement should be repeated after each significant code modification and optimization

27

The Basics

• Do I have a performance problem at all?

• Compare MFlops/MOps to typical rate

• Speedup measurements

• What are the hot code regions?

• Flat profiling

• Is there a bottleneck in those regions?

• Single node: Hardware counter profiling

• Parallel: Synchronization and communication analysis profiling

• Does the bottleneck vary over time or processor space?

• Profiling individual processes and/or threads

• Tracing

• Does the code behave similarly for different configurations?

• Analyze runs with different processor counts

• Analyze different input configurations

28

Performance Analysis

[Diagram: analysis cycle of instrumentation, execution, analysis, and refinement; the current hypotheses define the measurement requirements, execution produces performance data, and analysis yields detected bottlenecks]

Instr: Dat → ISpec

Info: Hyp → Dat

Prove: Hyp × Dat → {T, F}

Refine: Hyp → P(Hyp)

29

Performance Measurement Techniques

• Event model of the execution

• Events occur at a processor at a specific point in time

• Events belong to event types

– clock cycles

– cache misses

– remote references

– start of a send operation

– ...

• Profiling: Recording accumulated performance data for events

• Sampling: Statistical approach

• Instrumentation: Precise measurement

• Tracing: Recording performance data of individual events

30

Statistical Sampling

[Diagram: a program with functions Main, Asterix, and Obelix executes on a CPU that provides a program counter and cycle, cache miss, and flop counters; an interrupt every 10 ms maps the program counter to the current function, adds the counter values to that function's entry in the function table (Main, Asterix, Obelix), and resets the counters]
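
A minimal sketch of this idea in C (assumptions: a POSIX system with setitimer/SIGPROF; instead of reading the program counter, the currently running function is tracked in a variable, which is a simplification made up for this example):

#include <signal.h>
#include <stdio.h>
#include <sys/time.h>

enum { FN_MAIN, FN_ASTERIX, FN_OBELIX, FN_COUNT };
static volatile sig_atomic_t current_fn = FN_MAIN;   /* stand-in for the program counter */
static volatile long samples[FN_COUNT];              /* the "function table" of hit counts */

/* Interrupt handler: charge one sample to whatever function is running. */
static void on_profile_tick(int sig) { (void)sig; samples[current_fn]++; }

static void asterix(void) { current_fn = FN_ASTERIX; for (volatile long i = 0; i < 10000000; i++); current_fn = FN_MAIN; }
static void obelix(void)  { current_fn = FN_OBELIX;  for (volatile long i = 0; i < 50000000; i++); current_fn = FN_MAIN; }

int main(void) {
    signal(SIGPROF, on_profile_tick);
    struct itimerval it = { {0, 10000}, {0, 10000} };   /* 10 ms period, as in the diagram */
    setitimer(ITIMER_PROF, &it, NULL);

    for (int i = 0; i < 20; i++) { asterix(); obelix(); }

    printf("Main %ld  Asterix %ld  Obelix %ld\n",
           samples[FN_MAIN], samples[FN_ASTERIX], samples[FN_OBELIX]);
    return 0;
}

A real sampling profiler instead reads the interrupted program counter from the signal context and maps it to a function via the symbol table; hardware counters such as cache misses or flops can be sampled in the same way.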

31

Instrumentation and Monitoring

Function Obelix (...)
  call monitor("Obelix", "enter")
  ...
  call monitor("Obelix", "exit")
end Obelix

monitor(routine, location)
  if ("enter") then
    ...
  else
    ...
  end if

[Diagram: the monitor reads the CPU's cache miss counter and accumulates the values in a function table with entries for Main, Asterix, and Obelix]

32

Instrumentation Techniques

• Source code instrumentation

• done by the compiler, source-to-source tool, or manually

– portability

– link back to source code easy

– re-compile necessary when instrumentation is changed

– difficult to instrument mixed-code applications

– cannot instrument system or 3rd party libraries or executables

• Binary instrumentation

• „patching“ the executable to insert hooks (like a debugger)

– inverse pros/cons

• Offline

• Online

33

Instrumentation Tools

• Standard compilers

• Add callbacks for profiling functions

• Typically at function level

• Be careful of overhead for frequently called functions

• gcc, for example, adds calls if -finstrument-functions is given (a sketch of the hooks follows below).

• OPARI

• Jülich Supercomputing Center

• OpenMP for C and FORTRAN

• Source-level instrumentation (see exercise)

• PMPI interface

• Library interposition

• Link own library before real library, e.g. frequently used for own malloc function.
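
As an illustration of the gcc route mentioned above, a minimal sketch of the two hooks that gcc calls at every function entry and exit when code is compiled with -finstrument-functions (the hook names and signatures are the ones gcc defines; the logging is illustrative):

#include <stdio.h>

/* The hooks themselves must not be instrumented, or they would call themselves. */
__attribute__((no_instrument_function))
void __cyg_profile_func_enter(void *this_fn, void *call_site) {
    fprintf(stderr, "enter %p (called from %p)\n", this_fn, call_site);
}

__attribute__((no_instrument_function))
void __cyg_profile_func_exit(void *this_fn, void *call_site) {
    (void)call_site;
    fprintf(stderr, "exit  %p\n", this_fn);
}

Build with, e.g., gcc -finstrument-functions app.c hooks.c. A real monitor would accumulate times or counter values in a function table rather than print, since printing at every call would dominate the overhead warned about above.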

34

Instrumentation Tools

• TAU Generic Instrumenter

• Parsers for C++, FORTRAN, UPC,…

• Creation of PDT (Program Database Toolkit)

• Approach

– Specify which string to insert before and after certain regions

– Use provided variables to access file and line information

• Limited program region types

• tau.uoregon.edu

• OMPT

• Proposal for profiling API

• Based on callbacks

35

Source Code Transformation Tools

• Rose

• rosecompiler.org, LLNL

• LLVM

• Language independent code optimizer and code generator

• http://www.llvm.org/, Univ. Illinois

• Clang C frontend for LLVM, http://clang.llvm.org/

• C/C++ and Objective C/C++

• Open64

• www.open64.net

• Compiler infrastructure based originally on the SGI compiler.

• Interprocedural and loop optimizations

36

Binary Instrumentation Tools

• Dyninst

• Dynamic instrumentation on binary level

• Context of Paradyn project

• Univ. Wisconsin-Madison, Maryland

• Bart Miller, Jeff Hollingsworth

• Intel Pin

• Intel for x86

• Online instrumentation of binaries

• Valgrind

• Dynamic instrumentation

• Based on emulation of x86 machine instructions

37

Tracing

Function Obelix (...)
  call monitor("Obelix", "enter")
  ...
  call monitor("Obelix", "exit")
end Obelix

MPI Library:
Function MPI_Send (...)
  call monitor("MPI_Send", "enter")
  ...
  call PMPI_Send(...)
  call monitor("MPI_Send", "exit")
end MPI_Send

[Diagram: Process 0 ... Process n-1 each write their own trace file (Trace P0, Trace P1, ..., Trace Pn-1)]

Trace P0:
10.4 P0 Obelix enter
10.6 P0 MPI_Send enter
10.8 P0 MPI_Send exit
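
A minimal sketch of such an MPI wrapper using the PMPI interface in C (assuming an MPI-3 mpi.h; write_trace_record is a made-up helper standing in for the monitor):

#include <mpi.h>
#include <stdio.h>

/* Hypothetical trace writer: timestamp, routine, location. */
static void write_trace_record(const char *routine, const char *location) {
    fprintf(stderr, "%.3f %s %s\n", MPI_Wtime(), routine, location);
}

/* Intercepts every MPI_Send of the application; the real implementation
 * is still reached through its PMPI_ entry point. */
int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm) {
    write_trace_record("MPI_Send", "enter");
    int rc = PMPI_Send(buf, count, datatype, dest, tag, comm);
    write_trace_record("MPI_Send", "exit");
    return rc;
}

Because the wrapper library is linked before the MPI library, no recompilation of the application is needed; this is the library interposition idea listed under Instrumentation Tools above.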

38

Merging

[Diagram: a merge process combines the per-process traces (Trace P0, Trace P1, ..., Trace Pn-1) into a single global trace for P0 - Pn-1, sorted by timestamp]

10.4 P0 Obelix enter
10.5 P1 Obelix enter
10.6 P0 MPI_Send enter
10.7 P1 MPI_Recv enter
10.8 P0 MPI_Send exit
11.0 P1 MPI_Recv exit

39

Visualization of Dynamic Behaviour

P0 - Pn-1

10.4 P0 Obelix enter
10.5 P1 Obelix enter
10.6 P0 MPI_Send enter
10.7 P1 MPI_Recv enter
10.8 P0 MPI_Send exit
11.0 P1 MPI_Recv exit

[Timeline visualization: time axis from 10.4 to 11.0; P0 executes Obelix, with MPI_Send from 10.6 to 10.8; P1 executes Obelix, with MPI_Recv from 10.7 to 11.0]

40

Profiling vs Tracing

• Profiling

• recording summary information (time, #calls, #misses, ...)

• about program entities (functions, objects, basic blocks)

• very good for quick, low cost overview

• points out potential bottlenecks

• implemented through sampling or instrumentation

• moderate amount of performance data

• Tracing

• recording information about events

• trace record typically consists of timestamp, processid, ...

• output is a trace file with trace records sorted by time

• can be used to reconstruct the dynamic behavior

• creates huge amounts of data

• needs selective instrumentation

41

Program Monitors

• Each PA tool has its own monitor

• Score-P

• In recent years, Score-P has been developed by the tools groups of Scalasca, Vampir, and Periscope.

• Provides support for

– MPI, OpenMP, CUDA

– Profiling and tracing

– Callpath profiles

– Online Access Interface

• Cube 4 profiling data format

• OTF2 (Open Trace Format)

42

Performance Analysis

[Diagram: analysis cycle of instrumentation, execution, analysis, and refinement; the current hypotheses define the measurement requirements, execution produces performance data, and analysis yields detected bottlenecks]

Instr: Dat → ISpec

Info: Hyp → Dat

Prove: Hyp × Dat → {T, F}

Refine: Hyp → P(Hyp)

43

Common Performance Problems with MPI

• Single node performance

• Excessive number of 2nd-level cache misses

• Low number of issued instructions

• IO

• High data volume

• Sequential IO due to IO subsystem or sequentialization in the program

• Excessive communication

• Frequent communication

• High data volume

44

Common Performance Problems with MPI

• Frequent synchronization

• Reduction operations

• Barrier operations

• Load balancing

• Wrong data decomposition

• Dynamically changing load

45

Common Performance Problems with SM

• Single node performance

• ...

• IO

• ...

• Excessive communication

• Large number of remote memory accesses

• False sharing

• False data mapping

• Frequent synchronization

• Implicit synchronization of parallel constructs

• Barriers, locks, ...

• Load balancing

• Uneven scheduling of parallel loops

• Uneven work in parallel sections

46

Analysis Techniques

• Offline vs Online Analysis

• Offline: first generate data then analyze

• Online: generate and analyze data while application is running

• Online requires automation => limited to standard bottlenecks

• Offline suffers more from size of measurement information

• Three techniques to support user in analysis

• Source-level presentation of performance data

• Graphical visualization

• Ranking of high-level performance properties

47

Statistical Profiling based Tools

• Gprof – GNU profiling tool

• Time profiling

• Inclusive and exclusive time

• Flat profile

• Call graph profile

• Based on instrumentation of function entry and exit

• Records where the call is coming from.
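
As a usage note (assuming the GNU toolchain; not from the slides): compile and link with gcc -pg, run the program once to produce gmon.out, and then run gprof on the executable together with gmon.out to obtain the flat profile and the call graph profile.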

48

Statistical Profiling based Tools

• Allinea MAP

• Annotations to the application source code.

• Based on time series of profiles

• For parallel applications it indicates outlying processes.

49

Profiling Tools based on Instrumentation

• TAU (Tuning and Analysis Utilities)

• Measurements are based on instrumentation

• Visualization via paraprof

– Graphical displays, aggregated and per node, context, or thread

– Topology views of performance data

• Scalasca

• Cube performance visualizer

• Profiles based on Score-P

• Call-path profiling

50

Trace-based Analysis Tools

• Vampir

• Graphical views presenting events and summary data

• Flexible scrolling and zooming features

• OTF2 trace format generated by Score-P

• Commercial license

• www.vampir.eu

51

Trace-based Analysis Tools

• Paraver

• Barcelona Supercomputing Center

• MPI, OMP, pthreads, OmpSs, CUDA

• http://www.bsc.es/computer-sciences/performance-tools/paraver

• Clustering of program phases, i.e. segments between MPI calls

• Recently tracking of clusters in time series of profiles based on object tracking

52

Automatic Analysis Tools

• Paradyn

• University of Wisconsin Madison

• Periscope

• TU München

• Automatic detection of formalized performance properties

• Profile data

• Distributed online tool

• Scalasca

• Search for performance patterns in traces

• Post-mortem on parallel resources of the application

• Visualization of patterns in CUBE
