The University of Edinburgh
Derived Metrics with Paraver using Hardware
Counters on Power 5 Chips
Nicholas Pattakos
October 8, 2008
Contents

1 Introduction
  1.1 Dissertation structure
2 Profiling and Paraver configuration files
  2.1 Software performance optimisation
  2.2 Tools for profiling serial applications
  2.3 Hardware counters
  2.4 Parallel profilers and libraries for parallel profiling
  2.5 Description of this project
3 Working environment
  3.1 The HPCx super-computing service
    3.1.1 Nodes: IBM eServer pSeries p5 575
    3.1.2 Power5 chips
    3.1.3 HPCx interconnect
  3.2 Operating system
  3.3 Compiler, libraries and other software
  3.4 HPM toolkit
    3.4.1 hpmcount
    3.4.2 libHPM
  3.5 Using Paraver to profile codes
    3.5.1 Basic Paraver tracing
    3.5.2 Paraver instrumentation
    3.5.3 Paraver's UI and basic view creation
    3.5.4 Creating a configuration file
4 Aim and methodology, test case programmes
  4.1 Custom written programmes
  4.2 Stream Benchmark
  4.3 LAMMPS code
  4.4 Using Paraver to obtain information on sections of a code
5 Results
  5.1 64/32-bit note
  5.2 Existing configuration files
  5.3 AFLOPS
6 Conclusions and tools evaluation
  6.1 Project assessment
  6.2 HPM evaluation
  6.3 Paraver
  6.4 Future work
A Appendix
  A.1 Source code and sample Makefile
  A.2 Sample Paraver .cfg file
List of Figures

1 Xprofiler snapshot processing the profiling data of a SPEC benchmark. Picture taken from IBM's web site.
2 A picture of an MCM. Picture taken from Wikipedia.
3 A trace loaded into Paraver
4 Filtering window
5 A created view
6 Timescale
7 Two views used to create a derived metric
8 Derived metric view
List of Tables

1 Available registers for hardware counters for some architectures. Information taken from [7].
2 HPCx's interconnect main characteristics.
3 AFLOPS results from libHPM and Paraver for the a(i)+b(i) calculation in the simple "Hello" code
Acknowledgements
I would like to thank my supervisor, who was always helpful, and Prof Jesus Labarta, Judit Gimenez, German Llort and Harald Servat from the Barcelona Supercomputing Centre for their tutorials and their help. Without them this dissertation would have been impossible. I would also like to thank my parents, who supported me throughout the year of my MSc in HPC.
Abstract
Parallel profiling is key to optimising a code for High Performance Computing. Parallel profilers monitor what every element (process and/or thread) does during execution and visualise statistics of what was monitored. Graphical representation of these statistics makes it easy for a developer to identify performance problems, such as load-imbalanced decompositions, excessive communication overheads and a poor average rate of floating-point operations per second. Paraver is one such performance analysis and visualisation tool; it can be used to analyse MPI, threaded or mixed-mode programmes. Paraver can also report statistics based on hardware counters, which are provided by the underlying hardware. This project aimed to develop Paraver configuration files that allow certain metrics to be analysed for the IBM Power5, based on similar metrics that already exist for Power4 architectures. Their accuracy was verified by profiling a set of codes with both Paraver and IBM's HPM toolkit, which is known to be accurate.
1 Introduction
"The First Rule of Program Optimisation: Don't do it. The Second Rule of
Program Optimisation (for experts only!): Don't do it yet."
Michael A. Jackson
Significant progress in available computational hardware is constantly made, and existing software usually needs to be modified before that progress can be exploited. The clock frequency of CPUs, as well as their transistor density, has steadily increased, roughly following Moore's law for the last few decades; but the highest frequency a CPU can reach with today's technology seems to have hit a limit that is hard to overcome. Meanwhile the number of transistors per mm2 continues to approach the natural limit of miniaturisation at atomic scales, following the same exponential pace that Moore's law predicted in 1965 and that almost dictates the processor industry today. However, processing power continues to increase as new technologies evolve. Nowadays, for example, efficient utilisation of multi-core processors, general-purpose computing on graphics processing units, the Cell Broadband Engine (Cell/B.E., or simply the Cell chip) from IBM-Sony-Toshiba, or Field Programmable Gate Arrays (FPGAs) can provide processing power that conventional CPUs will probably take some years to deliver. At the time of writing, RoadRunner, the world's fastest supercomputer, is one example where unique performance is available to those who can exploit the underlying hardware. New tools and software development kits (SDKs) are developed and shipped with new hardware; these can help in either utilising the new hardware or adapting existing software to it. Quite often these new tools build on tools and SDKs that already exist, just as new software is often built on top of, or extends, existing software. Software profiling tools are one such example. They are regularly updated to support new architectures and features, and they are of vital importance to HPC software development and source-code fine tuning. These tools provide run-time information about a programme that makes porting and optimising a code for a machine much easier, helping developers understand a programme's behaviour so that they can make changes that result in the programme running significantly faster. This optimisation process can be quite difficult and is in fact an active research area. There are some technical reports on the HPCx web page which nicely demonstrate how to use such tools in general, or how to use such techniques to improve a programme's performance or simply to port it: [1], [2], [3].
Code development is a non-trivial task and the most important goal is always code correctness. No matter how well written or innovative a programme might be, it is useless if it does not actually do what it is supposed to do. In High Performance Computing (HPC), next to code correctness, performing efficiently is also very important. Sophisticated data structures or advanced numerical algorithms will not be beneficial if poorly implemented. Designing an application carefully is also important, but it cannot always ensure that the implementation will perform as estimated. Therefore, when writing new software or porting software to a new machine, optimisation is needed after code verification. Of all the source code lines that could be rewritten to perform better, only some are worth tweaking. Typically, in an HPC code, only a small fraction of the code performs the useful calculations and takes most of the time to complete. Time spent optimising this fraction of the code yields greater performance improvements than spending the same time on other parts of the code. This applies to both serial and parallel codes, but the latter require even more work, because parallelising a code is also non-trivial and might ruin the performance of a code that performs very well as a single process. There are several tools to assist a programmer in improving serial performance, and there are also some that monitor parallel execution. These performance analysis tools are very important and useful for understanding a programme's behaviour so as to maximise hardware utilisation. Different tools have different features, and even though their users are programmers, who are usually power users and more capable than the average computer user, ease of use is as important as flexibility and functionality. A lot of HPC-related single-process optimisations actually aim at keeping enough data flowing into and out of the CPU so that wasted CPU cycles are minimised. This is usually the case because progress in the processing power of microprocessors has been significantly faster than that of computer subsystems such as memory, interconnects or storage.
This project is related to the above in the sense that its aim is to implement an easy way to access advanced features of a microprocessor, features that can be used to optimise codes running on it. Modern microprocessors are able to monitor execution streams and measure several aspects of their run-time behaviour, such as the number of instructions completed. These features are not very easy to access at a low level, and are thus usually used by higher-level programmes employed for profiling codes. Profiling tools use these hardware features to extend the range of information they provide and offer an easy way of accessing these advanced features. The hardware we worked on for this project is not as revolutionary as the systems previously described, but the same principles should apply to them as well. The problems encountered and the lessons learned in this project do not differ radically from those on other computer systems.
1.1 Dissertation structure
The remainder of this dissertation is structured as follows. The second section discusses software optimisation and why it is a necessity, and describes software commonly used for optimising codes that run with a single execution thread or with multiple threads. It also states the aim of this project and how it relates to the rest of the section. Section three describes the hardware and software environment in which this project took place. Section four describes the software that was profiled to test the Paraver configuration files. Section five presents some of the results. Section six assesses the project, evaluates the HPM toolkit and Paraver, and proposes possible future work.
2 Profiling and Paraver configuration files
2.1 Software performance optimisation
Software optimisation requires finding the bottleneck: the critical part of the code that is the primary consumer of processing resources, usually processor cycles. As a rule of thumb, improving 20% of the code produces 80% of the gains, as the Pareto principle apparently applies to software optimisation as well as to other, non-computer-science fields [4]. According to observations made in several scientific fields (economics, human population studies), 20% of the possible causes are responsible for 80% of the consequences; this is known as the Pareto principle, the principle of factor sparsity, the law of the vital few, or simply the 80-20 rule. The Pareto principle can be applied to resource optimisation because it is also often observed that 80% of the resources are typically used by 20% of the operations. There are other variations of this rule as well, one being that 90% of the resources are consumed by 10% of the consumers. This variation is often more suitable for software optimisation, approximating that only 10% of a code accounts for 90% of the total run time.
Performance improvements are often implemented by adding code. Such optimisations often complicate the source code and make it harder to maintain and debug. Maintainability and readability are more important than efficiency in the early development stages. As Donald Knuth said:
"We should forget about small efficiencies, say about 97% of the time:
premature optimisation is the root of all evil." [5]
"Premature optimisation" refers to a situation where the design of a piece of software is influenced by performance-oriented decisions. This might result in a design that is not as clean as it could have been, or in code that is incorrect, both because the code is complicated by the optimisation and because the programmer is distracted by optimising. A simple and elegant design is often easier to optimise at a later stage, and profiling may reveal unexpected performance problems that would not have been addressed by premature optimisation. In practice it is often necessary to keep performance goals in mind when first designing software, but the developer has to balance the design and optimisation goals. Therefore the recommended approach is to design first, code following the design, and then profile and benchmark the resulting code to see what should be optimised.
2.2 Tools for profiling serial applications
To measure a programme's performance one can simply add timing instructions. This approach is not very adequate, as it requires source code modification and recompilation, which might introduce bugs, and it may not be possible at all if the source code is unavailable. Moreover, it only measures the execution time of the sections of the programme that were timed. Usually this provides little information, especially if the time required to modify the source code and re-run it is also taken into consideration.
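As an illustration of this manual approach (a language-neutral sketch; the codes profiled in this project are compiled HPC programmes, so the function and names here are purely hypothetical):

```python
import time

def work(n):
    """A stand-in for the computation we want to time."""
    total = 0
    for i in range(n):
        total += i * i
    return total

# Manual timing: only the bracketed region is measured, and the source
# must be edited and re-run for every region of interest.
start = time.perf_counter()
result = work(100_000)
elapsed = time.perf_counter() - start

print(f"work() took {elapsed:.6f} s")
```

Every additional region to be measured needs another pair of timing calls and another recompile-and-run cycle, which is exactly the inconvenience profilers remove.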
Profilers are tools designed for this performance analysis effort. They provide more detailed information than the process just described, and require less effort: they measure the behaviour of a programme as it runs. Using a profiler usually requires recompiling the programme to be profiled with debugging symbols enabled. Profilers use a wide variety of techniques to collect data, such as hardware interrupts, code instrumentation, operating system hooks and performance counters. Profiling tools are also useful for visualising memory usage and identifying memory leaks; in this dissertation, however, only performance-related features are considered, as it is assumed that the programme being analysed works correctly and that a profiler is used to decrease its total execution time. A profiling tool records a stream of events, or a statistical summary of the events observed, which is then viewed or further analysed. Sometimes the former is called a trace and the latter a profile, but the terms are often used interchangeably. In this paper, tracing or profiling refers to the procedure that collects either a stream of recorded events or a statistical summary, while a profile or trace refers to any output of these (graphically visualised or not) or any analysis performed on them. Some profiling tools do not record all of the events that actually happen, but instead sample the current state of the programme at regular intervals. In theory, events that complete in less than the sampling interval may rarely or never be sampled and might therefore not appear at all in the profile summary, although a profile that entirely misses whole sections of the code is unlikely in practice. The point is that inaccuracies may appear when using sampling-based tools, and this should be taken into consideration.
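This sampling effect can be shown with a toy simulation (illustrative only; the event names and durations are invented):

```python
def sampled_profile(events, interval):
    """Simulate a sampling profiler: walk a timeline of (name, duration)
    events and record which event is 'live' at each sampling tick."""
    samples = {}
    t = 0.0            # current position on the timeline
    next_tick = interval
    for name, duration in events:
        end = t + duration
        while next_tick <= end:
            samples[name] = samples.get(name, 0) + 1
            next_tick += interval
        t = end
    return samples

# One long phase followed by many events far shorter than the interval.
events = [("long_loop", 10.0)] + [("tiny_call", 0.0001)] * 50
print(sampled_profile(events, interval=1.0))
# The short calls receive no samples and vanish from the profile entirely.
```

The long phase is sampled every tick, while the fifty short calls together span only a fraction of one interval and so leave no trace in the summary.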
Based on the type of output, there are two kinds of profilers: flat profilers and call graph profilers. Both count the frequency and duration of function calls as the profiled programme runs. Flat profilers compute how much time a programme spent in each function and how many times that function was called, but they do not break down call times by caller or context. Call graph profilers show the calling frequency of the functions as well as the call chains involved; that is, for each function they show which functions called it, which functions it called, and how many times, together with an estimate of how much time was spent in each function's subroutines. This can suggest places where one might try to eliminate time-consuming function calls.
Another way to categorise profilers is by the method used to gather data: event-based profilers, statistical profilers and instrumentation-based profilers. Instrumentation-based profilers require source code modification and recompilation, so that the instrumented programme includes calls triggering the events that the profiler records at run time. Usually these calls are used by the programmer to instruct the profiler which events are desired and should be recorded. The process of adding the code needed to instruct the profiler is known as instrumentation. The fact that the programmer can ask the profiler to record exactly what is wanted is an advantage of instrumentation over other types of profiling. Instrumenting a code is flexible both in choosing which parts of the code should be profiled and in choosing which events should be trapped for those parts. An additional advantage is that, since the programmer can restrict profiling to particular parts of the code, less data is recorded, making the trace smaller and easier to manipulate. There are two types of instrumentation: the first instructs the profiler to switch profiling on or off, while the second simply labels different sections of the code. Clearly, having to modify a code in order to profile it is not very convenient, and it requires recompilation.
In general, three steps are required to profile a serial application:
1. If necessary, compile the application source code with particular compiler flags or after adding instrumentation calls.
2. Run the executable to collect run-time information and produce a data file.
3. Process the data file with the profiler and analyse its output.
Some of the most common utilities are briefly outlined below. Most of them are also available on HPCx.
Prof and gprof are widely known and used tools, available on almost every Unix system and often called the traditional profiling tools. They work by statistically monitoring the programme counter (PC) register, from which CPU time is estimated. Prof generates a statistical profile of the CPU time used by a programme, as well as an exact count of the number of times each function is called. Gprof generates the same information, along with the number of times each caller-callee pair is traversed in the programme's call graph; that is, prof only produces a flat profile whereas gprof outputs call graph profiles. An example of the gprof output for a simple programme with a few dummy functions calling each other is:
granularity: Each sample hit covers 4 bytes. Time: 0.58 seconds
called/total parents
index %time self descendents called+self name index
called/total children
6.6s <spontaneous>
[1] 52.9 0.31 0.00 .__mcount [1]
-----------------------------------------------
0.19 0.05 1000/1000 .main [3]
[2] 41.4 0.19 0.05 1000 .foo1 [2]
0.05 0.00 10000000/10000000 .foo2 [5]
-----------------------------------------------
0.00 0.24 1/1 .__start [4]
[3] 41.4 0.00 0.24 1 .main [3]
0.19 0.05 1000/1000 .foo1 [2]
-----------------------------------------------
6.6s <spontaneous>
[4] 41.4 0.00 0.24 .__start [4]
0.00 0.24 1/1 .main [3]
0.00 0.00 1/1 .__C_runtime_startup [212]
0.00 0.00 1/1 .exit [362]
-----------------------------------------------
0.05 0.00 10000000/10000000 .foo1 [2]
[5] 8.6 0.05 0.00 10000000 .foo2 [5]
Xprofiler is a GUI-based AIX performance profiling tool distributed as part of the IBM Parallel Environment for AIX [6]. It can be used to graphically identify which functions in a code are the most CPU intensive. It provides a graphical function call tree as well as a text profile of the code. Xprofiler can be used to profile sequential and parallel C, C++, Fortran 90, Fortran 77 and HPF programs. To use Xprofiler, you first compile and link the program using the -g option, to create an object file with symbol table references, and the -pg option, to enable profiling; you then run the program to create the run-time data file(s) (one for each processor involved in the execution), and finally invoke the Xprofiler utility to analyse and display the profiling information gathered. Any compiler optimisation options can be enabled. Xprofiler does not record data while the profiled programme is sleeping, and therefore cannot be used to provide information such as I/O or communication data.
Figure 1: Xprofiler snapshot processing the profiling data of a SPEC benchmark. Picture taken from IBM's web site.
2.3 Hardware counters
Nowadays a rich source of statistical information on program execution characteristics is provided by processors through hardware counters. These are a set of special-purpose registers built into modern microprocessors to store counts of hardware-related activities within the computer system. Hardware counters monitor, in hardware, events related to a CPU's arithmetic and logic units (ALU), to all levels of the memory hierarchy, or to bus activity, and can be used to optimise software. When trying to improve a programme's performance, for example, a large number of cache misses might suggest that restructuring the code to improve data locality and increase cache reuse could improve its performance. Compared to software profilers such as the sampling-based ones described before, hardware counters provide low-overhead access to a wealth of detailed performance information. What is more, monitoring them does not necessarily require source code modification. Examples of information that can be obtained include branch prediction (or mis-prediction) accuracy, instructions completed per clock tick, cache misses and cache stall cycles at all levels, translation lookaside buffer (TLB) misses, elapsed CPU cycles, executed instructions and floating-point operations. There is one more kind of metric, known as derived metrics. These are calculated by combining the values provided by hardware counters, applying basic algebraic operations to them. Derived metrics provide quantities that are easier to understand than the raw metrics. For example, a high number of cache load misses per cache load is clearly a serious performance problem, while a high absolute number of cache load misses is not necessarily problematic.
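To make the derived-metric idea concrete, here is a minimal sketch; the counter names, raw values and clock rate are all hypothetical, and real values would come from the hardware counters themselves:

```python
def derived_metrics(counters, cpu_hz):
    """Combine raw hardware-counter values into derived metrics
    using basic algebraic operations."""
    cycles = counters["cycles"]
    return {
        # Instructions completed per cycle (IPC).
        "ipc": counters["instructions"] / cycles,
        # Fraction of L1 cache loads that missed.
        "l1_miss_ratio": counters["l1_load_misses"] / counters["l1_loads"],
        # Cycle count converted to wall-clock seconds via the clock rate.
        "seconds": cycles / cpu_hz,
        # Floating-point operations per second.
        "flops": counters["fp_ops"] * cpu_hz / cycles,
    }

# Hypothetical raw counts from one measurement interval on a 1.5 GHz chip.
raw = {"cycles": 3_000_000_000, "instructions": 4_500_000_000,
       "l1_loads": 1_000_000_000, "l1_load_misses": 20_000_000,
       "fp_ops": 1_200_000_000}
print(derived_metrics(raw, cpu_hz=1.5e9))
```

Each derived quantity is just a ratio or rescaling of raw counts, which is exactly what a Paraver configuration file encodes.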
The number of counter registers available differs between microprocessor families. Each microprocessor provides registers to support a number of hardware counters simultaneously, and this number is defined by the architecture. The mapping between registers and hardware events is one-to-many: the number of counter registers is limited, while many hardware events can be mapped to each register. There are usually between two and eighteen registers available on a processor. Some of the hardware events supported by a processor can use any register, while others are only available on particular registers. As a result there are many rules restricting the concurrent use of different events. Consequently, not all combinations of hardware counters can be chosen in a single experiment: of all the events that can be counted, only a few registers are available, and only certain sets of events can be counted at a given time. There is therefore only a limited number of sets of hardware events that can be monitored at any time. Each valid combination of hardware counters that can be assigned to registers at the same time is called a hardware counter group. Hardware counter grouping is also determined by the fact that some of the hardware counters provide similar or related information, and the events monitored in each hardware counter group are chosen according to these two considerations. For example, branch mis-predictions and instruction cache misses are often related, because a branch mis-prediction causes the wrong instructions to be loaded into the instruction cache, instructions that must then be replaced by the correct ones; the replacement can cause an instruction cache miss or an instruction TLB (ITLB) miss. The limited number of registers for storing hardware counter values often forces users investigating a code's performance to conduct multiple measurements, each time using a different group, to eventually collect all the performance metrics desired.
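The register-assignment constraint can be sketched as a toy model; the event names and the restriction table below are hypothetical, not actual Power5 counter groups:

```python
from itertools import permutations

def can_count_together(events, allowed_registers, n_registers):
    """Check whether a set of hardware events can be assigned to distinct
    counter registers, given which registers each event may use."""
    if len(events) > n_registers:
        return False
    # Try every assignment of events to distinct registers; fine for the
    # handful of registers (2 to 18) that real chips provide.
    for regs in permutations(range(n_registers), len(events)):
        if all(r in allowed_registers[e] for e, r in zip(events, regs)):
            return True
    return False

# Hypothetical restriction table for a 2-register chip: FP_OPS and BR_MISS
# may only be counted on register 0, the others on either register.
allowed = {"CYCLES": {0, 1}, "FP_OPS": {0}, "L1_MISS": {0, 1}, "BR_MISS": {0}}
print(can_count_together(["CYCLES", "FP_OPS"], allowed, 2))   # a valid group
print(can_count_together(["FP_OPS", "BR_MISS"], allowed, 2))  # both need reg 0
```

The second combination fails even though only two events are requested, which is exactly why multiple runs with different counter groups are needed in practice.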
The types and meanings of hardware counters vary from one architecture to another due to variations in hardware organisation. Furthermore, it is difficult to correlate the low-level performance metrics back to source code. Profiling tools that make use of hardware counter data are capable of mapping counter values to source code lines, usually by instrumenting the corresponding lines. They are also able to compute derived metrics. Derived metrics can, for instance, convert counters that count cycles into time in seconds, or combine several hardware counters to measure quantities such as total L1 cache loads/stores per second. There is a large variety of hardware-counter-based derived metrics, especially since each type of processor has its own counter set and since a large number of counters can be combined.
Table 1: Available registers for hardware counters for some architectures. Information taken from [7].

Processor       Available hardware counters
UltraSparc II   2
Pentium III     2
AMD Athlon      4
IA-64           4
POWER4          8
Pentium 4       8
2.4 Parallel profilers and libraries for parallel profiling
There are numerous tools widely used for parallel profiling, some of which provide graphical representations of their analysis, which is more convenient for the user.
HPM The Hardware Performance Monitor (HPM) toolkit is developed by IBM's ACTC [8] and consists of two tools: hpmcount and libHPM. They are available for Power architectures running AIX or Linux.
Paraver/ompitools is one of the few parallel profiling visualisation tools available today [9]. It is designed to combine several features: it is a performance visualisation and analysis tool that can be used to analyse MPI, OpenMP, mixed-mode (both MPI and OpenMP) and Java codes; it is able to monitor hardware counters on several platforms; and it is flexible enough to handle large trace files efficiently. It is based on a Motif GUI, which should be easy to use and should run on any platform. It is developed by the European Centre for Parallelism of Barcelona at the Technical University of Catalonia (CEPBA/UPC). Paraver is currently available for SGI IRIX, Tru64 UNIX, IBM AIX, HP-UX, Solaris and Linux.
2.5 Description of this project
As previously discussed, hardware counters can count a number of events. However, the raw values reported by the counters are not very meaningful to someone trying to optimise a code. Their power shows when they are used to calculate more complex quantities; in this way, virtually anything performance-related can be measured. For example, knowing the number of clock ticks a code needed to run, or the number of instructions the CPU executed while the code was running, is of little use unless the two are divided to obtain the number of instructions completed per second, a metric that clearly characterises the code's performance.
Paraver can monitor hardware counters and thus provides an easy way to access them, provided it is properly configured to use raw hardware counter values and compute derived metrics, which are then presented in a timeline view or as a 2D histogram. Configuring Paraver takes place through its GUI, but it is not trivial, for reasons that will be explained later: the steps to compute a derived metric must be taken carefully, and the computed values must be checked to verify their correctness. Creating the configuration files themselves is rather simple, as it is done through the GUI; verifying the correctness of the values Paraver reports when using a configuration file is the tricky part of their creation. Fortunately, once Paraver has been correctly instructed how to compute a derived metric, the configuration can be saved in a configuration file, which can then be used to obtain the same derived metric for another trace. Using such configuration files makes it easy to obtain derived-metric information based on hardware counter values when profiling codes other than the one used to create the file.
The first goal of this project is to create configuration files that allow Paraver to easily use hardware counter information on the current HPCx system configuration. The validity of these configuration files needs to be verified by comparing their output with the values obtained when profiling a set of programmes with libHPM. The HPM library is part of the HPM toolkit provided by IBM for the HPCx system, and its functionality is verified; this is why it was used as the reference against which to compare the results Paraver reports with the newly created configuration files. Configuration files for the Power4 processors were already available on HPCx and their functionality is verified, so the existing Power4 configuration files could be used for reference or as examples. Some configuration files for the Power5 chip were also already available, but they had not been tested; this testing has now taken place.
Testing the newly created configuration files is not entirely straightforward, because one needs to decide how much the values Paraver reports may differ from those libHPM reports: a code will not always perform identically between runs, even with the same dataset, running on the same processor. When measuring quantities that can be estimated from the code, if the values from Paraver match those from libHPM, then the configuration file is correctly created. For example, a configuration file that measures multiply-add instructions per second can be tested by profiling a simple code specifically written to execute a predefined number of multiply-add operations. On the other hand, there are quantities that cannot be predicted without executing the code. For example, when measuring branch mis-predictions we cannot always expect to get the same values between subsequent runs; the same holds when verifying derived metrics by measuring cache misses per memory access or translation lookaside buffer (TLB) misses. That is why a statistical approach is needed to verify the results obtained for these quantities.
As a secondary aim, this project targeted experimenting with performance-oriented
profiling. The plan was to create configuration files for paraver to compute derived metrics
on another architecture. Some widely used codes would then be profiled on both
systems and the performance of the two systems compared. The metrics created
for the other architecture should be the same as, or similar to, the derived metrics created for
the HPCx system, and should be tested as well. Systems available that could have been
used for this comparison are BlueSky (IBM / BlueGene / PowerPC 440), HECToR
(Cray / XT4 / AMD Opteron x86_64) and Ness (Sun / AMD Opteron x86_64). None of
the secondary goals were achieved, as it turned out that time was insufficient to allow
any work to be done on another machine.
3 Working environment
3.1 The HPCx super-computing service
HPCx is one of the largest supercomputers in the United Kingdom and one of the
UK's national supercomputing facilities. As of June 2008 it is ranked 261st in the list of
the top five hundred supercomputers in the world [10], though it was once the second fastest
computer in Europe: in November 2002 it was the ninth fastest supercomputer in
the world. The machine is a cluster of IBM SMP nodes delivering 15.36 TFLOP/s of
peak performance, or 12.94 TFLOP/s sustained (the Rmax value of the
Linpack benchmark). It also has 5.12 TByte of memory available and 72 TByte of disk
space. Finally, there is a library of roughly 3584 tapes providing a total of approximately
50 TB of tape capacity. The HPCx system is housed at the UK's STFC
Daresbury Laboratory and operated by the HPCx Consortium. The HPCx consortium,
namely UoE HPCX Ltd, is led by the University of Edinburgh, with the Science and
Technology Facilities Council (STFC) and IBM. EPCC provides the University of
Edinburgh's contribution.
3.1.1 Nodes: IBM eServer pSeries p5 575
The HPCx system uses IBM eServer pSeries p5 575 1.5 GHz nodes for the compute,
login and disk I/O nodes. A more detailed description of the system than the one that
follows can be found on its web site [11]. The HPCx service is a typical cluster of shared
memory nodes and provides 160 nodes containing a total of 2560 IBM Power5 processors. Each
eServer node contains 16 1.5 GHz Power5 processors, in the form of eight Dual-Core
Modules (DCMs), each having two cores. In the Power5 architecture, a chip contains
two processors, and four chips (8 processors) are integrated into a multi-chip module
(MCM). Each MCM is configured with 128 MB of L3 cache and 16 GB of main memory.
Two MCMs (16 processors) comprise one frame. The total main memory of 32
GB per frame is shared between the 16 processors of the frame. Each frame is a 16-way
logical partition (LPAR). The names LPAR and system frame are synonyms for
compute node on HPCx.
3.1.2 Power5 chips
The eServer compute nodes utilise IBM Power5 processors. The Power5 is a 64-bit
RISC processor implementing the PowerPC instruction set architecture. It has a 1.5
GHz clock rate and an 8-way super-scalar architecture with a 20-cycle pipeline.
There are two floating point multiply-add units, each of which can deliver one result
per clock cycle, giving a theoretical peak performance of 6.0 GFLOP/s. Each processor
has its own Level 1 cache, divided into a 64KB instruction cache and a 32KB data cache
with 128-byte lines, integrated onto the chip. The two processors on one chip share the
on-chip Level 2 cache (instructions and data) of 1.9 MB, and they also share a 36MB
Level 3 cache. Each node on HPCx has a main memory of 32GB. Inter node communication
on HPCx is provided by IBM's High Performance Switch (HPS) and intra node
communication is via shared memory. The following lists how many cycles are needed to
fetch data from each cache level:
L1 cache 3 cycles to retrieve data.
L2 cache 15 cycles to retrieve data.
L3 cache 80 cycles to retrieve data.
Main Memory 350 cycles to retrieve data.
HPCx is a cluster of shared memory servers, each with a sophisticated multilevel cache
memory system, and its use of superscalar processors, which have multiple functional units,
potentially magnifies any memory access inefficiency a code might suffer. The Power5 has
two floating point units, and the theoretical peak performance can only be achieved if
independent instructions can be issued to both of these units during each cycle. In
practice, however, in addition to idle processor cycles ("vertical waste"), there are
cycles during which the processor is not idle but does not utilise all functional units
("horizontal waste").
Each chip contains two processors, together with the Level 1 (L1) and Level 2 (L2)
cache. Each processor has its own L1 instruction cache of 64KB and L1 data cache of
32KB, integrated onto the chip, while the on-chip L2 cache (instructions and data) of
1.9MB is shared between the two processors. Four chips (8 processors) are integrated
into a multi-chip module (MCM) and two MCMs (16 processors) comprise one frame.
Each MCM is configured with 128MB of L3 cache and 16GB of main memory. An MCM,
with its four L3 caches, is pictured in figure 2. The total L3 cache of 256MB per frame
and the total main memory of 32GB per frame are shared between the 16 processors
of the frame.
Figure 2: An MCM. Picture taken from Wikipedia.
3.1.3 HPCx Interconnect
Inter node communication (between frames) is provided by IBM’s High Performance
Switch (HPS). Each eServer frame has two network adapters and there are two links
per adapter, making a total of four links between each of the frames and the switch
network. HPS is a sophisticated interconnect, fast in both bandwidth and latency, and
of vital importance to the maximum sustainable performance HPCx is capable of.
Table 2 includes some timings, found on IBM's website [12], which indicate the HPS's
capabilities.
Table 2: HPCx's interconnect main characteristics.

Quantity     1.9 GHz POWER5+ p5-575
Latency      3.6 µs
Bandwidth    1.88 − 5.79 GB/sec
3.2 Operating system
The operating system running on each LPAR is IBM's version of Unix, AIX version
5.3. AIX (Advanced Interactive eXecutive) is the name given to a series of proprietary
operating systems sold by IBM for several of its computer system platforms. AIX
5L is an open standards-based OS that conforms to The Open Group's Single UNIX
Specification Version 3 [13] and is based on UNIX System V with 4.3BSD-compatible
command and programming interface extensions. The AIX 5L 5.3 release runs on up
to 64 IBM Power or PowerPC architecture central processing units and 2TB of RAM.
3.3 Compiler, libraries and other software
For this project the following tools were also used:
• Hardware Performance Monitor toolkit (hpmcount and the HPM library).
• Paraver / ompitools.
• Small test code specifically written for the purposes of this project.
• The parallel version of the Stream benchmark, stream_mpi.
• The LAMMPS Molecular Dynamics Simulator.
The HPCx user support web site provides information on compiling and submitting jobs
to the machine.
The system has the IBM XL for AIX compiler suite available. The re-entrant versions of
the C and Fortran compilers were used, as they produce thread safe binaries. In particular,
xlf90_r and xlc_r were used for programs that do not make use of the MPI library.
To compile using the MPI library, the mpxlf90_r and mpcc_r shell scripts were used.
These shell scripts compile Fortran or C/C++ programs while linking in the Partition
Manager, the Message Passing Interface (MPI) and (optionally) the Low-level Applications
Programming Interface (LAPI).
All codes were compiled for 32-bit address spaces using the -q32 compile
option, because paraver on HPCx currently works only with executables addressing
32-bit address spaces, due to a bug in IBM's Dynamic Probe Class Library (DPCL)
that IBM has not yet fixed. All codes were also compiled at the -O2 optimisation level,
so that the executables were not so heavily optimised that a profiler might be
confused. On the other hand, some level of optimisation is needed, for example to make
sure the binary does not contain instructions or variables that are not needed (dead code
elimination, copy and constant propagation, etc.).
HPCx provides two environments in which submitted jobs execute. One is the batch
processing system and the other is the interactive execution environment. Both
environments are accessed through LoadLeveler [14], the workload manager and
scheduler on HPCx. The first is the environment common in distributed systems,
where submitted jobs are scheduled to run and any output is returned to the user when
execution has completed. The interactive environment allows interactive execution of
programs, mainly for debugging purposes; interactive jobs are not queued. A program
executed interactively has exclusive access only to its CPUs, not to a whole LPAR.
Finally, there are only two LPARs available for interactive use, which means that at most
32 CPUs can be used; if one requests CPUs for interactive use when not enough are
available, execution is cancelled. To run either interactively or through the batch
scheduler we must use IBM's parallel environment (poe) and a batch
script. The difference when running interactively, rather than batch processing, is that
run time environment variables should be defined in one's shell and not in the batch file,
because batch file variables are ignored when running interactively.
All jobs submitted for this project requested 16 processors. Time on HPCx is charged
in multiples of nodes, i.e. sets of 16 processors, not in multiples of the number of CPUs
actually used. HPCx's batch scheduling system allocates full nodes; if fewer
processors are requested than are available on the allocated nodes, more L2
and L3 cache is available to each processor than in production runs on HPCx,
where full use of the CPUs is made. This would not have affected the results of this
project, but keeping the environment and conditions as realistic as possible was
preferred.
3.4 HPM toolkit
HPM was briefly described in section one; some information specific to HPCx and
this project is given here. The two utilities provided by the HPM toolkit are
hpmcount and the HPM library, also known as libHPM.
3.4.1 hpmcount
Hpmcount is used in the same way as the time command: one types hpmcount followed
by the name of the executable to be profiled, plus any additional options required.
This runs the executable and a trace is created upon completion. A job to profile a code
can be submitted in the usual way, by typing llsubmit parallel.ll. The following
variables need to be included in the batch file that is used (in this example
parallel.ll):
export HPM_DIR="/hpcx/usr/local/packages/actc/hpct/lib/"
export HPM_INC="/hpcx/usr/local/packages/actc/hpct/lib/"
export HPM_LIB="/hpcx/usr/local/packages/actc/hpct/lib/"
export HPM_EVENT_SET=1
poe.real ./programme_name
The HPM_EVENT_SET variable is used to specify the desired hardware counter group.
To make the groups easier to handle, a list of valid groups is usually provided by the
underlying software supporting hardware counters; on the HPCx system this is the
pmlist utility. The following settings, suggested by the HPCx user's manual, were
included in every batch file:
export MP_EAGER_LIMIT=65536
export MP_SHARED_MEMORY=yes
export MEMORY_AFFINITY=MCM
export MP_TASK_AFFINITY=MCM
This is all that is needed to run the executable, collect the necessary data and create a
report of hardware counter values during execution, along with some predefined derived
metrics that hpmcount calculates. Hpmcount profiles the whole program under
consideration, just like the time utility, rather than parts of it. It outputs one file per
process, containing the recorded hardware counter values, some predefined derived
metrics and some other information.
A small note about hardware counters should be made here. Hardware counters are
private to each thread. This means that it is the library's responsibility to ensure that,
if a thread's execution is suspended, the hardware counters are copied to a temporary
location and restored when the thread is switched back in. This ensures that the
measurements are not affected by noise from the operating system or from other threads.
Most importantly, although HPCx's nodes are exclusively allocated to jobs, one might run
more threads than the available processors to utilise simultaneous multithreading (SMT),
so it is important that this subtle issue is taken care of.
3.4.2 libHPM
Using hpmcount is easy and can give early results without much effort.
However, more complex analysis is usually required and cannot be performed with
the hpmcount utility. As hpmcount profiles the whole code from start to end, the
measurement contains information for parts of the code that are of no optimisation
interest, such as pending IO or initialisation time. Only the computationally intensive
parts of the code need to be profiled in order to identify performance bottlenecks, but
monitoring just these parts cannot be done without instrumenting the code. Using libHPM's
instrumentation calls, a programmer can mark the regions of a code that the HPM run time
library will profile. When tracing is complete, the output files contain one
report per instrumented section, rather than one report for the whole code.
The interface provided by libHPM is rather simple. Several versions of the HPM toolkit
are available on HPCx under various paths, but the one eventually found to fully
support all the hardware groups available on the Power5 processor is in this location:
/usr/local/packages/actc/hpct/lib/
Briefly, the interface to libHPM is this:

hpm.h The header file that needs to be included.

hpminit(rank, "name"), hpmterminate(rank) Functions that initialise and terminate tracing.

hpmstart(instID, "sectionID"), hpmstop(instID) Functions used to identify which sections of the source code HPM will monitor.

After the hpm header file is included, tracing must be initialised by calling hpminit
immediately after any variable declarations. It does not matter when tracing is terminated,
as long as this happens before MPI is finalised; to properly terminate tracing,
hpmterminate should be called before the programme exits. The sections of the programme
that are of interest should be marked by calling hpmstart at the beginning of the
section and hpmstop at the end of it.
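Putting the calls above together, the calling sequence for an MPI code looks roughly as follows. This is an illustrative sketch only: it assumes the C interface exactly as listed above, and it will not compile on a machine where the HPM toolkit and its hpm.h header are not installed.

```c
#include <mpi.h>
#include <hpm.h>   /* HPM toolkit header; only available where libHPM is installed */

int main(int argc, char *argv[]) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    hpminit(rank, "example");   /* initialise tracing, once per process */

    hpmstart(1, "compute");     /* mark the start of a monitored section */
    /* ... computationally intensive region to be profiled ... */
    hpmstop(1);                 /* mark the end of the section */

    hpmterminate(rank);         /* must be called before MPI_Finalize */
    MPI_Finalize();
    return 0;
}
```

The instrumentation ID (1 here) and section name are free choices; each hpmstart/hpmstop pair produces its own report in the output files.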
After instrumenting a code, we need to compile it against libHPM. The following flags
need to be used so that the license and the run time objects can be loaded:

-lhpm -llicense -lpmapi

After recompiling the program, the following environment variable needs to be defined
in the batch file (or in the current shell, in the case of an interactive job) to specify
which hardware counter group is desired:

HPM_EVENT_SET=140
An example of a full batch script that was used in this project follows:
#@ shell = /bin/bash
#@ job_name = helloHPM
#@ job_type = parallel
#@ CPUs = 16
#@ node_usage = not_shared
#@ bulkxfer = yes
#@ wall_clock_limit = 00:01:00
#@ account_no = z000
#@ output = $(job_name).$(schedd_host).$(jobid).out
#@ error = $(job_name).$(schedd_host).$(jobid).err
#@ notification = never
#@ queue
export MP_EAGER_LIMIT=65536
export MP_SHARED_MEMORY=yes
export MEMORY_AFFINITY=MCM
export MP_TASK_AFFINITY=MCM
export HPM_DIR="/usr/local/packages/actc/hpct/lib/"
export HPM_INC="/usr/local/packages/actc/hpct/lib/"
export HPM_LIB="/usr/local/packages/actc/hpct/lib/"
export HPM_EVENT_SET=128
poe.real ./hello
Finally, it should be pointed out that these are the C-compatible declarations for using
HPM. For Fortran codes the names are similar; the difference is that the names mentioned
in the previous list take the prefix "f_". More detailed documentation on using both
hpmcount and libhpm is available on IBM's web site [15] and on the HPCx support
web site [16].
3.5 Using paraver to profile codes
Available on the HPCx web site are a FAQ about using Paraver on HPCx [17] and a
beginner's guide [18]. Two informative technical reports on using Paraver
on HPCx [2], [1] are also available; for more information, refer to the documentation
available on paraver's web site. Users interested in applying this information
on HPCx should be aware that details such as installation directories and names are not
always as specified in the above sources. Nevertheless, by reading these documents one can
either quickly create a trace and start analysing a code, or learn how to make advanced
use of paraver.
Paraver uses IBM's DPCL, which requires a .rhosts file in the home directory containing
a list of every node on which the code is likely to run. Paraver is available on HPCx
but currently only works with executables that have been compiled and linked using the
-q32 compiler option, which selects 32-bit addressing; this is because of DPCL's
incompatibility with 64-bit addressing.
Even though paraver is an advanced and well documented tool, some issues were
encountered in this project. Paraver often crashes, and frequent saving is required to
avoid losing work. There is also a peculiar problem to do with compiz,
a compositing window manager installed on the machine on which the paraver analyses
were done. When compiz is enabled, paraver windows do not work well: notification
windows that ask for user input are not easily visible, as they are only a few pixels wide
and tall, although after being resized properly they work well. This compositing window
manager is not as solid as an X-window server and should probably not be used on a
production machine; nevertheless, it is quite common on desktop systems. The most
major issue, however, is that the documentation is not sufficient for the objectives set
in this project. The work involved is advanced and is not expected to be performed by
most users, as it is hardly described in paraver's manuals. There are several inaccuracies
in the information found, which often resulted in the project stalling until the correct
information was located. A couple of examples are mentioned later, when describing how
to instrument programmes to be profiled with paraver.
3.5.1 Basic paraver tracing
In order to use paraver we need to run a code through ompitrace, paraver's tracing
tool, and then start paraver's graphical interface and analyse the trace. For this process
no source code instrumentation is needed. A profile, which may or may not include
hardware counter values, is created for the whole code. The steps to create a trace are
these:
1. Run the executable through ompitrace to create a temporary trace file (.mpit)
for every process.

2. Run ompi2prv to combine the trace files into a single paraver (.prv) file.

3. Start paraver to visualise the generated .prv file.
As seen in this listing, there are three main tools provided by the paraver package.
The first is ompitrace, the programme that monitors the profiled code at
run time and creates a temporary trace file for each process. The second is ompi2prv,
which combines all the temporary files created at run time into a single paraver file. The
third is paraver itself, which is used to visualise and analyse the final file created by
ompi2prv. Executing the code to be analysed is required in order to collect
data for each process and record it to a per process trace file. Apart from paraver's design
features that allow it to handle very big traces, there is another tool in the paraver
package that makes manipulating big traces easier. This is cutter, which can generate
horizontal or vertical cuts of an existing trace file: cutting a trace horizontally selects a
subset of processors, whereas cutting vertically results in a trace of a time subset.
A very powerful feature of the paraver package lies in this process, because the steps
are clearly separated. They need not be executed on the same computer system; each
can be done on any of the systems paraver is available for, if this is convenient for
the user. This flexibility is sometimes also a necessity. For example, manipulating the
per process trace files may be slow on a desktop's hard drive compared to a
cluster's storage system; on the other hand, if a lot of memory is required, a desktop
often has more RAM than a cluster's node. Finally, analysing a trace file is easier on
a local machine than over a connection to a remote X-window server, which is usually
unacceptably slow due to network latencies.
There are two versions of paraver available on HPCx. One is available on
/usr/local/packages/paraver/
and a newer version on
/usr/local/packages/paraver/newversion
To profile a code on HPCx using paraver the following lines should be added to the
batch file:
export OMPITRACE_HOME=/usr/local/packages/paraver/newversion
export MPTRACE_COUNTGROUP=1
The first line defines the directory where paraver is installed, so that the license and
library files can be picked up, while the second line specifies which hardware counter group
to use. The following line:
poe.real ./hello
that is usually used to run a code must be replaced by the following command:
$OMPITRACE_HOME/bin/ompitrace -counters:mpi -v -r -nosw poe.real ./hello
This runs the code through ompitrace, whose arguments are briefly explained in this
listing:

-counters:mpi Specifies that the hardware counter values should be recorded and included
in the trace.

-v Invokes verbose output.

-r Should be used if the re-entrant (_r) versions of the compilers are used.

-nosw Switches off the software clock.
The re-entrant versions of the IBM compilers should always be used when compiling
code on HPCx, so the -r argument should always be passed to ompitrace. The
software clock should be disabled to obtain consistent times between nodes.
The -counters option indicates that hardware counter values should be recorded, and
its :mpi suffix specifies that the recording should happen every time a call to the MPI
library is trapped. Hardware counters can also be recorded when entering and exiting
user calls; this is done using the :calls option, and a text file with the names of the
functions to be monitored also needs to be created.
Trace file generation is completed by merging all the temporary trace files into a single
one that can be loaded into the paraver GUI. The following two commands are used on
HPCx to merge the temporary files:
export OMPITRACE_HOME=/usr/local/packages/paraver/newversion/
/usr/local/packages/paraver/newversion/bin/ompi2prv *.mpit -s *.sym -o name.prv
The first line exports a variable so that ompi2prv can find the libraries and the license
included in the paraver package; the second command creates the final trace file.
3.5.2 Paraver instrumentation
The paraver package provides an interface for the programmer to customise a code's
trace, through functions that instrument a code to define custom metrics and the values
they take. This is done by calling the ompitrace_event(event, value) function,
where event specifies which user defined metric is referred to, and value passes the
value that ompitrace should record for this event. These metrics are defined by the
user and can be anything the user would like to include in the trace to help later with
its analysis. A couple of functions are also available to instruct the tracing utility to
pause or restart tracing. An attempt was made to use these two functions, but they did
not suit the purposes of this project, especially since we were advised by BSC specialists
to avoid using them.
Submitting an instrumented code to be profiled with paraver on HPCx is done in the
same way as for non instrumented codes. What needs to be done is to recompile the code
and include ompitracef.h, which can be found in the installation directory of paraver.
As, for unknown reasons, the version of paraver that exists on HPCx did not work, the
paraver package was installed in a user's local directory, whose directory tree was used
to compile and link against. A sample Makefile that illustrates how this was done can
be found in the appendix. Finally, when linking a code instrumented with Paraver, the
ompitrace objects are also needed, so the -lompitrace option needs to be passed to
the linker. The following lines were used in Makefiles for this project and are enough to
compile and link an instrumented code:
FF=mpxlf90_r
FFLAGS=-q32 -O2 #-qsuffix=cpp=f90
FCFLAGS=-I/hpcx/home/z004/z004/nspattak/Paraver_new/include/
LFLAGS=-L/hpcx/home/z004/z004/nspattak/Paraver_new/lib -lompitrace
The commented-out option in the FFLAGS line was not needed in the case where these
lines were used, but it might be needed when compiling Fortran codes. The reason is that
paraver is a C++ code and, although bindings for linking Fortran code exist, the IBM
compiler needs this option to look for suffixes other than the default ones.
A rare error was also encountered in this project, when trying to compile an instrumented
version of the stream_mpi code. The error was:

ERROR: 0031-309 Connect failed during message passing
initialisation, task 1, reason: Unable to allocate storage

This error is not unknown and is included in the HPCx FAQ, where the solution was
found. The -bmaxdata:number compiler option sets the maximum size of the area
shared by the static data (both initialised and uninitialised) and the heap to number
bytes; this value is used by the system loader to set the soft ulimit. The default setting
was -bmaxdata=0, which apparently was not enough for the stream_mpi code to be
linked against both the MPI and paraver libraries. For the record, valid values of number
are 0 and multiples of 0x10000000.
3.5.3 Paraver’s UI and basic views creation
Paraver's user interface follows a simple logic. There are windows that visualise several
properties of the profile, and these windows can be combined to create new windows
whose properties are a combination of those of the windows they are based on. The
windows that visualise properties of the trace are called "views", while the rest of the
windows are used to control the view windows. A view either shows a timeline of events
or a 2D statistical analysis of values. New views can be created by modifying the number
or type of events that an existing view shows, and several views can be combined by
applying basic arithmetic (add, multiply, subtract and divide) to the values of existing
views. When a suitable view has been created, it can be saved in a configuration file;
the next time the same view is desired, it can be recreated automatically by simply
loading the configuration file. In the rest of this section some familiarity with paraver
is assumed, as this is not a user guide. Only the process of creating derived metric
configuration files and 2D statistical views is described, along with the options that are
of key importance in creating and using these files.
The default version of paraver's GUI available on HPCx fails to start due to
a license problem, so for this project paraver was installed in, and executed from, a
user's directory. After a trace had been created, paraver was executed from the command
line. This starts up three windows, two of which are paraver's main control windows.
The main windows are titled "paraver" and "Global Controller", while the
third one is the "Visualiser Module" window. The first provides loading
and unloading of input files, and the second gives access to other windows providing the
several other features paraver offers. After loading a trace, there is also a window that
shows a timeline view of the whole trace; when paraver is initially invoked, this is the
only view present. Values of hardware counters can be accessed by left clicking near an
event in the main view. Events are marked by little green flags in any view that has the
FLAG button turned on; the flags are shown at the top of the timeline, while the FLAG
button is at the bottom of the view. Figure 3 illustrates the windows that are shown
after following the instructions in this paragraph.
Figure 3: A trace loaded into paraver
To begin creating other views, we need to duplicate the initial view and work on
the new one. This is done by right clicking on the existing view and choosing clone.
The Visualiser Module controls the views that have been created and shows the
relationships between them, which is clarified later. To customise the new view, we need
to use the Filter Module to filter the events traced, so that only the ones we want
are shown. For example, to create a view showing the instructions completed, we should
select the clone of the first view, open the Filter Module window from the Global
Controller window, and choose to show the user events of type instructions
completed. There is one more option, next to the type field, which should be changed
from "all" to "=" so that only this event is shown in the view. The Filter Module is
shown in figure 4.

Having selected the desired event, paraver does not update the view to depict values
of the chosen type until the REDRAW button is clicked. Often, when this is done,
there is a small triangle with an exclamation mark at the bottom left of the view,
denoting a problem with the colour scaling. This can be fixed by right clicking on the
view and selecting:

Scale -> Fit Y-scale -> Fit both Y-scale

Having followed these steps, there should be a new window showing the values recorded
for an event, like the one in figure 5, which can now be used to create a derived metric
view.
Figure 4: Filtering Window
Before creating a derived metric view, another of paraver's control windows should
be introduced: the Semantic Module, which can be accessed from the Global
Controller window. The views created so far colour the timeline according
to the values recorded at the next event that triggered hardware counter monitoring. This
is not always the desired behaviour. To illustrate this, suppose there are two events which
trigger monitoring of hardware counter values, and the code being profiled executes
between these two events, as shown in figure 6. The default behaviour is to colour
according to the value V2 at time T2, but the derived metric we need to calculate might
depend on the value V1 at time T1, or on the weighted mean of the two. For example,
one might want each section in a view to be coloured according to the value the event
had at the beginning of the section, or according to the time average between events.
This can be done by opening the Semantic Module and choosing the appropriate
option under

Thread -> Event

The following options can be used to achieve the examples described above:

Last Evt Val, Average Next Evt Val
Figure 5: A created view

One available option is to show the value that was recorded at the beginning or at the
end of an interval between two events (Last Evt Val, Next Evt Val). Another
interesting option is to automatically divide either of these values by the time period
of the interval (Avg Last Evt Val, Avg Next Evt Val). It is important to
know these options if one needs to reproduce the work done in this project, as they were
often the source of mis-configured derived metric views.
3.5.4 Creating a configuration file
In order to create a derived metric view, one first needs to open two views that show the
metrics to be combined, as shown in figure 7. To create a derived metric that calculates
the fixed point operations per cycle, which is this section's example, we need to have
one view showing the FXU produced a result metric, and one showing the processor
cycles. Both views were created as described in the previous section.
To create the new derived metric window, one selects the first metric's window in the
Window Browser of the Visualiser Module and then presses the Derived
button. A small window opens and asks for the other view with which the first
should be multiplied. After selecting the other view's window name, a new view is created,
showing the product of the previous two, and looks like figure 8.

Figure 6: Timescale
By accessing the Semantic Module for the derived window, one can change the
operation to add, subtract, divide, maximum or minimum, and assign weights to each
operand. The newly created view shows a derived metric which is the result of the
chosen operation. To save the view to a configuration file, right-click on the view and
choose save as.
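The combination paraver applies can be thought of as an element-wise operation on the per-interval counter values of the two views. A minimal sketch of this arithmetic, using hypothetical counter values (this illustrates the operation only; it is not paraver's API):

```python
# Hypothetical per-interval counter values from two views of the same trace.
fxu_results = [1.2e6, 0.9e6, 1.5e6]   # "FXU produced a result" counter
cycles      = [2.0e6, 1.8e6, 2.5e6]   # processor cycles counter

# A derived view applies one operation element-wise; division yields the
# fixed point operations per cycle for each interval of the timeline.
fixed_point_per_cycle = [f / c for f, c in zip(fxu_results, cycles)]
```

Choosing add, subtract, maximum or minimum in the Semantic Module corresponds to swapping the operator applied to each pair of values.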
If the steps briefly described here are not followed carefully, the results reported by paraver
might not be correct. No derived view should be used unless it has been verified as
correct, and that verification was the main goal of this project. Now that the process of creating
configuration files has been described, it should be clear how important it is for
anyone wishing to use these features of paraver to be aware of these details.
Figure 7: Two views to be used to create a derived metric view
4 Aim and methodology, test case programmes
This project's objective was to create configuration files for paraver that automatically
create views of the trace, showing values collected from the hardware counters or
metrics derived from those counters. Such files already exist, but not for the Power5
processor hardware counters. These views can also be used to obtain, through paraver,
statistical summaries of all these recorded or derived values. The files were validated to
ensure that their results are correct. Validation of the configuration files took place by profiling
some codes using both paraver and the HPM toolkit, and comparing the results
to check whether they match. The difficulty in deciding whether the results match
is that some of the quantities measured are not always reproducible, nor can
they be estimated a priori.
Validating the results was easy for some of the derived metrics and harder for
others. Derived metrics counting quantities that can be estimated by looking at the source
code, such as floating point operations or fused multiply-add instructions, were the easy
ones to verify. Harder ones involve quantities that can be estimated but are not known in
advance. For example, the number of data loads from the L3 cache can be estimated to
equal the total amount of a programme's data, but only if the programme does not use any
of its variables twice. There are other quantities, even harder to verify, such as cache
misses or TLB misses.

Figure 8: Derived metric view
When profiling parallel applications using libHPM, one text file is created for each
process traced. This file contains, for each instrumented section, the difference between
the hardware counter values at the end and at the start of the section. These files can be very
big, but since they are text files they can easily be manipulated using standard Unix tools
to extract the values of interest when profiling an application. For this project the following
procedure was used:
1. Use a couple of commands to extract the hardware counter values for each in-
strumented section and paste them into a single text file.
2. Import this file into a spreadsheet and work on the data to compute the derived
metrics.
The commands that were used are:
for i in *.hpm; do
    sed -n '38,43p;85,90p' $i | \
    perl -np -e 's/.*?: *(\d*).*?/\1/g' > ${i%%.*}.values
done
paste *.values > ALL.values
One script file containing the above commands was created for each hardware counter
group, since the values of interest were not always on the same lines of the output files
across different hardware counter groups. The process of submitting a code to be traced
was also done using scripts. Several further scripts, which copied these values to a
spreadsheet so that derived metrics could be calculated and compared to the ones paraver
calculated, made the procedure almost automated. The process of obtaining the derived
metric values for parts of a code with paraver is described later in this chapter.
Once the spreadsheet was populated with the numbers reported by libHPM and paraver,
it contained a very large amount of data. For each trace created for each processor, the
desired derived metrics were calculated within the spreadsheet, and then an average and
standard deviation were computed. These values were then compared to the ones paraver reports.
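The per-processor averaging step can be sketched as follows, using Python's statistics module and hypothetical per-process values for one derived metric:

```python
from statistics import mean, stdev

# Derived-metric values for one instrumented section, one entry per MPI
# process (hypothetical numbers, for illustration only).
per_process = [2.001, 1.998, 2.000, 2.003]

avg = mean(per_process)
sd = stdev(per_process)   # sample standard deviation, as a spreadsheet's STDEV

# avg is then compared against the value paraver reports for that section.
```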
4.1 Custom written programmes
Some simple, short codes were written and used as test cases in this project. They
were profiled with both paraver and libHPM so as to verify the results paraver reported.
Although these codes did not perform useful calculations or provide a
service, they were useful for two reasons. First, learning how to use paraver
and the HPM toolkit was not straightforward, and using programs of a complexity similar
to the "Hello, world" programme made learning easier than using production-ready
codes, which are quite complex. Both paraver and the HPM toolkit are documented,
and some technical reports are also available on HPCx, but it was not always easy to
reproduce what was documented. Simple codes are also useful because they can be
written specifically to validate the profiling of quantities that can be estimated, such
as floating point operations completed. The computational parts of the code that was
finally used for this purpose are these:
integer :: ierror, i, j, rank
integer, parameter :: N=2000, K=1000
real*8 :: a(N), b(N)

a = 1.123123d0
b = 101.321456d0

do 10 j = 1, K
   do 20 i = 1, N
      b(i) = b(i) + a(i)
20 continue
10 continue

do 30 j = 1, K
   do 40 i = 1, N
      b(i) = 1.0 + a(i)
40 continue
30 continue

end program hello
The code used also included MPI calls to perform the same calculations on several
processors. This was chosen so as to utilise all processors available on a compute
node on HPCx. Obviously this does not affect the results or the process of creating
configuration files, but it was preferred as it mimics a real code better than running on
a single CPU. This programme performs a known number of floating point operations
and accesses a known number of variables. Depending on the parameters N and K chosen,
one can also experiment with cache misses at all levels while maintaining a run time that
is not too small, so that the programme does not exit too soon. Different array sizes can be
used to verify quantities that are not easy to test. For example, if the arrays are small enough
to fit in a certain cache level, then the data loaded from this cache should equal the arrays'
sizes.
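The expected counts for this code follow directly from N and K; a quick check of the arithmetic:

```python
# Parameters from the test code above.
N, K = 2000, 1000

# Each doubly nested loop performs one floating point add per iteration.
adds_per_loop = N * K             # 2,000,000 flops per instrumented section
total_flops = 2 * adds_per_loop   # two loops in the programme

# Data volume of the two real*8 arrays, in bytes.
array_bytes = 2 * N * 8
```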
4.2 Stream Benchmark
Sustainable Memory Bandwidth in High Performance Computers (STREAM) is a simple
synthetic benchmark program that measures sustainable memory bandwidth (in MB/s)
and the corresponding computation rate for simple vector kernels. It tries to measure
the effective memory bandwidth of a computer by fetching data from memory
to the processor, which is often the bottleneck on modern computers. Floating point
operations per second can be severely reduced if there are a lot of cache misses, and
such a programme suffers all the L2, L3 or main memory latencies. This benchmark
uses two to three arrays and performs the following calculations:
c(j) = a(j)                (copy: copies one array to another)
b(j) = scalar*c(j)         (scale: scales one array and stores it in another)
c(j) = a(j) + b(j)         (add: adds two arrays and stores the result in a third)
a(j) = b(j) + scalar*c(j)  (triad: scales one array, adds it to another and stores the result in a third)
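The four kernels can be sketched directly in Python (an illustration only; the real benchmark is a timed C/Fortran code operating on arrays far larger than cache):

```python
# Tiny illustrative arrays; STREAM itself uses very large ones.
N = 1000
scalar = 3.0
a = [1.0] * N
b = [2.0] * N
c = [0.0] * N

c = [a[j] for j in range(N)]                  # copy
b = [scalar * c[j] for j in range(N)]         # scale
c = [a[j] + b[j] for j in range(N)]           # add
a = [b[j] + scalar * c[j] for j in range(N)]  # triad
```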
Stream is a well-established benchmark and is also used in the HPC Challenge bench-
mark, which targets big supercomputing clusters. The MPI version of stream was used
so as to utilise every processor in an LPAR. In order to stress the memory system of a
computer, the benchmark operates on arrays sufficiently large that the caches cannot hold
all the arrays it operates on; it is recommended to use arrays four times the size of the
available caches. On HPCx, each LPAR has 16 processors, each having 64 KB of L1
data cache, 1.9 MB of L2 cache for every pair of cores and 128 MB of L3 cache for
every MCM (i.e. four chips/eight cores). In total there is 2 × 139366.4 KB = 278732.8 KB
of cache available. However, as this benchmark was used as a test code and not for
benchmarking the machine, several array sizes were used, sometimes fitting in the L1,
in the L1 and L2, or in all levels of cache. They were used to test the hardware counters that
count cache hit and cache miss rates. Stream was also chosen because it is a simple code, yet
a fully functional code used in real life.
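The cache total quoted above can be checked with a few lines of arithmetic, using the sizes from the LPAR description (16 processors, 64 KB of L1 each, 1.9 MB of L2 per core pair, 128 MB of L3 per eight-core MCM):

```python
# Per-MCM cache, in KB: 8 cores of L1, 4 core-pairs of L2, one shared L3.
per_mcm_kb = 8 * 64 + 4 * 1.9 * 1024 + 128 * 1024   # 139366.4 KB

# An LPAR holds two MCMs (16 processors), matching the figure in the text.
total_kb = 2 * per_mcm_kb                            # 278732.8 KB
```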
4.3 LAMMPS code
Lammps is a molecular dynamics code that
models an ensemble of particles in a liquid, solid or gaseous state. It can be
used to model atomic, polymeric, biological, metallic or granular systems.
as is mentioned on the HPCx web site. It is a code that scales very well to very large
processor counts. It was used to test the paraver configuration files created using a
complicated, production-ready code.
4.4 Using paraver to obtain information on sections of a code
The views created as described in the previous chapter, and the configuration files that
were then saved, were used to obtain profiling information for the code sections
described earlier in this chapter. The first problem in doing this was how to trigger
hardware counter monitoring before and after the regions of interest. For the hello and
stream_mpi codes this was done by adding barriers to mark the interesting sections of
the code: one MPI_Barrier call was made just before a section was entered, and
one just after it finished executing. An alternative is to move these sections of the code
into a function and ask the ompitrace tool to monitor them just before this function
is called and immediately after it exits. This second method was used for triggering
hardware counters for the lammps code, as the first method was not applicable. A
text file was created containing the name of the function that performs the main calculation,
and this file was passed as an argument to ompitrace so that it knew at which
function calls it should trigger hardware counter monitoring.
Unlike libHPM, in which sections of interest are marked by calls to libHPM, paraver
monitors hardware counters every time an event (MPI or user function call) happens, as
was said in the section describing instrumentation with paraver. Currently there is neither
a way to mimic libHPM's behaviour in paraver nor a way to switch profiling on or
off; therefore a marking technique was applied.
A user-defined event was used, with different values corresponding to different regions
of the code. The value one was used to mark the regions of the code that were not of
interest to us, while values two, three, etc. marked the sections that were of
interest. Having done this, one can get a summary of the results for each region by
using the 2D statistical analyser paraver provides. The way to do this is as follows. The
trace is loaded with paraver in the usual way and a configuration file is loaded as has
already been described. Then a new view is created and set to show the user-defined
metric. The Statistical Analyser is then opened and the cursor changes to
indicate that paraver is expecting the user to choose a section of one of the views active
in the current analysis. The user then chooses the whole timeline that shows the user-defined
metric and a 2D statistical analysis is produced, containing the values
the user set the metric to; in our case that is 1, 2, 3, and so on, as shown in the following
screenshot. To compute the statistics of the interesting parts, one needs to choose a
function under Statistic, then select the derived metric under Data Window,
and then select Repeat. There are several functions available under Statistic,
the most interesting ones for the purposes of this project being average and maximum.
This view can also be saved to a configuration file, just like the derived metric views.
It was decided to first change the user-defined value and then trigger hardware counter
monitoring.
5 Results
5.1 64/32bit note
Even though the Power5 is a 64-bit chip, all compilations were in 32-bit mode. When the
hello code with 32-bit wide variables was profiled using hardware counter group
128, there were 50% more level 1 cache store/load instructions performed than ex-
pected. While running the simple code it was observed that one extra operation was
performed when using 32-bit variables and addressing. It appears that all calculations are
performed in 64-bit arithmetic, as the Power5 is a 64-bit microprocessor, and storing 32-
bit floating point variables required one more instruction. Using 64-bit wide variables
decreased the memory load/store operations to the estimated number. Using
SIMD instructions (named VMX by IBM) would probably further improve the code's
performance.
5.2 Existing configuration files
Numerous configuration files were tested for this project. We did not go so far as to
verify that the numbers Paraver calculates for these three codes are correct; with more
time, testing could have been completed and more results included here. The files were
not verified according to the intended work plan, but all of them are properly defined.
This was verified by manually comparing the hardware counter values recorded by
paraver and libHPM, which match. Opening text files to look at libHPM's results and
then comparing them by eye with the values Paraver reports is time-consuming and
error-prone, but it was done because the intended testing plan did not work out. The plan
was to compile tables for each test case containing values from both Paraver and libHPM;
these values should be either the same or similar, depending on the quantities measured.
Testing was not completed as planned because we were not able to create such tables,
for a number of reasons which vary from case to case. If one uses these configuration files,
one should be careful with scales (especially time scales), with the visual representation
of the data in the views, and with how the hardware counter values are used to compute the
derived metrics. The list below shows the derived metrics that were tested; each entry
gives the name of the configuration file, the hardware counter group it requires, and a
short description:
aflops (group 137): MFLOP/s.
data_L25_modified (group 50): # of lines from L2.5 that were modified.
data_from_L25_shared (group 50): # of lines from L2.5 that are shared.
data_from_L275_modified (group 50): # of lines from L2.75 that were modified.
data_from_L275_shared (group 50): # of lines from L2.75 that were shared.
data_loaded_from_L3 (group 129): data loaded from the L3 cache.
data_loaded_from_lmem (group 129): data loaded from local memory.
dataTLB_misses (group 43): # of data TLB misses.
Fixed_point_op_per_cyc (group 144): FXU results per processor cycle.
FMA_percentage (group 137): FPU multiply-add instructions per aflops.
Instructions_per_cycle (most groups): instructions completed per cycle.
Instr_per_load_store (group 128): instructions per load/store.
Instr_per_run_cycle (most groups): instructions completed per run cycle.
L1_load_misses (group 142): L1 load misses.
L1misses_per_us (group 142): L1 misses per microsecond.
L1_store_misses (group 142): L1 store misses.
Loads_stores_perTLBmiss (group 130): # of loads and stores per TLB miss.
Loads_per_load_miss (group 43): # of loads per load miss.
Loads_per_TLB_miss (group 43): # of loads per TLB miss.
Local_L2_load_traffic (group 128): local L2 load traffic.
Local_L3_load_traffic (group 131): local L3 load traffic.
Local_memory_load_traf (group 131): local memory load traffic.
MIPS (all groups): millions of instructions per second.
Stores_per_store_miss (group 44): # of stores per store miss.
%TLB_misses_per_run_cy (group 130): % of TLB misses per run cycle.
TLB_misses_per_run_cy (group 130): TLB misses per run cycle.
Total_FP_Load_Store_Op (group 141): total FP load/store operations.
Total_L1_misses (group 142): total L1 misses.
Total_loads_from_L2 (group 128): total loads from the local L2 cache.
Total_loads_from_L3 (group 131): total loads from the local L3 cache.
Total_Loads_From_local_mem (group 131): total loads from local memory.
All of these configuration files read the correct values from the hardware counters, and
the derived metrics are properly defined. This was verified by comparing, by hand, the
raw values reported by libHPM and by Paraver. These values were verified only for
the simple code, as it was the only one for which this could be done manually. The other
two programmes could only be checked by comparing the tables reported by Par-
aver's Statistical Analyser to the same quantities computed from libHPM's
output; these quantities were computed in a spreadsheet file, as was described in an
earlier chapter. When loading some of the other configuration files, minor adjustments,
such as scaling, might be necessary. We were not able to determine exactly which con-
figuration files are ready to use and which are not, because a configuration file could
appear correct while profiling the simple code but not while profiling the
stream_mpi code, or vice versa. One common problem was that a configuration file
calculated an incorrectly scaled metric and we had to change the scale.
5.3 AFLOPS
One of the first derived metric we worked on, was the algebraicfloating point oper-
ations per second. The floating point operations per second achieved is one of the
most interesting metrics and serves well as an example of thetesting process we fol-
lowed. This is a tricky metric as the Power5 chip does not provide a hardware counter
that counts FLOPs. Instead the FLOPs rate can algebraicly becomputed by using
thePM_FPU_FIN, PM_FPU_STF, PM_FPU_1FLOPhardware counters. The Power5
chip provides thePM_FPU_FINhardware counter which counts how many instructions
the floating point unit produced. As the Power5 processor supports fused multiply-add
41
instruction, such operations are counted as a single instruction, which they are. From
a performance perspective however this instruction countsfor two floating point opera-
tions. To complicate things more, this counter also counts floating point stores. As non-
FMA flops are counted byPM_FPU_1FLOPand floating point stores byPM_FPU_STF
the actual flop count is given by this expression:
2*(PM_FPU_FIN- PM_FPU_STF) - PM_FPU_1FLOP
This can also be found in [2]. It should be noted that there is no single counter
group containing all three of these counters. To overcome this, the following approxima-
tion can be used:
PM_FPU_1FLOP + 2 * PM_FPU_FMA
It is this formula that the aflops configuration file uses. Table 3 shows the average value
recorded by libHPM for each counter over all 16 processors and then computes
the aflops derived metric. The last line shows the aflops metric as Paraver calculates it.
a(i)+b(i)                              aflops calculated (MFLOP)
libHPM:
  PM_FPU_1FLOP                         2000010.31
  PM_FPU_FMA                           0
  (FPU_1FLOP + 2*FPU_FMA) * 1e-6       2
Paraver                                2

Table 3: AFLOPS results from libHPM and Paraver for the a(i)+b(i) calculation in the
simple "Hello" code
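The two formulas can be checked numerically with the counter values from table 3. A small sketch follows (the function names are ours, not part of any tool):

```python
def flops_exact(fpu_fin, fpu_stf, fpu_1flop):
    """Exact flop count: 2*(PM_FPU_FIN - PM_FPU_STF) - PM_FPU_1FLOP."""
    return 2 * (fpu_fin - fpu_stf) - fpu_1flop

def flops_approx(fpu_1flop, fpu_fma):
    """Single-group approximation: PM_FPU_1FLOP + 2*PM_FPU_FMA."""
    return fpu_1flop + 2 * fpu_fma

# Average counter values recorded by libHPM for the a(i)+b(i) section:
mflops_metric = flops_approx(2000010.31, 0) * 1e-6   # cfg file scales by 1e-6
# mflops_metric is about 2, agreeing with the estimate and Paraver's value.
```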
The aflops.cfg and results.odf files are included in the electronically submitted archive,
and aflops.cfg is also included in the appendix. The aflops.cfg file can be used in paraver
to create a view suitable for the statistical analyser to produce results similar to the ones
shown here. The results.odf file contains all the measurements that this section
discusses.
6 Conclusions and tools evaluation
6.1 Project assessment
Only some of the aims of this project were accomplished, since the project ran short of
time. Delivering more configuration files was a major objective
which was not fully achieved, as time constraints did not allow it. Had they been created and
tested, they would be of great use to users wishing to use paraver to optimise
their software on HPCx. However, the ones tested are among those most
likely to be used, and they are now available. Another objective of this project was to
create the same, or similar, configuration files for a cluster of a different architecture
and use them to compare the performance of several codes across architectures.
The original work plan was to finish creating and testing the configuration files for
the Power5 architecture by the end of July. This is not how things actually
went, for several reasons. At the very beginning of the project,
a problem was that the available documentation concerning our working environment
was not accurate. This resulted in a lot of time spent trying to find
out where paraver and the HPM toolkit were installed and getting some profiling results
from them. Even when some directories were found, the versions of the tools
they contained either did not fully support the Power5 microprocessor or
were not used correctly. Using the HPM toolkit suffered the most from these
documentation issues.
Another reason this project was so delayed is that it relied heavily
on instructions from paraver's authors. Even though they were
very helpful and did give us the instructions we needed, this process was not fin-
ished before early July. The author was also inexperienced in profiling, which did not
help the project at all. Additionally, the fact that he did not know how to use paraver,
a tool that is quite hard to use, contributed to major research delays: his inexperience and
paraver's user interface combined to magnify the effects on the project.
These problems could have been avoided if it had been known, before the project
began, how these tools could be used for its needs. No matter how painful
this trial-and-error procedure was, it was still an educational process that gave results
which can also benefit other people attempting to profile code.
It should be clear by now that this project involved a great deal of work, with many parameters
to take care of. It is the author's view that if this dissertation can help people
use the HPM toolkit and paraver without troubles like the ones encountered in this project,
then the project should be considered successful, more so than by simply delivering some
configuration files. The value of the work done turns out to lie in the fact that both
tools can be tricky to use, while the actual process of creating configuration files is
easy. As computer hardware becomes outdated every few years, being able to use a powerful,
even if imperfect, tool like paraver to create configuration files is as important as
having the specific files for one architecture.
6.2 HPM evaluation
The HPM toolkit is pretty straightforward to use. If there were not multiple versions
on HPCx, it could have been used within minutes; this would have been the case if the
available documentation on the HPCx web site were up to date. Hopefully, having read
this document, one can now get started quickly without searching multiple directory trees. As
the toolkit is backed by IBM, it is solid and integrates very well with the HPCx environment.
If more functionality were to be added, a way for the user to define their own derived
metrics would be very useful. One more step could
be taken by creating a GUI to graphically view the recorded values, which would be
nice.
6.3 Paraver
Paraver is a very powerful tool with a lot of features, and it is given free of charge
even though it is not under any of the licenses approved by the Open Source Initiative. It
is a great tool that can profile parallel codes and can easily provide a lot of information
about a code's run-time behaviour, information that would otherwise be difficult to
obtain. Special thanks go to the Barcelona Supercomputing Centre (BSC),
namely Prof Jesus Labarta, Judit Gimenez, German Llort and Harald Servat, for their
tutorials and their support.
There are several things that could be improved, since paraver, like any other programme,
is not perfect. To begin with, its stability needs serious improvement: when it was run
for this project it crashed very often and occasionally caused the operating system to
crash. Its GUI layout could also change so that it is easier to use. As it is, it is very
difficult to understand what the many available options are, or to find which option
implements a feature the user is thinking of. This results in users thinking paraver
cannot do something because they cannot find how to do it, when it actually can be done.
After all, a utility that uses a GUI is supposed to be easy to use. It is the writer's belief
that, were it open source, it could be greatly assisted by an open development model.
One feature that could be improved is that the windows showing timeline views
of a profile, or windows containing values found in the trace, are not always updated as
the user analyses a profile; the user needs to explicitly press the REDRAW button for
paraver to update the values displayed in a view. The same applies when resizing a window,
where one must manually ask paraver to rescale the view to fit the window; there is no
apparent reason why this is not done automatically. As regards hardware counter usage,
another feature that could be added, and would make paraver easier to use, is to
automatically pick or disable options depending on other options. It would also be very
helpful if there were a way to manually trigger hardware counter recording when
instrumenting a code, just as libHPM does. Finally, a command line interface could also
be of use to automate tasks done repetitively on several profile files.
6.4 Future work
Future work might involve creating and testing more configuration files for
HPCx. This could be done after surveying current users who are interested
in these configuration files, so that they can give a clear picture of their needs;
their answers could indicate which other derived metrics would be the most useful.
Further work could create configuration files for other architectures
and compare how several widely available codes perform across architectures.
Moving in a slightly different direction from paraver, one could create programmes
or scripts that manipulate HPM's output to obtain profile information based on hardware
counter data.
A Appendix
A.1 Source code and sample Makefile
This is the source code of the two codes that were used for the tests in this dissertation.
Lammps' source code, the third code used, is too long to be included here; instead, only
the file that was instrumented is included. There were two versions of each programme,
one instrumented using libHPM and one using Paraver. There are also two sample
Makefiles that can be used to compile a programme instrumented with either libHPM
or Paraver.
Hello-libHPM
program hello
implicit none
include 'mpif.h'
#include "f_hpm.h"
integer :: ierror, i, j, rank
integer, parameter :: N=2000, K=1000
real*8 :: a(N), b(N)

call MPI_INIT(ierror)
call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
call f_hpminit(rank, 'test4')

a = 1.123123d0
b = 101.321456d0

print *, 'Hello, starting ', N, ' adds of a_i+b_i ', K, ' times.'
call f_hpmstart(80, 'bna')
do 10 j = 1, K
   do 20 i = 1, N
      b(i) = b(i) + a(i)
20 continue
10 continue
call f_hpmstop(80)
call MPI_BARRIER(MPI_COMM_WORLD, ierror)

print *, 'Starting ', N, ' adds of 1+a(i) ', K, ' times.'
call f_hpmstart(90, 'an1')
do 30 j = 1, K
   do 40 i = 1, N
      b(i) = 1.0 + a(i)
40 continue
30 continue
call f_hpmstop(90)
call MPI_BARRIER(MPI_COMM_WORLD, ierror)

call f_hpm_terminate(rank)
call MPI_FINALIZE(ierror)
end program hello
Hello, Paraver
program hello
implicit none
include 'mpif.h'
include 'ompitracef.h'
integer :: ierror, i, j
integer, parameter :: N=2000, K=1000
real*8 :: a(N), b(N)

call MPI_INIT(ierror)
call OMPITRACE_EVENT(6000019, 1)

a = 1.123123d0
b = 101.321456d0

print *, 'Hello, starting ', N, ' adds of a_i+b_i ', K, ' times.'
call OMPITRACE_EVENT(6000019, 2)
call MPI_BARRIER(MPI_COMM_WORLD, ierror)
do 10 j = 1, K
   do 20 i = 1, N
      b(i) = b(i) + a(i)
20 continue
10 continue
call MPI_BARRIER(MPI_COMM_WORLD, ierror)
call OMPITRACE_EVENT(6000019, 1)

print *, 'Starting ', N, ' adds of 1+a(i) ', K, ' times.'
call OMPITRACE_EVENT(6000019, 3)
call MPI_BARRIER(MPI_COMM_WORLD, ierror)
do 30 j = 1, K
   do 40 i = 1, N
      b(i) = b(i) + a(i)
40 continue
30 continue
call MPI_BARRIER(MPI_COMM_WORLD, ierror)
call OMPITRACE_EVENT(6000019, 1)
call MPI_FINALIZE(ierror)
end program hello
Makefile Paraver
FF=mpxlf90_r
SRC=test4.f90
EXE=hello
# Compiling and linking 32bit applications, some optimizations are welcome.
FFLAGS=-q32 -O2 #-qsuffix=cpp=f90 no longer needed
# Compilation-only flags, using the paraver include dirs.
FCFLAGS=-I/hpcx/home/z004/z004/nspattak/Paraver_new/include/
# Link-only flags, using the paraver libs and the ompitrace library.
LFLAGS=-L/hpcx/home/z004/z004/nspattak/Paraver_new/lib -lompitrace
#
# No need to edit below this line
#
.SUFFIXES:
.SUFFIXES: .f90 .o

OBJ= $(SRC:.f90=.o)

.f90.o:
	$(FF) $(FFLAGS) $(FCFLAGS) -c $<

all: $(EXE)

$(EXE): $(OBJ)
	$(FF) $(FFLAGS) $(LFLAGS) -o $@ $(OBJ)

$(OBJ): $(MF)

tar:
	tar cvf $(EXE).tar $(MF) $(SRC) *.prv *.pcf inter.ll parallel.ll

clean:
	rm -f $(OBJ) $(EXE) core
A.2 Sample Paraver .cfg file
The aflops.cfg file is included here as an example of a configuration file.
aflops.cfg
ConfigFile.Version: 3.4
ConfigFile.NumWindows: 3
ConfigFile.BeginDescription
Algebraic floating point operations.
This is the same as MFlop/s if floating divides and square roots are small.
aflops = (FPU executed one flop instruction
+ (2 * FPU executed mult-add instruction)) * 0.000001
Uses counter group 137
ConfigFile.EndDescription
########################################################################
< NEW DISPLAYING WINDOW FPU executed mult-add instruction >
########################################################################
window_name FPU executed mult-add instruction
window_type single
window_id 1
window_position_x 250
window_position_y 138
window_width 600
window_height 300
window_flags_enabled true
window_units Nanoseconds
window_maximum_y 18.000000
window_scale_relative 1.000000
window_object appl { 1, { All } }
window_begin_time_relative 0.000000
window_pos_to_disp 80
window_pos_of_x_scale 18
window_pos_of_y_scale 80
window_number_of_row 12
window_click_options 1 0 1 1 1 0
window_click_info 0 0 0 0 0
window_expanded true
window_open false
window_selected_functions { 14, { {cpu, Active Thd}, {appl , Adding},
{task, Adding}, {thread, Next Evt Val}, {node, Adding},
{system, Adding}, {workload, Adding}, {from_obj, All},
{to_obj, All}, {tag_msg, All}, {size_msg, All}, {bw_msg, All},
{evt_type, =}, {evt_value, All} } }
window_compose_functions { 9, { {compose_cpu, As Is}, {compose_appl, As Is},
{compose_task, As Is}, {compose_thread, As Is},
{compose_node, As Is}, {compose_system, As Is},
{compose_workload, As Is}, {topcompose1, As Is}, {topcompose2, As Is} } }
window_analyzer_executed 0
window_analyzer_info 0.000000 0.000000 0 0
window_filter_module evt_type 1 42001054
#######################################################################
< NEW DISPLAYING WINDOW FPU executed 1 flop instruction >
#######################################################################
window_name FPU executed 1 flop instruction
window_type single
window_id 2
window_position_x 225
window_position_y 84
window_width 600
window_height 300
window_flags_enabled true
window_units Nanoseconds
window_maximum_y 18.000000
window_scale_relative 1.000000
window_object appl { 1, { All } }
window_begin_time_relative 0.000000
window_pos_to_disp 80
window_pos_of_x_scale 18
window_pos_of_y_scale 80
window_number_of_row 12
window_click_options 1 0 1 1 1 0
window_click_info 0 0 0 0 0
window_expanded true
window_open false
window_selected_functions { 14, { {cpu, Active Thd}, {appl, Adding},
{task, Adding}, {thread, Next Evt Val}, {node, Adding},
{system, Adding}, {workload, Adding}, {from_obj, All},
{to_obj, All}, {tag_msg, All}, {size_msg, All},
{bw_msg, All}, {evt_type, =}, {evt_value, All} } }
window_compose_functions { 9, { {compose_cpu, As Is}, {compose_appl, As Is},
{compose_task, As Is}, {compose_thread, As Is},
{compose_node, As Is}, {compose_system, As Is},
{compose_workload, As Is}, {topcompose1, As Is}, {topcomp ose2, As Is} } }
window_analyzer_executed 0
window_analyzer_info 0.000000 0.000000 0 0
window_filter_module evt_type 1 42000056
#######################################################################
< NEW DISPLAYING WINDOW Algebraic floating point operations >
#######################################################################
window_name Algebraic floating point operations
window_type composed
window_id 3
window_factors 2.000000 1.000000
window_operation add
window_identifiers 1 2
window_position_x 280
window_position_y 216
window_width 600
window_height 300
window_comm_lines_enabled false
window_color_mode window_in_null_gradient_mode
window_units Nanoseconds
window_maximum_y 3964506.000000
window_minimum_y 4.000000
window_scale_relative 1.000000
window_object appl { 1, { All } }
window_begin_time_relative 0.000000
window_pos_to_disp 597
window_pos_of_x_scale 18
window_pos_of_y_scale 80
window_number_of_row 12
window_click_options 1 0 1 1 1 0
window_click_info 1 9758503 11879915 11 10819209
window_expanded true
window_open true
window_compose_functions { 9, { {compose_cpu, As Is}, {com pose_appl, As Is},
{compose_task, As Is}, {compose_thread, As Is},
{compose_node, As Is}, {compose_system, As Is},
{compose_workload, As Is}, {topcompose1, As Is},
{topcompose2, As Is} } }
window_analyzer_executed 1
window_analyzer_info 0.000000 109677076.000000 1 12
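The derived metric that composed window 3 computes from windows 1 and 2 (factors 2 and 1, operation add, with the 0.000001 scale from the description block converting raw counts to millions) can be sketched in Python. This is an illustrative reconstruction of the formula only; the function name is invented and nothing like it ships with Paraver:

```python
def aflops(one_flop_instructions, mult_add_instructions):
    """Algebraic floating-point operations, in millions (hypothetical helper).

    Mirrors the aflops.cfg composed window: each fused multiply-add
    counts as two floating-point operations, each one-flop instruction
    as one, and the sum is scaled by 0.000001.
    """
    return (one_flop_instructions + 2 * mult_add_instructions) * 0.000001
```

For example, a region that executed one million one-flop instructions and half a million multiply-adds would report 2.0 million algebraic floating-point operations.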