
Advanced High Performance Computing Using Linux

TABLE OF CONTENTS

1.0 Advanced Linux
1.1 Emacs: The Most Advanced Text Editor
1.2 Advanced Scripting
2.0 Advanced HPC
2.1 Computer System Architectures
2.2 Processors and Cores
2.3 Parallel Processing Performance
3.0 Introductory MPI Programming
3.1 The Story of the Message Passing Interface (MPI) and OpenMPI
3.2 Unravelling A Sample MPI Program and OpenMPI Wrappers
3.3 MPI's Introductory Routines
4.0 Intermediate MPI Programming
4.1 MPI Datatypes
4.2 Intermediate Routines
4.3 Collective Communications
4.4 Derived Data Types
4.5 Particle Advector
4.6 Creating A New Communicator
4.7 Profiling Parallel Programs
4.8 Debugging MPI Applications

    1.0 Advanced Linux

1.1 Emacs: The Most Advanced Text Editor

    In the introductory course, we used nano as our text editor. In the intermediate

    course, we used vim. Finally, in this advanced course, we'll provide an

    introduction to Emacs, which can be described as the most advanced text editor

    and is particularly popular among programmers.

Emacs is one of the oldest continuously developed software applications available, first written in 1976 by Richard Stallman, founder of the GNU free software movement. At the time of writing it was up to version 24, with a substantial number of forks and clones developed during its history.

The big feature of Emacs is its extremely high level of built-in commands, customisation, and extensions, so extensive that those explored here only begin to touch the extraordinarily diverse world that is Emacs. Indeed, Eric Raymond notes "[i]t is a common joke, both among fans and detractors of Emacs, to describe it as an operating system masquerading as an editor".

With extensions, Emacs includes LaTeX-formatted documents, syntax highlighting for major programming and scripting languages, a calculator, a calendar and planner, a text-based adventure game, a web browser, a newsreader and email client, and an ftp client. It provides file difference, merging, and version control, and even a Rogerian psychotherapist. Try doing that with Notepad!

This all said, Emacs is not easy for beginners to learn. The level of customisation and the detailed use of meta and control characters does serve as a barrier to immediate entry.

This tutorial will provide a useable introduction to Emacs.

    1.1.1 Starting Emacs

The default Emacs installation on contemporary Linux systems assumes the use of a graphical user interface. This is obviously not the case with an HPC system, but for those with a home installation you should be aware that 'emacs -nw' from the command line will launch the program without the GUI. If you wish to make this the default you should add it as an alias to .bashrc (e.g., alias emacs='emacs -nw').

Emacs is launched by simply typing 'emacs' on the command line. Commands are invoked by a combination of the Control (Ctrl) key and a character key (C-<chr>).

To "break" from a partially entered command, use C-g.

If an Emacs session crashed recently, M-x recover-session can recover the files that were being edited.

The menu bar can be activated with M-`.

The help files are accessed with C-h and the manual with C-h r.

1.1.3 Files, Buffers, and Windows

Emacs has three main data structures, files, buffers, and windows, which are essential to understand.

A file is the actual file on disk. Strictly, when using Emacs one does not actually edit a file. Rather, the file is copied into a buffer, then edited, and then saved. Buffers can be deleted without deleting the file on disk.

The buffer is a data space within Emacs for editing a copy of the file. Emacs can handle many buffers simultaneously, the effective limit being the maximum buffer size, determined by the integer capacity of the processor and memory (e.g., for 64-bit machines, this maximum buffer size is 2^61 - 2 bytes). A buffer has a name, usually taken from the file from which it has copied the data.

A window is the user's view of a buffer. Not all buffers may be visible to the user at once due to the limits of screen size. A user may split the screen into multiple windows. Windows can be created and deleted without deleting the buffer associated with the window.

Emacs also has a blank line below the mode line to display messages, and for input for prompts from Emacs. This is called the minibuffer, or echo area.

1.1.4 Exploring and Entering Text

Cursor keys can be used to move around the text, along with Page Up and Page Down, if the terminal supports them. However, Emacs aficionados will recommend the use of the control key for speed. Common commands include the following; you may notice a pattern in the command logic:

C-v (move page down), M-v (move page up)
C-p (move previous line), C-n (move next line)
C-f (move forward, one character), C-b (move backward, one character)
M-f (move forward, one word), M-b (move backward, one word)
C-a (move to beginning of a line), C-e (move to end of a line)
M-a (move backward, beginning of a sentence)
M-e (move forward, end of a sentence)
M-{ (move backward, beginning of a paragraph), M-} (move forward, end of a paragraph)
M-< (move to beginning of a text), M-> (move to end of a text)
<backspace> (delete the character just before the cursor)
C-d (delete the character on the cursor)
M-<backspace> (cut the word before the cursor)
M-d (cut the word after the cursor)
C-k (cut from the cursor position to end of line)
M-k (cut to the end of the current sentence)
C-q (prefix command; use when you want to enter a control key into the buffer, e.g., C-q ESC inserts an Escape)

Like the page-up and page-down keys on a standard keyboard, you will discover that Emacs also interprets the Backspace and Delete keys as expected.

A selection can be cut (or 'killed' in Emacs lingo) by marking the beginning of the selected text with C-SPC (space), moving to the end of the selection with standard cursor movements, and entering C-w. Text that has been cut can be pasted ('yanked') by moving the cursor to the appropriate location and entering C-y.

Emacs commands also accept a numeric input for repetition, in the form of C-u, the number of times the command is to be repeated, followed by the command (e.g., C-u 8 C-n moves eight lines down the screen).

1.1.5 File Management

There are only three main file manipulation commands that a user needs to know: how to find a file, how to save a file from a buffer, and how to save all.

The first command is C-x C-f, shorthand for "find-file". First this command prompts for the name of the file. If it is already copied into a buffer it will switch to that buffer. If it is not, it will create a new buffer with the name requested.

For the second command, to save a buffer to a file with the buffer name, use C-x C-s, shorthand for "save-buffer".

The third command is C-x s. This is shorthand for "save-some-buffers" and will cycle through each open buffer and prompt the user for their action (save, don't save, check and maybe save, etc.).

1.1.6 Buffer Management

There are four main commands relating to buffer management that a user needs to know: how to switch to a buffer, how to list existing buffers, how to kill a buffer, and how to read a buffer in read-only mode.

To switch to a buffer, use C-x b. This will prompt for a buffer name, and switch the buffer of the current window to that buffer. It does not change your existing windows. If you type a new name, it will create a new empty buffer.

To list current active buffers, use C-x C-b. This will provide a new window which lists current buffers: name, whether they have been modified, their size, and the file that they are associated with.

To kill a buffer, use C-x k. This will prompt for the buffer name, and then remove the data for that buffer from Emacs, with an opportunity to save it. This does not delete any associated files.

To toggle read-only mode on a buffer, use C-x C-q.

1.1.7 Window Management

Emacs has its own windowing system, consisting of several areas of framed text. The behaviour is similar to a tiling window manager: none of the windows overlap with each other.

Commonly used window commands include:

C-x 0 delete the current window
C-x 1 delete all windows except the selected window
C-x 2 split the current window horizontally
C-x 3 split the current window vertically
C-x ^ make selected window taller
C-x } make selected window wider
C-x { make selected window narrower
C-x + make all windows the same height

A common use is to bring up other documents or menus. For example, with the key sequence C-h one usually calls for help files. If this is followed by k, it will open a new vertical window, and with C-f, it will display the help information for the command C-f (i.e., C-h k C-f). This new window can be closed with C-x 1.

1.1.8 Kill and Yank, Search-Replace, Undo

Emacs is notable for having a very large undo sequence, limited by system resources rather than application resources. This undo sequence is invoked with C-_ (control underscore), or with C-x u. However, it has a special feature: by engaging in a simple navigation command (e.g., C-f) the undo action is pushed to the top of the stack, and therefore the user can undo an undo command.

1.1.9 Other Features

Emacs can make it easier to read C and C++ by colour-coding such files, through the ~/.emacs configuration file, by adding (global-font-lock-mode t).

Programmers also find the feature of being able to run the GNU debugger (GDB) from within Emacs useful as well. The command M-x gdb will start up gdb. If there's a breakpoint, Emacs automatically pulls up the appropriate source file, which gives a better context than the standard GDB.
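A one-line way to append this setting from the shell, assuming the default ~/.emacs location:

echo "(global-font-lock-mode t)" >> ~/.emacs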

1.2 Advanced Scripting

Good knowledge of scripting is required for any advanced Linux user, especially those who find that they have regular tasks, such as the processing of data through a program. Shell scripting is not terribly difficult, although sometimes some austere syntax bugs may prove frustrating; the machine is just doing what you asked it to. Despite their often underrated utility, shell scripts are not the answer to everything. They are not great at resource-intensive tasks (e.g., extensive file operations) where speed is important. They are not recommended for heavy-duty maths operations (use C, C++, or Fortran instead). They are not recommended in situations where data structures, multidimensional arrays (it's not a database!) and port/socket I/O are important.

In the Intermediate Course, we looked at scripting in reference to regular expression utilities, such as sed and the programming language awk, along with some simple examples of using Linux command invocations as variables in a backup script; some sample "for", "while", "do/done" and "until" loops along with simple, optional, ladder, and nested conditionals using "if", "then", "else", "elif" and "fi"; the use of "break" and "continue"; the "case" conditional; and "select" for user input. The implementation of these into PBS job submission scripts was also illustrated. In this Advanced course we will revisit these concepts but with more sophisticated and complex examples. In addition there will be a close look at internal commands and filters, process substitution, functions, arrays, and debugging.

1.2.1 Scripts With Variables

The simplest script is simply one that runs a list of system commands. At the least this saves the time of retyping the sequence each time it is used, and reduces the possibility of error. For example, in the Intermediate course, the following script was recommended to calculate the disk use in a directory. It's a good script, very handy, but how often would you want to type it? Instead, enter it once and keep it. You will recall, of course, that a script starts with an invocation of the shell, followed by commands.

emacs diskuse.sh

#!/bin/bash
du -sk * | sort -nr | cut -f2 | xargs -d "\n" du -sh > diskuse.txt

C-x C-c, y for save

chmod +x diskuse.sh

As described in the Intermediate course, the script runs a disk usage in summary, sorts in order of size, and exports to the file diskuse.txt. The -d "\n" is to ignore spaces in filenames.

Making the script a little more complex, variables are usually better than hard-coded values. There are two potential variables in this script: the wildcard '*' and the exported filename "diskuse.txt". In the former case, we'll keep the wildcard as it allows a certain portability of the script; it can run in any directory it is invoked from. For the latter case however, we'll use the date command so that a history of disk use can be created which can be reviewed for changes. It's also good practise to alert the user when the script is completed and, although it is not always necessary, it is also good practise to cleanly finish any script with 'exit'.

emacs diskuse.sh

#!/bin/bash
DU=diskuse$(date +%Y%m%d).txt
du -sk * | sort -nr | cut -f2 | xargs -d "\n" du -sh > $DU
echo "Disk summary completed and sorted."
exit

C-x C-c, y for save
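A quick usage sketch (the date-stamped filename shown is hypothetical; it follows the $DU pattern above and varies by day):

./diskuse.sh
less diskuse20130820.txt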

1.2.2 Variables and Conditionals

Another example is a script with conditionals as well as variables. A common conditional, and sadly often forgotten, is whether or not a script has the requisite files for input and output specified. If an input file is not specified, a script that performs an action on the file will simply go idle and never complete. If an output file is hard-coded, then the person running the script runs the risk of overwriting a file with the same name, which could be a disaster.

The following script searches through any specified text file for text before and after the ubiquitous email "@" symbol and outputs these as a csv file through use of grep, sed, and sort (for neatness). If the input or the output file are not specified, it exits after echoing the error.

emacs findemails.sh

#!/bin/bash
# Search for email addresses in file, extract, turn into csv with designated file name
INPUT=${1}
OUTPUT=${2}
{
if [ ! -f "$1" -o -z "$2" ]; then
echo "Input file not found, or output file not specified. Exiting script."
exit 0
fi
}
grep --only-matching -E '[.[:alnum:]]+@[.[:alnum:]]+' $INPUT > $OUTPUT
sed -i 's/$/,/g' $OUTPUT
sort -u $OUTPUT -o $OUTPUT
sed -i '{:q;N;s/\n/ /g;t q}' $OUTPUT
echo "Data file extracted to" $OUTPUT
exit

C-x C-c, y for save

chmod +x findemails.sh

Test this file with hidden.txt as the input text and found.csv as the output text. The output will include a final comma on the last line, but this is potentially useful if one wants to run the script with several input files and append to the same output file (simply change the single redirection in the grep statement to a double, appended, redirection).

A serious weakness of the script (so far) is that it will gather any string with the '@' symbol in it, regardless of whether it's a well-formed email address or not. So it's not quite suitable for screen-scraping usenet for email addresses to turn into a spammers list. But it's getting close.
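As a sketch of how the pattern could be tightened (an illustrative assumption, not part of the course script), requiring a dot-separated domain and an alphabetic top-level domain of at least two characters filters out most malformed matches:

grep --only-matching -E '[[:alnum:]._%+-]+@[[:alnum:].-]+\.[[:alpha:]]{2,}' $INPUT > $OUTPUT

This is still not a full validation of the address grammar, but it rejects strings such as "user@host" which lack a domain suffix.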

1.2.3 Reads

The read command simply reads a line from standard input. By applying the -n option it can read in a number of characters, rather than a whole line, so -n1 is "read a single character". The use of the -r option reads the input as raw input, so that the backslash key (for example) doesn't act like a newline escape character, and the -p option displays the prompt. Plus, a -t timeout-in-seconds option can also be added. Combined, these can be used to the effect of "press any key to continue", with a limited timeframe.

Add the following to findemails.sh at the end of the file.

emacs findemails.sh

#!/bin/bash
# Search for email addresses in file, extract, turn into csv with designated file name
..
..
read -t5 -n1 -r -p "Press any key to see the list, sorted and with unique records..."
if [ $? -eq 0 ]; then
echo A key was pressed.
else
echo No key was pressed.
exit 0
fi
less $OUTPUT | \
# Output file, piped through sort and uniq.
sort | uniq
exit

C-x C-c, y for save

1.2.4 Special Characters

Scripts essentially consist of commands, keywords, and special characters. Special characters have meaning beyond their literal meaning (a meta-meaning, if you like). Comments are the most common special meaning.

Any text following a # (with the exception of #!) is a comment and will not be executed. Comments may begin at the beginning of a line, following whitespace, following the end of a command, and even be embedded within a piped command (as above in section 1.2.3).

A comment ends at the end of the line, and as a result a command may not follow a comment on the same line. A quoted or an escaped # in an echo statement does not begin a comment.

Another special character is the command separator, a semicolon, which is used to permit two or more commands on the same line. This is already shown by the various tests in the script (e.g., if [ ! -f "$1" -o -z "$2" ]; then and if [ $? -eq 0 ]; then). Note the space after the semicolon. In contrast, a double semicolon (;;) represents a terminator in a case option, which was encountered in the extract script in the Intermediate course.

..
case $1 in
*.tar.bz2) tar xvjf $1 ;;
*.tar.gz) tar xvzf $1 ;;
*.bz2) bunzip2 $1 ;;
..
..
esac

In contrast, the colon acts as a null command. Whilst this obviously has a variety of uses (e.g., an alternative to the touch command), a really practical advantage is that it comes with a true exit status, and as such it can be used as a placeholder in if/then tests. An example from the Intermediate course:

for i in *.plot.dat; do
if [ -f $i.tmp ]; then
: # do nothing and exit if-then
else
touch $i.tmp
fi
done

The use of the null command as the test at the beginning of a loop will cause it to run endlessly (e.g., while :), as the null command always evaluates as true. Note that the colon is also used as a field separator in /etc/passwd and in the $PATH variable.
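A minimal sketch of that endless form, should you ever actually want it (interrupt with Ctrl-C):

while :
do
    echo "still looping..."
    sleep 1
done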

A dot (.) has multiple special character uses. As a command it sources a filename, importing the code into a script, rather like the #include directive in a C program. This is very useful in situations when multiple scripts use a common data file (e.g., . hidden.txt). As part of a filename, of course, as was shown in the Introductory course, the . represents the current working directory (e.g., cp -r /path/to/directory/ ., and of course .. for the parent directory). A third use for the dot is in regular expressions, matching one character per dot. A final use is multiple dots in sequence in a loop, e.g.,

for a in {1..10}
do
echo -n "$a "
done

Like the dot, the comma operator has multiple uses. Usually it is used to link multiple arithmetic calculations. This is typically used in for loops, with a C-like syntax, e.g.,

for ((a=1, b=1; a<=LIMIT; a++, b++))
do
echo -n "$a-$b "
done

A double-quote on a value does not change variable substitution. This is called partial quoting, sometimes referred to as weak quoting. Using single quotes, however, causes the variable name to be used literally, and no substitution will take place; this is full quoting, sometimes referred to as strong quoting. For example, a strict single-quoted directory listing of ls with a wildcard will only provide files that are expressed by the * symbol (which isn't a very good file name). Compare ls * with ls '*'. This example will also work with double quotes and indeed, double-quotes are generally preferable as they prevent reinterpretation of all special characters except $, `, and \. These are usually the symbols which are wanted in their interpreted mode. As the escape character has a literal interpretation with single quotes, enclosing a single quote within single quotes will not work as expected.

Related to quoting is the use of the backslash (\) to escape single characters. Do not confuse it with the forward slash (/), which has multiple uses: as the separator in pathnames (e.g., /home/train01), but also as the division operator.

In some scripts backticks (`) are used for command substitution, where the output of a command can be assigned to a variable. Whilst this is not a POSIX standard, it does exist for historical reasons. Nesting commands with backticks also requires escape characters; the deeper the nesting, the more escape characters are required (e.g., echo `echo \`echo \\\`pwd\\\`\``). The preferred and POSIX-standard method is to use the dollar sign and parentheses, e.g., echo "Hello, $(whoami)." rather than echo "Hello, `whoami`."
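A minimal sketch of the difference between the two quoting styles (the variable name is illustrative):

#!/bin/bash
name=$(whoami)
echo "Hello, $name"   # weak (partial) quoting: substitution occurs
echo 'Hello, $name'   # strong (full) quoting: prints Hello, $name literally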

2.0 Advanced HPC

2.1 Computer System Architectures

As explained in the first, introductory, course, "high-performance computing (HPC) is the use of supercomputers and clusters to solve advanced computation problems". All supercomputers ("a nebulous term for a computer that is at the frontline of current processing capacity") in contemporary times use parallel computing, "the submission of jobs or processes over one or more processors and by splitting up the task between them".

It is possible to illustrate the degree of parallelisation by using Flynn's Taxonomy of Computer Systems (1966), where each process is considered as the execution of a pool of instructions (instruction stream) on a pool of data (data stream). From this there are four basic possibilities:

Single Instruction Stream, Single Data Stream (SISD)
Single Instruction Stream, Multiple Data Streams (SIMD)
Multiple Instruction Streams, Single Data Stream (MISD)
Multiple Instruction Streams, Multiple Data Streams (MIMD)

2.1.1 Single Instruction Stream, Single Data Stream (SISD)

(Image from Oracle Essentials, 4th edition, O'Reilly Media, 2007)

This is the simplest and, until recently, the most common processor architecture on desktop computer systems. Also known as a uniprocessor system, it offers a single instruction stream and a single data stream. Uniprocessors could, however, simulate or include concurrency through a number of different methods:

a) It is possible for a uniprocessor system to run processes concurrently by switching between one and another.

b) Superscalar instruction-level parallelism can be used on uniprocessors. More than one instruction during a clock cycle is simultaneously dispatched to different functional units on the processor.

c) Instruction prefetch, where an instruction is requested from main memory before it is actually needed and placed in a cache. This often also includes a prediction algorithm of what the instruction will be.

d) Pipelines, on the instruction level or the graphics level, can also serve as an example of concurrent activity. An instruction pipeline (e.g., RISC) allows multiple instructions on the same circuitry by dividing the task into stages. A graphics pipeline implements different stages of rendering operations to different arithmetic units.

2.1.2 Single Instruction Stream, Multiple Data Streams (SIMD)

SIMD architecture represents a situation where a single processor performs the same instruction on multiple data streams. This commonly occurs in contemporary multimedia processors, for example the MMX instruction set from the 1990s, which led to Motorola's PowerPC AltiVec, and in more contemporary times the AVX (Advanced Vector Extensions) instruction set used in Intel Sandy Bridge processors and AMD's Bulldozer processor. These developments have primarily been orientated towards real-time graphics, using short vectors. Contemporary supercomputers are invariably MIMD clusters which can implement short-vector SIMD instructions.

SIMD was used especially in the 1970s, notably on the various Cray systems. For example, the Cray-1 (1976) had eight "vector registers," which held sixty-four 64-bit words each (long vectors), with instructions applied to the registers. Pipeline parallelism was used to implement vector instructions, with separate pipelines for different instructions, which themselves could be run in batch and pipelined (vector chaining). As a result the Cray-1 could have a peak performance of 240 mflops; extraordinary for the day, and even acceptable in the early 2000s.

SIMD is also known as vector processing or data parallelism, in comparison to a regular scalar (SISD) CPU. SIMD lines up a row of scalar data (of uniform type) as a vector and operates on it as a unit. For example, inverting an RGB picture to produce its negative, or altering its brightness, etc. Without SIMD each pixel would have to be fetched to memory, the instruction applied to it, and the result returned. With SIMD the same instruction is applied to all the data, depending on the availability of cores, i.e., get n pixels, apply instruction, return.

The main disadvantages of SIMD, within the limitations of the process itself, are that it requires additional registers, power consumption, and heat.

2.1.3 Multiple Instruction Streams, Single Data Stream (MISD)

Multiple Instruction, Single Data (MISD) occurs when different operations are performed on the same data. This is quite rare and indeed debateable, as it is reasonable to claim that once an instruction has been performed on the data, it's not the same data anymore. If one doesn't take this definition, and allows for a variety of instructions to be applied to the same data which can change, then various pipeline architectures can be considered MISD.

Systolic arrays are another form of MISD. They are different to pipelines because they have a non-linear array structure, they have multidirectional data flow, and each processing element may even have its own local memory. In this situation a matrix pipe network arrangement of processing units computes data and stores it independently of each other. Matrix multiplication is an example of such an array in algorithmic form, where one matrix is introduced one row at a time from the top of the array, whereas another matrix is introduced one column at a time.

MISD machines are rare; the Cisco PXF processor is an example. They can be fast and scalable, as they do operate in parallel, but they are really difficult to build.

2.1.4 Multiple Instruction Streams, Multiple Data Streams (MIMD)

Multiple Instruction, Multiple Data (MIMD) machines have independent and asynchronous processes that can operate on a number of different data streams. They are now the mainstream in contemporary computer systems and thus can be further differentiated between multiprocessor computers and their extension, multicomputer multiprocessors. As the name clearly indicates, the former refers to single machines which have multiple processors, and the latter to a cluster of these machines acting as a single entity.

Multiprocessor systems can be differentiated between shared memory and distributed memory. Shared memory systems have all processors connected to a single pool of global memory (whether by hardware or by software). This may be easier to program, but it's harder to achieve scalability. Such an architecture is quite common in single-system-unit multiprocessor machines.

With distributed memory systems, each processor has its own memory. Finally, another combination is distributed shared memory, where the (physically separate) memories can be addressed as one (logically shared) address space. A variant combined method is to have shared memory within each multiprocessor node, and distributed memory between them.

2.2 Processors and Cores

2.2.1 Uni- and Multi-Processors

A further distinction needs to be made between processors and cores. A processor is a physical device that accepts data as input and provides results as output. A uniprocessor system has one such device, although the definitions can become ambiguous. In some uniprocessor systems it is possible that there is more than one, but the entities engage in separate functions. For example, a computer system that has one central processing unit may also have a co-processor for mathematic functions and a graphics processor on a separate card. Is that system uniprocessor? Arguably not, as the co-processor will be seen as belonging to the same entity as the CPU, while the graphics processor will have different memory and system I/O, and will be dealing with different peripherals. In contrast, a multiprocessor system does share memory, system I/O, and peripherals. But then the debate becomes murky with the distinction between shared and distributed memory discussed above.

2.2.2 Uni- and Multi-Core

In addition to the distinction between uniprocessor and multiprocessor there is also the distinction between unicore and multicore processors. A unicore processor carries out the usual functions of a CPU, according to the instruction set: data handling instructions (set register values, move data, read and write), arithmetic and logic functions (add, subtract, multiply, divide, bitwise operations for conjunction and disjunction, negate, compare), and control-flow functions (conditionally branch to another section of a program, indirectly branch, and return). A multicore processor carries out the same functions, but with independent central processing units (note lower case) called 'cores'. Manufacturers integrate the multiple cores onto a single integrated circuit die, or onto multiple dies in a single chip package.

In terms of theoretical architecture, a uniprocessor system could be multicore, and a multiprocessor system could be unicore. In practise the most common contemporary architecture is multiprocessor and multicore. The number of cores is represented by a prefix. For example, a dual-core processor has two cores (e.g. AMD Phenom II X2, Intel Core Duo), a quad-core processor contains four cores (e.g. AMD Phenom II X4, Intel i3, i5, and i7), a hexa-core processor contains six cores (e.g. AMD Phenom II X6, Intel Core i7 Extreme Edition 980X), an octo-core or octa-core processor contains eight cores (e.g. Intel Xeon E7-2820, AMD FX-8350), etc.

2.2.3 Uni- and Multi-Threading

In addition to the distinctions between processors and cores, whether uni or multi, there is also the question of threads. An execution thread is the smallest processing unit in an operating system. A thread is typically contained inside a process. Multiple threads can exist within the same process and share resources. On a uniprocessor, multithreading generally occurs by switching between different threads, engaging in time-division multiplexing with the processor switching between the different threads, which may give the appearance that the task is happening at the same time. On a multiprocessor or multicore system, threads become truly concurrent, with every processor or core executing a separate thread simultaneously.

2.2.4 Why Is It A Multicore Future?

Ideally, don't we want clusters of multicore multiprocessors with multithreaded instructions? Of course we do; but think of the heat that this generates, and think of the potential for race conditions (e.g., deadlocks, data integrity issues, resource conflicts, interleaved execution issues). These are all fundamental problems with computer architecture.

One of the reasons that multicore multiprocessor clusters have become popular is that clock rate has pretty much stalled. Apart from the physical reasons, it is uneconomical: it's simply not worth the cost of increasing the frequency of the clock rate in terms of the power consumed and the heat dissipated. Intel calls the rate/heat trade-off a "fundamental theorem of multicore processors".

New multicore systems are being developed all the time. Using RISC CPUs, Tilera released 64-core processors in 2007 and, in 2009, a one-hundred-core processor. In 2012 Tilera founder Dr. Agarwal was leading a new MIT effort dubbed The Angstrom Project. It is one of four DARPA-funded efforts aimed at building exascale supercomputers. The goal is to design a chip with 1,000 cores.

2.3 Parallel Processing Performance

2.3.1 Speedup and Locks

Parallel programming and multicore systems should mean better performance. This can be expressed as a ratio called speedup:

Speedup (p) = Time (serial) / Time (parallel)

This is varied by the number of processors, S = T(1)/T(p), where T(p) represents the execution time taken by the program running on p processors, and T(1) represents the time taken by the best serial implementation of the application measured on one processor.

Linear, or ideal, speedup is when S(p) = p; for example, double the processors resulting in double the speedup.
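A quick worked example with hypothetical numbers: a job that takes 100 seconds in its best serial implementation and 30 seconds on four processors achieves

S(4) = T(1)/T(4) = 100/30 ≈ 3.3

which falls somewhat short of the linear ideal of S(4) = 4 (an efficiency of about 83%).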

However, parallel programming is hard. More complexity means more bugs. Correctness in parallelisation requires synchronisation (locking). Synchronisation and atomic operations cause loss of performance and communication latency. A probable issue in parallel computing is deadlocks, where two or more competing actions are each waiting for the other to finish, and thus neither ever does. An apocryphal story of a Kansas railroad statute vividly illustrates the problem of a deadlock:

"When two trains approach each other at a crossing, both shall come to a full stop and neither shall start up again until the other has gone."

(A similar example is a livelock; the states of the processes involved in the livelock constantly change with regard to one another, none progressing.)

Locks are currently manually inserted in typical programming languages; without locks, programs can be put in an inconsistent state. Multiple locks in different places and orders can lead to deadlocks. Manual lock insertion is error-prone, tedious, and difficult to maintain. Does the programmer know what parts of a program will benefit from parallelisation? To ensure that parallel execution is safe, a task's effects must not interfere with the execution of another task.

2.3.2 Amdahl's Law

An example of an embarrassingly parallel task is image rendering, where each pixel is rendered independently. Such tasks are often called "pleasingly parallel". To give an example using the R programming language, the SNOW (Simple Network of Workstations) package allows for embarrassingly parallel computations (yes, we have this installed).

Whilst originally expressed by Gene Amdahl in 1967, it wasn't until over twenty years later, in 1988, that an alternative by John L. Gustafson and Edwin H. Barsis was proposed. Gustafson noted that Amdahl's Law assumed a computation problem of fixed data set size. Gustafson and Barsis observed that programmers tend to set the size of their computational problems according to the available equipment; therefore, as faster and more parallel equipment becomes available, larger problems can be solved. Thus scaled speedup occurs: although Amdahl's law is correct in a fixed sense, it can be circumvented in practise by increasing the scale of the problem.

If the problem size is allowed to grow with P, then the sequential fraction of the workload becomes less and less important. A common metaphor is based on driving (computation), time, and distance (computational task). Under Amdahl's Law, if a car has been travelling 40 km/h for an hour and needs to reach a point 80 km from the point of origin, no matter how fast the vehicle travels it can only approach a maximum average of 80 km/h before reaching the 80 km point, even if it travelled at infinite speed, as the first hour has already passed. With the Gustafson-Barsis Law, it doesn't matter that the first hour was at a plodding 40 km/h; the average can be increased indefinitely, given enough time and distance. Just make the problem bigger!
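For reference, the standard textbook formulations of the two laws: with p processors and f the parallelisable fraction of the workload, Amdahl's Law gives

S(p) = 1 / ((1 - f) + f/p)

while the Gustafson-Barsis Law, with s the serial fraction of the scaled (parallel) execution time, gives the scaled speedup

S(p) = p - s(p - 1)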


(Image from Wikipedia)

3.0 Introductory MPI Programming

3.1 The Story of the Message Passing Interface (MPI) and OpenMPI

The Message Passing Interface (MPI) is a widely used standard, initially designed by academia and industry starting in 1991, to run on parallel computers. The goal of the group was to ensure source-code portability, and as a result they produced a standard that defines an interface and specific functionality. As a standard, syntax and semantics are defined for core library routines which allow programmers to write message-passing programs in Fortran or C.

Some implementations of these core library routine specifications are available as free and open-source software, such as Open MPI. Open MPI combined three previous well-known implementations, namely FT-MPI from the University of Tennessee, LA-MPI from Los Alamos National Laboratory, and LAM/MPI from Indiana University, each of which excelled in particular areas, with additional contributions from the PACX-MPI team at the University of Stuttgart. Open MPI combines the quality peer review of a scientific free and open-source software project, and has been used in many of the world's top-ranking supercomputers.

Major milestones in the development of MPI include the following:

* 1991: Decision to initiate Standards for Message Passing in a Distributed Memory Environment.
* 1992: Workshop on the above held.
* 1992: Preliminary draft specification released for MPI.
* 1994: MPI-1. A specification, not an implementation; a library, not a language. Designed for C and Fortran 77.
* 1997: MPI-2. Extends the message-passing model to include parallel I/O, includes C++/Fortran 90 bindings, interaction with threads, and more.
* 2007: MPI Forum reconvened; MPI-3 development begins.

The standard utilised in this course is MPI-2.

The message-passing paradigm, as it is called, is attractive as it is portable on a wide variety of distributed architectures, including distributed and shared-memory multiprocessor systems, networks of workstations, or even potentially a combination thereof. Although originally designed for distributed architectures (unicore workstations connected by a common network), which were popular at the time the standard was initiated, shared-memory symmetric multiprocessing systems connected over networks created hybrid distributed/shared memory systems; that is, each system has shared memory within each machine, but not memory distributed between machines, with data distributed over the network communications. The MPI library standards and implementations were modified to handle both types of memory architectures.

(Image from Lawrence Livermore National Laboratory, n.d.)

Using MPI is a matter of some common sense. It is the only message-passing library which can really be considered a standard. It is supported on virtually all HPC platforms, and has replaced all previous message-passing libraries, such as PVM, PARMACS, EUI, NX, and Chameleon, to name a few predecessors. Programmers like it because there is no need to modify their source code when it is ported to a different system, as long as that system also supports the MPI standard (there may be other reasons, however, to modify the code!). MPI has excellent performance, with vendors able to exploit hardware features for optimisation.

The core principle is that many processors should be able to cooperate to solve a problem by passing messages to each other through a common communications network. The flexible architecture does overcome serial bottlenecks, but it also does require explicit programmer effort (the "questing beast" of automatic parallelisation remains somewhat elusive). The programmer is responsible for identifying opportunities for parallelism and implementing algorithms for parallelisation using MPI.

MPI programming is best where there are not too many small communications, and where a coarse-level break-up of tasks or data is possible.

"In cases where the data layout is fairly simple, and the communications patterns are regular, this [data-parallel approach] is an excellent approach. However, when dealing with dynamic, irregular data structures, data parallel programming can be difficult, and the end result may be a program with sub-optimal performance."

(Warren, Michael S., and John K. Salmon. "A portable parallel particle program." Computer Physics Communications 87.1 (1995): 266-290.)

3.2 Unravelling A Sample MPI Program and OpenMPI Wrappers

For the purposes of this course, copy a number of files to the home directory:

cd ~
cp -r /common/advcourse .

In the Intermediate course, an example mpi-helloworld.c program was illustrated with an associated PBS script. Let's recall what that included, and the explanation in the C program and in the PBS script that launched it.

This is the text for mpi-helloworld.c:

#include <stdio.h>
    Standard include for C programs.
#include "mpi.h"
    Standard include for MPI programs.
int main( argc, argv )
    Beginning of the main function; establish arguments and vector. To incorporate input files, argc (argument count) is the number of arguments, and argv (argument vector) is an array of characters representing the arguments.
int argc;
    Argument count is an integer.
char **argv;
    Argument vector is a string of characters.
{
int rank, size;
    Set rank and size from the inputs.
MPI_Init( &argc, &argv );
    Initialises the MPI execution environment. The input parameter argc is a pointer to the number of arguments, and argv is a pointer to the argument vector.
MPI_Comm_size( MPI_COMM_WORLD, &size );
    Determines the size of the group associated with a communicator. The input parameter is simply a handle (containing all of the processes); the output parameter, size, is an integer of the number of processes in the group.
MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    As above, except rank is the rank of the calling process.
printf( "Hello world from process %d of %d\n", rank, size );
    Printing Hello World from each process.
MPI_Finalize();
    Terminates the MPI execution environment.
return 0;
    A successful program finishes!
}

It is compiled into an executable with the command:

mpicc -o mpi-helloworld mpi-helloworld.c

This is the text for the batch file pbs-helloworld, which is launched with qsub and can be reviewed with less:

qsub pbs-helloworld
less pbs-helloworld

The sample "hello world" program should be understandable to any C programmer (indeed, any programmer) and, with the MPI-specific annotations, it should be clear what is going on. It is the same as any other program, but with a few MPI-specific additions. For example, one can check the PGI mpi.h with the following:

less /usr/local/openmpi/1.6.3-pgi/include/mpi.h

MPI compiler wrappers are used to compile MPI programs; they perform basic error checking, integrate the MPI include files, link to the MPI libraries, and pass switches to the underlying compiler. The wrappers are as follows:

mpif77: Open MPI Fortran 77 wrapper compiler
mpif90: Open MPI Fortran 90 wrapper compiler
mpicc: Open MPI C wrapper compiler
mpicxx: Open MPI C++ wrapper compiler

Open MPI is comprised of three software layers: OPAL (Open Portable Access Layer), ORTE (Open Run-Time Environment), and OMPI (Open MPI). Each layer provides the following wrapper compilers:

OPAL: opalcc and opalc++
ORTE: ortecc and ortec++
OMPI: mpicc, mpic++, mpicxx, mpiCC (only on systems with case-sensitive file systems), mpif77, and mpif90. Note that mpic++, mpicxx, and mpiCC all invoke the same underlying C++ compiler with the same options. All are provided as compatibility with other MPI implementations.

The distinctions between Fortran and C routines in MPI are fairly minimal. All the names of MPI routines and constants in both C and Fortran begin with the same MPI_ prefix. The main differences are:

* The include files are slightly different: in C, mpi.h; in Fortran, mpif.h.
* Fortran MPI routine names are in uppercase (e.g., MPI_INIT), whereas C-compatible MPI routine names are upper and lowercase (e.g., MPI_Init).
* The arguments to MPI_Init are different; an MPI C program can take advantage of command-line arguments.
* The arguments in MPI C functions are more strongly typed than they are in Fortran, resulting in specific types in C (e.g., MPI_Comm, MPI_Datatype), whereas MPI Fortran uses integers.
* Error codes are returned in a separate argument for Fortran, as opposed to the return value for C functions.

Consider the mpi-helloworld program in Fortran (mpi-helloworld.f90):

! Fortran MPI Hello World
    Comment.
program hello
    Program name.
include 'mpif.h'
    Include file for MPI.
integer rank, size, ierror, tag, status(MPI_STATUS_SIZE)
    Variables.
call MPI_INIT(ierror)
    Start MPI.
call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)
    Number of processors.
call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
    Process IDs.
print*, 'node', rank, ': Hello world'
    Each processor prints "Hello World".
call MPI_FINALIZE(ierror)
    Finish MPI.
end

Compile this with mpif90 (the Fortran 90 wrapper) and submit with qsub:

mpif90 mpi-helloworld.f90 -o mpi-helloworld
qsub pbs-helloworld

The mpi-helloworld program is an example of using MPI in a manner that is similar to a Single Instruction Multiple Data architecture. The same instruction stream (print hello world) is used across multiple processors. It is perhaps best described as Single Program Multiple Data, as it obtains the effect of running the same program multiple times, or, if you like, different programs with the same instructions.

3.3 MPI's Introductory Routines

3.3.1 MPI_Init

This routine initialises the MPI execution environment; it is here that the world communicator, MPI_COMM_WORLD, is created. Communicators are considered analogous to the mail or telephone system; every message travels in the communicator, with every message-passing call having a communicator argument.

The input parameters are argc, a pointer to the number of arguments, and argv, the argument vector. These are for C and C++ only. The Fortran-only output parameter is IERROR, as integer.

The syntax for MPI_Init() is as follows for C, Fortran, and C++.

C Syntax
#include <mpi.h>
int MPI_Init(int *argc, char ***argv)

Fortran Syntax
INCLUDE 'mpif.h'
MPI_INIT(IERROR)
INTEGER IERROR

C++ Syntax
#include <mpi.h>
void MPI::Init(int& argc, char**& argv)
void MPI::Init()

3.3.2 MPI_Comm_size

This routine determines the size of the group associated with a communicator. The input parameter is comm, the handle for the communicator; the output parameters are size, the number of processes in the group of comm (integer), and the Fortran-only IERROR, providing the error status as integer.

A communicator is effectively a collection of processes that can send messages to each other. Within programs, many communications also depend on the number of processes executing the program.

The syntax for MPI_Comm_size() is as follows for C, Fortran, and C++.

C Syntax
#include <mpi.h>
int MPI_Comm_size(MPI_Comm comm, int *size)

Fortran Syntax
INCLUDE 'mpif.h'
MPI_COMM_SIZE(COMM, SIZE, IERROR)
INTEGER COMM, SIZE, IERROR

C++ Syntax
#include <mpi.h>
int Comm::Get_size() const

3.3.3 MPI_Comm_rank

This routine determines the rank of the calling process within the communicator. The input parameter is comm, the communicator handle; the output parameters are rank, the rank of the calling process (integer), and IERROR, the error status for Fortran. It is common for MPI programs to be written in a manager/worker model, where one process (typically rank 0) acts in a supervisory role, and the other processes act in a computational role.

The syntax for MPI_Comm_rank() is as follows for C, Fortran, and C++.

C Syntax
#include <mpi.h>
int MPI_Comm_rank(MPI_Comm comm, int *rank)

Fortran Syntax
INCLUDE 'mpif.h'
MPI_COMM_RANK(COMM, RANK, IERROR)
INTEGER COMM, RANK, IERROR

C++ Syntax
#include <mpi.h>
int Comm::Get_rank() const
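A minimal sketch of the manager/worker branching described above (illustrative only, not taken from the course materials):

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        /* manager: typically distributes work and collects results */
        printf("manager running on rank 0\n");
    } else {
        /* workers: perform their share of the computation */
        printf("worker running on rank %d\n", rank);
    }
    MPI_Finalize();
    return 0;
}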

3.3.4 MPI_Send

This routine performs a basic send of a message to a nominated destination, in a way appropriate to the standard communication mode of the implementation. The message-passing system takes care of delivery. However, this "appropriate way" means stating various characteristics of the message, just like the post or email: who is sending it, where it's being sent to, what it's about, and so forth.

The input parameters include buf, the initial address of the send buffer; count, an integer of the number of elements; datatype, a handle of the datatype of each send buffer element; dest, an integer rank of the destination; tag, an integer message tag; and comm, the communicator handle. The only output parameter is Fortran's IERROR.

If MPI_Comm represents a community of addressable space, then MPI_Send and MPI_Recv are the envelope, addressing information, and the data. In order for a message to be successfully communicated, the system must append some information to the data that the application program wishes to transmit. This includes the rank of the sender, the receiver, a tag, and the communicator. The source is used to differentiate messages received from different sources; the tag, to distinguish messages from a single process.

The syntax for MPI_Send() is as follows for C, Fortran, and C++.

C Syntax
#include <mpi.h>
int MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)

Fortran Syntax
INCLUDE 'mpif.h'
MPI_SEND(BUF, COUNT, DATATYPE, DEST, TAG, COMM, IERROR)
BUF(*)
INTEGER COUNT, DATATYPE, DEST, TAG, COMM, IERROR

C++ Syntax
#include <mpi.h>
void Comm::Send(const void* buf, int count, const Datatype& datatype, int dest, int tag) const

3.3.5 MPI_Recv

This routine performs a basic receive of a message. The output parameters are buf, the initial address of the receive buffer, and status, the status object; the input parameters include count, the maximum number of elements to receive; datatype, the datatype handle; source, the integer rank of the source; tag, the message tag; and comm, the communicator handle.

The syntax for MPI_Recv() is as follows for C, Fortran, and C++.

C Syntax
#include <mpi.h>
int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status)

Fortran Syntax
INCLUDE 'mpif.h'
MPI_RECV(BUF, COUNT, DATATYPE, SOURCE, TAG, COMM, STATUS, IERROR)
BUF(*)
INTEGER COUNT, DATATYPE, SOURCE, TAG, COMM, STATUS(MPI_STATUS_SIZE), IERROR

C++ Syntax
#include <mpi.h>
void Comm::Recv(void* buf, int count, const Datatype& datatype, int source, int tag) const

The importance of MPI_Send() and MPI_Recv() relates to the nature of process variables, which remain private unless passed by MPI in the communications world.

3.3.6 MPI_Finalize

This routine terminates the MPI execution environment; it should be the last MPI routine called in every MPI program.

The syntax for MPI_Finalize() is as follows for C, Fortran, and C++.

C Syntax
#include <mpi.h>
int MPI_Finalize()

Fortran Syntax
INCLUDE 'mpif.h'
MPI_FINALIZE(IERROR)
INTEGER IERROR

C++ Syntax
#include <mpi.h>
void Finalize()

Whilst the previous mpi-helloworld.c and mpi-helloworld.f90 examples illustrated the use of four of the six core routines of MPI, they did not illustrate the use of the MPI_Recv and MPI_Send routines. The following program, of no greater complexity, does this. There is no need to provide additional explanation of what is happening, as this should be discerned from the routine explanations given. Each program should be compiled with mpicc and mpif90 respectively, submitted with qsub, and the results checked.

Compile with mpicc -o mpi-sendrecv mpi-sendrecv.c, submit with qsub pbs-sendrecv.

#include <stdio.h>
#include "mpi.h"

int main(argc,argv)
int argc;
char *argv[];
{
int myid, numprocs;
int tag, source, destination, count;
int buffer;
MPI_Status status;

MPI_Init(&argc,&argv);
MPI_Comm_size(MPI_COMM_WORLD,&numprocs);
MPI_Comm_rank(MPI_COMM_WORLD,&myid);
tag=1;
source=0;
destination=1;
count=1;
if(myid == source){
buffer=1234;
MPI_Send(&buffer,count,MPI_INT,destination,tag,MPI_COMM_WORLD);
printf("processor %d sent %d\n",myid,buffer);
}
if(myid == destination){
MPI_Recv(&buffer,count,MPI_INT,source,tag,MPI_COMM_WORLD,&status);
printf("processor %d received %d\n",myid,buffer);
}
MPI_Finalize();
return 0;
}

The mpi-sendrecv.f90 program: compile with mpif90 mpi-sendrecv.f90 -o mpi-sendrecv, submit with qsub pbs-sendrecv.

program sendrecv
include "mpif.h"
integer myid, ierr, numprocs
integer tag, source, destination, count
integer buffer
integer status(MPI_STATUS_SIZE)
call MPI_INIT( ierr )
call MPI_COMM_RANK( MPI_COMM_WORLD, myid, ierr )
call MPI_COMM_SIZE( MPI_COMM_WORLD, numprocs, ierr )
tag=1
source=0
destination=1
count=1
if(myid .eq. source)then
buffer=1234
Call MPI_Send(buffer, count, MPI_INTEGER, destination,&
tag, MPI_COMM_WORLD, ierr)
write(*,*)"processor ",myid," sent ",buffer
endif
if(myid .eq. destination)then
Call MPI_Recv(buffer, count, MPI_INTEGER, source,&
tag, MPI_COMM_WORLD, status, ierr)
write(*,*)"processor ",myid," received ",buffer
endif
call MPI_FINALIZE(ierr)
stop
end

The following provides a summary of the use of the six core routines in C and Fortran.

Include header files:
C:       #include "mpi.h"
Fortran: INCLUDE 'mpif.h'

Initialise MPI:
C:       int MPI_Init(int *argc, char ***argv)
Fortran: INTEGER IERROR
         CALL MPI_INIT(IERROR)

Determine number of processes within a communicator:
C:       int MPI_Comm_size(MPI_Comm comm, int *size)
Fortran: INTEGER COMM, SIZE, IERROR
         CALL MPI_COMM_SIZE(COMM, SIZE, IERROR)

Determine processor rank within a communicator:
C:       int MPI_Comm_rank(MPI_Comm comm, int *rank)
Fortran: INTEGER COMM, RANK, IERROR
         CALL MPI_COMM_RANK(COMM, RANK, IERROR)

Send a message:
C:       int MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)
Fortran: BUF(*)
         INTEGER COUNT, DATATYPE, DEST, TAG, COMM, IERROR
         CALL MPI_SEND(BUF, COUNT, DATATYPE, DEST, TAG, COMM, IERROR)

Receive a message:
C:       int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status)
Fortran: BUF(*)
         INTEGER COUNT, DATATYPE, SOURCE, TAG, COMM, STATUS, IERROR
         CALL MPI_RECV(BUF, COUNT, DATATYPE, SOURCE, TAG, COMM, STATUS, IERROR)

Exit MPI:
C:       int MPI_Finalize()
Fortran: CALL MPI_FINALIZE(IERROR)

4.0 Intermediate MPI Programming

4.1 MPI Datatypes

Like C and Fortran (and indeed, almost every programming language that comes to mind), MPI has datatypes, a classification for identifying different types of data (such as real, int, float, char, etc.). In the introductory MPI program there wasn't really much complexity in these types; as one delves deeper, however, more will be encountered. Forewarned is forearmed, so the following provides a handy comparison chart between MPI, C, and Fortran.


MPI DATATYPE             FORTRAN DATATYPE

    MPI_INTEGER INTEGER

    MPI_REAL REAL

    MPI_DOUBLE_PRECISION DOUBLE PRECISION

    MPI_COMPLEX COMPLEX

    MPI_LOGICAL LOGICAL

    MPI_CHARACTER CHARACTER

    MPI_BYTE

    MPI_PACKED

MPI DATATYPE             C Datatype

    MPI_CHAR signed char

    MPI_SHORT signed short int

    MPI_LONG signed long int

    MPI_UNSIGNED_CHAR unsigned char

    MPI_UNSIGNED_SHORT unsigned short int

    MPI_UNSIGNED unsigned int

    MPI_UNSIGNED_LONG unsigned long int

    MPI_FLOAT float

    MPI_DOUBLE double

    MPI_LONG_DOUBLE long double

    MPI_BYTE

    MPI_PACKED

4.2 Intermediate Routines

In the Intermediate course one of the last exercises involved the submission of mpi-ping and mpi-pong. The first simply tested whether a connection existed between multiple processors. The second program tested different packet sizes, asynchronous and bidirectional. In this example there is ping_pong.c, from the University of Edinburgh Parallel Computing Centre, and a Fortran 90 version of the same from Colorado University. The usual methods can be used for compiling and submitting these programs, e.g.,

mpicc -o mpi-pingpong mpi-pingpong.c or mpif90 mpi-pingpong.f90 -o mpi-pingpong and
qsub pbs-pingpong

However for this course the interesting component is what is inside the code in terms of the MPI routines. As previously there are the mpi.h include files, the initialisation routines, the establishment of a communications world and so forth. In addition however there are some new routines, specifically MPI_Wtime, MPI_Abort, and MPI_Ssend.

4.2.1 MPI_Wtime()

MPI_Wtime() returns the elapsed wall-clock time in seconds on the calling processor. The syntax for MPI_Wtime() is as follows for C, Fortran, and C++.


C Syntax
#include <mpi.h>
double MPI_Wtime()

Fortran Syntax
INCLUDE 'mpif.h'
DOUBLE PRECISION MPI_WTIME()

C++ Syntax
#include <mpi.h>
double MPI::Wtime()
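As a hedged illustration (not from the course materials), the following complete program times a placeholder workload with two calls to MPI_Wtime() and prints the difference:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    double start, end;
    long i, sum = 0;

    MPI_Init(&argc, &argv);
    start = MPI_Wtime();            /* wall-clock time, in seconds */
    for (i = 0; i < 10000000; i++)  /* placeholder workload */
        sum += i;
    end = MPI_Wtime();
    printf("elapsed: %f seconds (sum %ld)\n", end - start, sum);
    MPI_Finalize();
    return 0;
}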

4.2.2 MPI_Abort()

MPI_Abort() terminates all processes associated with the given communicator. The syntax for MPI_Abort() is as follows for C, Fortran, and C++.


C Syntax
#include <mpi.h>
int MPI_Abort(MPI_Comm comm, int errorcode)

Fortran Syntax
INCLUDE 'mpif.h'
MPI_ABORT(COMM, ERRORCODE, IERROR)
INTEGER COMM, ERRORCODE, IERROR

C++ Syntax
#include <mpi.h>
void Comm::Abort(int errorcode)
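A minimal sketch of MPI_Abort() in use (assuming MPI_Init has already been called and stdio.h is included): if a requirement of at least two processes is not met, every process in the communicator is terminated with a non-zero error code.

int numprocs;
MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
if (numprocs < 2) {
    fprintf(stderr, "this example requires at least two processes\n");
    MPI_Abort(MPI_COMM_WORLD, 1);  /* tears down all processes in the communicator */
}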

4.2.3 MPI_Ssend()

MPI_Ssend() is a synchronous blocking send: it does not complete until the matching receive has started, so it is the safest choice when confirmation of receipt matters. Otherwise, MPI_Send is the more flexible option.

The available input parameters include buf, the initial address of the send buffer; count, a non-negative integer of the number of elements in the send buffer; datatype, the datatype of each send buffer element as a handle; dest, the integer rank of the destination; tag, a message tag represented as an integer; and comm, the communicator handle. The only output parameter is Fortran's IERROR.

The syntax for MPI_Ssend() is as follows for C, Fortran, and C++.

C Syntax
#include <mpi.h>
int MPI_Ssend(void *buf, int count, MPI_Datatype datatype, int dest,
int tag, MPI_Comm comm)

Fortran Syntax
INCLUDE 'mpif.h'
MPI_SSEND(BUF, COUNT, DATATYPE, DEST, TAG, COMM, IERROR)
BUF(*)
INTEGER COUNT, DATATYPE, DEST, TAG, COMM, IERROR

C++ Syntax
#include <mpi.h>
void Comm::Ssend(const void* buf, int count, const Datatype&
datatype, int dest, int tag) const

4.2.4 Other Send and Recv Routines

Although not used in the specific program just illustrated, there are actually a number of other send options for Open MPI. These include MPI_Bsend, MPI_Rsend, MPI_Isend, MPI_Ibsend, MPI_Issend, and MPI_Irsend. These are worth mentioning in summary as follows.

MPI_Isend begins a non-blocking send, which indicates to the system that it may start copying data out of the send buffer. The send request can be determined as being completed by calling MPI_Wait, MPI_Waitany, MPI_Test, or MPI_Testany with the request returned by this function. The send buffer cannot be reused until one of these calls succeeds, or an MPI_Request_free indicates that the buffer is available.

The syntax for MPI_Isend() is as follows for C, Fortran, and C++.

C Syntax
#include <mpi.h>
int MPI_Isend(void *buf, int count, MPI_Datatype datatype, int dest,
int tag, MPI_Comm comm, MPI_Request *request)

Fortran Syntax
INCLUDE 'mpif.h'
MPI_ISEND(BUF, COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERROR)
BUF(*)
INTEGER COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERROR

C++ Syntax
#include <mpi.h>
Request Comm::Isend(const void* buf, int count, const
Datatype& datatype, int dest, int tag) const

MPI_Irecv begins a non-blocking receive. The syntax for MPI_Irecv() is as follows for C and Fortran.

C Syntax
#include <mpi.h>
int MPI_Irecv(void *buf, int count, MPI_Datatype datatype, int source,
int tag, MPI_Comm comm, MPI_Request *request)

Fortran Syntax
INCLUDE 'mpif.h'
MPI_IRECV(BUF, COUNT, DATATYPE, SOURCE, TAG, COMM, REQUEST,
IERROR)
BUF(*)
INTEGER COUNT, DATATYPE, SOURCE, TAG, COMM, REQUEST, IERROR

MPI_Wait waits for a non-blocking send or receive to complete. The syntax for MPI_Wait() is as follows for C and C++.

C Syntax
#include <mpi.h>
int MPI_Wait(MPI_Request *request, MPI_Status *status)

C++ Syntax
#include <mpi.h>
void Request::Wait(Status& status)
void Request::Wait()
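Tying these routines together, here is a hedged sketch of the common exchange pattern: each of two processes posts a non-blocking receive before sending, then waits on the request, which avoids the deadlock that two simultaneous blocking sends can cause.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, other, sendval, recvval;
    MPI_Request request;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    other = 1 - rank;                /* assumes exactly two processes */
    sendval = (rank + 1) * 100;

    /* post the receive first so neither process can block the other */
    MPI_Irecv(&recvval, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &request);
    MPI_Send(&sendval, 1, MPI_INT, other, 0, MPI_COMM_WORLD);
    MPI_Wait(&request, &status);     /* recvval is safe to read after this */

    printf("rank %d received %d\n", rank, recvval);
    MPI_Finalize();
    return 0;
}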


A Summary of Some Other MPI Send/Receive Modes

MPI_Send()  Standard send. May be synchronous or buffering.
            Benefits: flexible trade-off; automatically uses a buffer if available, but falls back to synchronous if not.
            Problems: can hide deadlocks; uncertainty of type makes debugging harder.

MPI_Ssend() Synchronous send. Doesn't return until the receive has also completed.
            Benefits: safest mode; confident that the message has been received.
            Problems: lower performance, especially without non-blocking communication.

MPI_Bsend() Buffered send. Copies data to a buffer; the program is free to continue whilst the message is delivered later.
            Benefits: good performance, but need to be aware of buffer space.
            Problems: buffer management issues.

MPI_Rsend() Ready send. The matching receive must already be posted or the message is lost.
            Benefits: slight performance increase since there's no handshake.
            Problems: risky and difficult to design.

As described previously, the arguments dest and source in the various modes of send are the ranks of the receiving and the sending processes. MPI also allows source to be a "wildcard" through the predefined constants MPI_ANY_SOURCE (to receive from any source) and MPI_ANY_TAG (to receive with any tag). There is no wildcard for dest. Again using the postal analogy, a recipient may be ready to receive a message from anyone, but they can't send a message to just anywhere!
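A brief hedged sketch of the wildcards in use on the root process, which receives one message from each of the other processes in whatever order they arrive and inspects the status structure to see who sent what:

int value, i, numprocs;
MPI_Status status;

MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
for (i = 1; i < numprocs; i++) {
    MPI_Recv(&value, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
             MPI_COMM_WORLD, &status);
    /* the status structure records the actual source and tag */
    printf("received %d from rank %d (tag %d)\n",
           value, status.MPI_SOURCE, status.MPI_TAG);
}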

4.2.5 The Prisoner's Dilemma


The example of the Prisoner's Dilemma (cooperation vs competition) is provided to illustrate how non-blocking communications work. In this example, there are ten rounds between two players, with different payoffs for each. In this particular version the distinction is between cooperation and competition for financial rewards. If both players cooperate they receive $2 for the round. If they both compete, they receive $1 each for the round. But if one adopts a competitive stance and the other a cooperative stance, the competitor receives $3 and the cooperative player nothing.

A serial version of the code is provided (serial-gametheory.c, serial-gametheory.f90). Review it and then attempt a parallel version from the skeleton versions (mpi-skel-gametheory.c, mpi-skel-gametheory.f90). Each process must run one of the players' decision-making, then they both have to transmit their decision to the other, and then update their own tally of the result. Consider using MPI_Send(), MPI_Irecv(), and MPI_Wait(). On completion, review against the solutions provided (mpi-gametheory.c and mpi-gametheory.f90) and submit the tasks with qsub.

4.3 Collective Communications

MPI can also conduct collective communications. These include MPI_Bcast, MPI_Scatter, MPI_Gather, MPI_Reduce, and MPI_Allreduce. A brief summary of their syntax and a description of their effects is provided before a practical example. The basic principle and motivation is that whilst collective communications may provide a performance improvement, they will certainly provide clearer code. Consider the following C snippet of a root processor sending to all:

    if ( 0 == rank ) {

    unsigned int proc_I;

    for ( proc_I=1; proc_I < numProcs; proc_I++ ) {

    MPI_Ssend( &param, 1, MPI_UNSIGNED, proc_I, PARAM_TAG, MPI_COMM_WORLD );

    }

    }

else {
MPI_Recv( &param, 1, MPI_UNSIGNED, 0 /*ROOT*/, PARAM_TAG, MPI_COMM_WORLD, &status );
}

    "eplaced with

    MPI_Bcast( &param, 1, MPI_UNSIGNED, 0/*ROOT*/, MPI_COMM_WORLD );

4.3.1 MPI_Bcast()

MPI_Bcast broadcasts a message from the root process to all processes of the group. The syntax for MPI_Bcast() is as follows for C, Fortran, and C++.

C Syntax
#include <mpi.h>
int MPI_Bcast(void *buffer, int count, MPI_Datatype datatype,
int root, MPI_Comm comm)

Fortran Syntax
INCLUDE 'mpif.h'
MPI_BCAST(BUFFER, COUNT, DATATYPE, ROOT, COMM, IERROR)
BUFFER(*)
INTEGER COUNT, DATATYPE, ROOT, COMM, IERROR

C++ Syntax
#include <mpi.h>
void MPI::Comm::Bcast(void* buffer, int count,
const MPI::Datatype& datatype, int root) const = 0

4.3.2 MPI_Scatter()

MPI_Scatter distributes distinct blocks of data from the root process to each process in the group.


The syntax for MPI_Scatter() is as follows for C, Fortran, and C++.

C Syntax
#include <mpi.h>
int MPI_Scatter(void *sendbuf, int sendcount, MPI_Datatype sendtype,
void *recvbuf, int recvcount, MPI_Datatype recvtype, int root,
MPI_Comm comm)

Fortran Syntax
INCLUDE 'mpif.h'
MPI_SCATTER(SENDBUF, SENDCOUNT, SENDTYPE, RECVBUF, RECVCOUNT, RECVTYPE, ROOT, COMM, IERROR)
SENDBUF(*), RECVBUF(*)
INTEGER SENDCOUNT, SENDTYPE, RECVCOUNT, RECVTYPE, ROOT
INTEGER COMM, IERROR

C++ Syntax
#include <mpi.h>
void MPI::Comm::Scatter(const void* sendbuf, int sendcount,
const MPI::Datatype& sendtype, void* recvbuf,
int recvcount, const MPI::Datatype& recvtype,
int root) const
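As a hedged illustration of the C interface, this fragment scatters one integer from a root array to every process (array size and values are placeholders, and numprocs is assumed not to exceed the array length):

int globalData[8];   /* only meaningful on root; assumes numprocs <= 8 */
int localData, i, rank, numprocs;

MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
if (rank == 0)
    for (i = 0; i < numprocs; i++)
        globalData[i] = i * i;
/* each process, including root, receives one element */
MPI_Scatter(globalData, 1, MPI_INT, &localData, 1, MPI_INT, 0, MPI_COMM_WORLD);
printf("rank %d received %d\n", rank, localData);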

4.3.3 MPI_Gather()

MPI_Gather collects values from every process in the group to the root process.


The input parameters include sendbuf, the address of the send buffer; sendcount, an integer of the number of elements in the send buffer; sendtype, the datatype handle of send buffer elements; recvcount, an integer of the number of elements for any single receive (significant only at root); recvtype, the datatype handle of receive buffer elements (significant only at root); root, the integer rank of the receiving process; and comm, the communicator handle. The output parameters include recvbuf, the address of the receive buffer at root, and the ever-dependable IERROR for Fortran.

The syntax for MPI_Gather() is as follows for C, Fortran, and C++.

C Syntax
#include <mpi.h>
int MPI_Gather(void *sendbuf, int sendcount, MPI_Datatype sendtype,
void *recvbuf, int recvcount, MPI_Datatype recvtype, int root,
MPI_Comm comm)

Fortran Syntax
INCLUDE 'mpif.h'
MPI_GATHER(SENDBUF, SENDCOUNT, SENDTYPE, RECVBUF, RECVCOUNT,
RECVTYPE, ROOT, COMM, IERROR)
SENDBUF(*), RECVBUF(*)
INTEGER SENDCOUNT, SENDTYPE, RECVCOUNT, RECVTYPE, ROOT
INTEGER COMM, IERROR

C++ Syntax
#include <mpi.h>
void MPI::Comm::Gather(const void* sendbuf, int sendcount,
const MPI::Datatype& sendtype, void* recvbuf,
int recvcount, const MPI::Datatype& recvtype, int root)
const = 0
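And the inverse operation, again as a hedged sketch: every process contributes one value and the root collects them in rank order:

int gathered[8];     /* only meaningful on root; assumes numprocs <= 8 */
int localValue, rank, numprocs;

MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
localValue = rank * 10;   /* placeholder per-process result */
MPI_Gather(&localValue, 1, MPI_INT, gathered, 1, MPI_INT, 0, MPI_COMM_WORLD);
if (rank == 0) {
    int i;
    for (i = 0; i < numprocs; i++)
        printf("rank %d contributed %d\n", i, gathered[i]);
}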

4.3.4 MPI_Reduce()

MPI_Reduce combines values from all processes into a single result on the root process.


The input parameters include sendbuf, the address of the send buffer; count, an integer number of elements in the send buffer; datatype, a handle of the datatype of elements in the send buffers; op, a handle of the reduce operation; root, the integer rank of the root process; and comm, the communicator handle. The output parameters are recvbuf, the address of the receive buffer for root, and Fortran's IERROR.

    +he syntax for =6IO"educe8: is as follows for 7, Fortran, and 7LL.

    C S9ntax

    #include

    int MPI_Reduce(void *sendbuf, void *recvbuf, int count,

    MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm)

    Fortran S9ntax

    INCLUDE mpif.h

    MPI_REDUCE(SENDBUF, RECVBUF, COUNT, DATATYPE, OP, ROOT, COMM,

    IERROR)

    SENDBUF(*), RECVBUF(*)

    INTEGER COUNT, DATATYPE, OP, ROOT, COMM, IERROR

    C>> S9ntax

    #include

    void MPI::Intracomm::Reduce(const void* sendbuf, void* recvbuf,

    int count, const MPI::Datatype& datatype, const MPI::Op& op,

    int root) const

MPI reduction operations include the following:

MPI Name      Function
MPI_MAX       Maximum
MPI_MIN       Minimum
MPI_SUM       Sum
MPI_PROD      Product
MPI_LAND      Logical AND
MPI_BAND      Bitwise AND
MPI_LOR       Logical OR
MPI_BOR       Bitwise OR
MPI_LXOR      Logical exclusive OR
MPI_BXOR      Bitwise exclusive OR
MPI_MAXLOC    Maximum and location
MPI_MINLOC    Minimum and location
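For instance, a hedged sketch of a global sum: each process contributes its rank, and MPI_SUM combines the contributions into a single total on the root:

int rank, localValue, globalSum;

MPI_Comm_rank(MPI_COMM_WORLD, &rank);
localValue = rank;   /* placeholder per-process contribution */
MPI_Reduce(&localValue, &globalSum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
if (rank == 0)
    printf("sum of all ranks: %d\n", globalSum);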

4.3.5 Other Collective Communications

Other collective communications include MPI_Allreduce, which combines values from all processes and returns the result to every process rather than only to the root.


4.4 Derived Data Types

Often the data to be communicated is not a contiguous block of a single type; consider sending the first element of each row of a two-dimensional array. The program could send the data one element at a time, e.g.,

double results[5][5];
int i;
for ( i = 0; i < 5; i++ ) {
MPI_Send( &(results[i][0]), 1, MPI_DOUBLE, dest, tag, comm );
}

But this has overhead; message passing is always (relatively) expensive. So instead, a datatype can be created that informs MPI how the data is stored, so it can be sent in one routine.

To create a derived type there are two steps: first construct the datatype with MPI_Type_vector() or MPI_Type_struct(), and then commit the datatype with MPI_Type_commit().

When all the data to send is of the same type, use the vector method, e.g.,

int MPI_Type_vector( int count, int blocklen, int stride, MPI_Datatype old_type,
MPI_Datatype* newtype )

/* Send the first double of each of the 5 rows */
MPI_Datatype newType;
double results[5][5];

MPI_Type_vector( 5, 1, 5, MPI_DOUBLE, &newType );
MPI_Type_commit( &newType );
MPI_Ssend( &(results[0][0]), 1, newType, dest, tag, comm );

Note that when sending a vector, the data on the receiving processor may be of a different type, e.g.:

    double recvData[COUNT*BLOCKLEN];

    double sendData[COUNT][STRIDE];

    MPI_Datatype vecType;

    MPI_Status st;


    MPI_Type_vector( COUNT, BLOCKLEN, STRIDE, MPI_DOUBLE, &vecType );

    MPI_Type_commit( &vecType );

    if( rank == 0 )

    MPI_Send( &(sendData[0][0]), 1, vecType, 1, tag, comm );

    else

    MPI_Recv( recvData, COUNT*BLOCKLEN, MPI_DOUBLE, 0, tag, comm, &st );

If you have specific parts of a struct you wish to send and the members are of different types, use the struct datatype:

int MPI_Type_struct( int count, int blocklen[], MPI_Aint indices[],
MPI_Datatype old_types[], MPI_Datatype* newtype )

For example:

/* Send the Packet structure in a message */
struct Packet {
int a;
double array[3];
char b[10];
};
struct Packet dataToSend;

Another example:

    int blockLens[3] = { 1, 3, 10 };

    MPI_Aint intSize, doubleSize;

    MPI_Aint displacements[3];

    MPI_Datatype types[3] = { MPI_INT, MPI_DOUBLE, MPI_CHAR };

    MPI_Datatype myType;

    MPI_Type_extent( MPI_INT, &intSize ); //# of bytes in an int

    MPI_Type_extent( MPI_DOUBLE, &doubleSize ); // double

    displacements[0] = (MPI_Aint) 0;

    displacements[1] = intSize;

    displacements[2] = intSize + ((MPI_Aint) 3 * doubleSize);


    MPI_Type_struct( 3, blockLens, displacements, types, &myType );

    MPI_Type_commit( &myType );

    MPI_Ssend( &dataToSend, 1, myType, dest, tag, comm );

There are actually other functions for creating derived types:

    MPI_Type_contiguous

    MPI_Type_hvector

    MPI_Type_indexed

    MPI_Type_hindexed

In many applications, the size of a message to receive is unknown before it is received (e.g. the number of particles moving between domains). MPI has a way of dealing with this elegantly. First, the receive side calls MPI_Probe before actually receiving:

int MPI_Probe( int source, int tag, MPI_Comm comm, MPI_Status *status )

It can then examine the status, and find the length using:

int MPI_Get_count( MPI_Status *status, MPI_Datatype datatype, int *count )

Then the application can dynamically allocate the receive buffer, and call MPI_Recv.
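Putting the three steps together, a hedged sketch of the receiving side, with the sending rank and tag as placeholders (stdlib.h assumed included for malloc/free):

int count;
double *buffer;
MPI_Status status;

MPI_Probe(0, 0, MPI_COMM_WORLD, &status);           /* 1. wait for a message */
MPI_Get_count(&status, MPI_DOUBLE, &count);         /* 2. how many doubles? */
buffer = (double *) malloc(count * sizeof(double)); /* 3. allocate and receive */
MPI_Recv(buffer, count, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);
/* ... use buffer ... */
free(buffer);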

4.5 Particle Advector

The particle advector hands-on exercise consists of two parts.

The first example is designed to gain familiarity with the MPI_Scatter() routine as a means of distributing global arrays among multiple processors via collective communication. Use the skeleton code provided and determine the number of particles to assign to each processor. Then use the function MPI_Scatter() to spread the global particle coordinates, ids and tags among the processors.


For a more advanced test, on the root processor only, calculate the particle with the smallest distance from the origin (hint: MPI_Reduce()). If the particle with the smallest distance is < 1.0 from the origin, then flip the direction of movement of all the particles. Then modify your code to use the MPI_Scatterv() function, as sketched after the signature below, to allow the given number of particles to be properly distributed among a variable number of processors.

    int MPI_Scatterv (

    void *sendbuf,

    int *sendcnts,

    int *displs,

    MPI_Datatype sendtype,

    void *recvbuf,

    int recvcnt,

    MPI_Datatype recvtype,

    int root,

    MPI_Comm comm )
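As a hedged sketch (variable names are illustrative, not from the skeleton code), the sendcnts and displs arrays might be built on the root as follows, spreading any remainder over the first few ranks:

int *sendcnts = malloc(numprocs * sizeof(int));
int *displs   = malloc(numprocs * sizeof(int));
int base = nParticles / numprocs;   /* minimum particles per process */
int rem  = nParticles % numprocs;   /* leftover particles */
int i, offset = 0;

for (i = 0; i < numprocs; i++) {
    sendcnts[i] = base + (i < rem ? 1 : 0);
    displs[i]   = offset;           /* element offset into the global array */
    offset += sendcnts[i];
}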

The second example is designed to give a practical example of the use of MPI derived data types. Implement a data type storing the particle information from the previous exercise and use this data type for collective communications. Set up and commit a new MPI derived data type, based on the struct below:

typedef struct Particle {
unsigned int globalId;
unsigned int tag;
Coord coord;
} Particle;

Hint: MPI_Type_struct(), MPI_Type_commit()

Then seed the random number sequence on the root processor only, and determine how many particles are to be assigned among the respective processors (same as for the last exercise) and collectively assign their data using the MPI derived data type you have implemented.
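A hedged sketch of such a commit, assuming Coord is a struct of three doubles; the displacements follow the MPI_Type_extent pattern shown earlier, and in a real code they should be checked against compiler padding:

int blockLens[3] = { 1, 1, 3 };
MPI_Aint uintSize, displacements[3];
MPI_Datatype types[3] = { MPI_UNSIGNED, MPI_UNSIGNED, MPI_DOUBLE };
MPI_Datatype particleType;

MPI_Type_extent( MPI_UNSIGNED, &uintSize );  /* # of bytes in an unsigned int */
displacements[0] = (MPI_Aint) 0;             /* globalId */
displacements[1] = uintSize;                 /* tag */
displacements[2] = 2 * uintSize;             /* coord (assumed three doubles) */
MPI_Type_struct( 3, blockLens, displacements, types, &particleType );
MPI_Type_commit( &particleType );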


4.6 Creating A New Communicator

Each communicator has associated with it a group of ranked processes. Before creating a new communicator, we must first create a group for it. Create a new group by eliminating processes from an existing group:

    MPI_Group worldGroup, subGroup;

    MPI_Comm subComm;

    int *procsToExcl, numToExcl;

    MPI_Comm_group( MPI_COMM_WORLD, &worldGroup );

    MPI_Group_excl( worldGroup, numToExcl, procsToExcl, &subGroup );

    MPI_Comm_create( MPI_COMM_WORLD, subGroup, &subComm );
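A brief usage note, beyond the snippet above: processes that were excluded receive MPI_COMM_NULL rather than a valid communicator, so subsequent calls should be guarded, and the group and communicator handles can be freed when finished.

if ( subComm != MPI_COMM_NULL ) {
    int subRank;
    MPI_Comm_rank( subComm, &subRank );  /* rank within the new communicator */
    /* ... collective operations on subComm ... */
    MPI_Comm_free( &subComm );
}
MPI_Group_free( &subGroup );
MPI_Group_free( &worldGroup );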

4.7 Profiling Parallel Programs

Parallel performance issues include the following:

• Coverage: the percentage of the code that is parallel
• Granularity: the amount of work in each parallel section
• Load balancing
• Locality: the communication structure
• Synchronisation: locking latencies

Since the performance of parallel programs is dependent on so many issues, profiling parallel programs is an inherently difficult task.

TAU (Tuning and Analysis Utilities) is a portable profiling and tracing toolkit for performance analysis of parallel programs written in Java, C, C++ and Fortran.


The steps involved in profiling parallel code are outlined as follows:

• Instrument the source code with TAU macros
• Compile the instrumented code
• Run the program to produce profile.* files for each separate process

The instrumentation of source code can be done manually or with the help of another utility called PDT, which automatically parses source files and instruments them with TAU macros.

4.8 Debugging MPI Applications

It has taken many years for this essential truth to be realised, but software equals bugs. In parallel systems, the bugs are particularly difficult to diagnose, and the core principle of parallelisation invites race conditions and deadlocks. For example, what happens when two processors try to send a message to one another at the same time?

When debugging MPI programs it is usually a good idea to do this in one's own environment, i.e., install (from source) the compilers and version of Open MPI on your own system. The reason for this is that it is quite time-prohibitive to conduct debugging activities on a batch-processing high-performance computer. The HPC systems that we have may run tasks fairly quickly when launched, but they can take some time to begin whilst they are in the queue.

DO NOT RUN JOBS ON THE HEAD NODE
REALLY, DO NOT RUN MULTICORE JOBS ON THE HEAD NODE

It is possible, for small tests, to bypass this by running small jobs interactively (following the instructions given in the Intermediate course), e.g.,


    qsub -l walltime=0:30:0,nodes=1:ppn=2 -I

    module load vpac

    qsub pbs-sendrecv

In general, however, parallel programs are hard to program and hard to debug. Parallelism adds a whole new abstract layer. Although the program is being executed on multiple processors, it may be running in slightly different ways on different data.

Although time-consuming, it is usually appropriate to build the code in serial first to the point that it's working, and working well. As part of this process use version control systems, and engage in unit testing (check each functional component of the code independently) and integration testing (check the interfaces between components) as part of this development. Use standard methods for these tests, such as the use of mid-range, boundary, and out-of-bounds variables.

Because parallelism adds a new level of abstraction, producing a serial version of a code before producing a parallel version is not unlike producing pseudo-code for a serial program. Time and time again it has been shown that modelling significantly improves the quality of a program and reduces errors, thus saving time in the longer run. In the process of engaging in such modelling, developing a defensive style of programming is effective, for example engaging in techniques that prevent deadlocks, or keeping in consideration the state of a condition when running loops or if-else statements. When conducting actual tests on the code, tactically placed printf or write statements will assist.

For example, consider the following simple send-recv programs; compile these with openmpi-gcc as follows:

    module load openmpi-gcc


    mpicc -g mpi-debug.c -o mpi-debug or

mpif90 -g mpi-debug.f90 -o mpi-debug

    qsub -l walltime=0:20:0,nodes=1:ppn=2 -I

    module load vpac

    module load valgrind/3.8.1-openmpi-gcc

Note that an interactive job starts the user in their home directory, requiring a change of directories.

mpiexec with 2 processors is then launched with valgrind debugging the executable, with error output redirected to valgrind.out:

mpiexec -np 2 valgrind ./mpi-debug 2> valgrind.out

Valgrind is a debugging suite that automatically detects many memory management and threading bugs. Whilst typically built for serial applications, it can also be built with the mpicc wrappers, though currently only for GNU GCC or Intel's C++ compiler. It is important to use the same compiler in both the build and the Valgrind test.

The file valgrind.out in this case will contain quite a few errors, but none of these are critical to the operation of the program.

As with serial programs, gdb can also be used for thorough debugging. Execute as:

mpiexec -np [number of processors] gdb ./executable --command=gdb.cmd

Where gdb.cmd is a text file of the commands that you want to send to gdb, e.g.:

module load gdb
mpiexec -np 2 gdb --exec=mpi-debug --command=gdb.cmd

Which should generate a result something like the following:

[lev@trifid166 advancedhpc]$ mpiexec -np 2 gdb --command=gdb.cmd mpi-debug

(license information omitted)

    Reading symbols from /nfs/user2/lev/programming/advancedhpc/mpi-debug...(no

    debugging symbols found)...done.

    Reading symbols from /nfs/user2/lev/programming/advancedhpc/mpi-debug...(no

    debugging symbols found)...done.

    [Thread debugging using libthread_db enabled]

    Using host libthread_db library "/lib64/libthread_db.so.1".

    [Thread debugging using libthread_db enabled]

    Using host libthread_db library "/lib64/libthread_db.so.1".

    [New Thread 0x2aaaad875700 (LWP 19784)]

    [New Thread 0x2aaaad875700 (LWP 19785)]

    [New Thread 0x2aaaadc8b700 (LWP 19786)]

    [New Thread 0x2aaaadc8b700 (LWP 19787)]

    processor 0 final value: 324 with loop # 68

    processor 1 final value: 2346 with loop # 68

    [Thread 0x2aaaadc8b700 (LWP 19786) exited]

[Thread 0x2aaaad875700 (LWP 19784) exited]
[Thread 0x2aaaadc8b700 (LWP 19787) exited]

    [Thread 0x2aaaad875700 (LWP 19785) exited]

    [Inferior 1 (process 19776) exited normally]

    [Inferior 1 (process 19777) exited normally]


This, of course, simply indicates that the program completed successfully with the final values as listed (hooray!). Using a serial debugger with a program that is running in parallel is slightly more difficult. A common hack (and it is a hack) is to find out what process IDs the job is running as, then to log in to the appropriate node and run gdb -p PID. In order to discover these, the following code snippet is usually implemented:

    {

    int i = 0;

    char hostname[256];

    gethostname(hostname, sizeof(hostname));

    printf("PID %d on %s ready for attach\n", getpid(), hostname);

    fflush(stdout);

    while (0 == i)

    sleep(5);

    }

Then at job submission those PIDs will be displayed. For example,

    [lev@trifid166 advancedhpc]$ mpiexec -np 2 mpi-debug

    PID 23166 on trifid166 ready for attach

    PID 23167 on trifid166 ready for attach

Then log in to the appropriate nodes, run gdb -p 23166 and gdb -p 23167, and step through the function stack and set the variable to a non-zero value, e.g.,

    (gdb) set var i = 7

Then set a breakpoint after your block of code and continue execution until the breakpoint is hit (e.g., by adding break in the loops on lines 49 and 64), and use the gdb commands to display the values as they are being generated (e.g., print loop, print value, or info locals).


    mailto:[email protected]:[email protected]