
Advanced High Performance Computing Using Linux

TABLE OF CONTENTS

1.0 Advanced Linux
1.1 Emacs: The Most Advanced Text Editor
1.2 Advanced Scripting
2.0 Advanced HPC
2.1 Computer System Architectures
2.2 Processors and Cores
2.3 Parallel Processing Performance
3.0 Introductory MPI Programming
3.1 The Story of the Message Passing Interface (MPI) and OpenMPI
3.2 Unravelling A Sample MPI Program and OpenMPI Wrappers
3.3 MPI's Introductory Routines
4.0 Intermediate MPI Programming
4.1 MPI Datatypes
4.2 Intermediate Routines
4.3 Collective Communications
4.4 Derived Data Types
4.5 Particle Advector
4.6 Creating A New Communicator
4.7 Profiling Parallel Programs
4.8 Debugging MPI Applications

    1.0 Advanced Linux

1.1 Emacs: The Most Advanced Text Editor

    In the introductory course, we used nano as our text editor. In the intermediate

    course, we used vim. Finally, in this advanced course, we'll provide an

    introduction to Emacs, which can be described as the most advanced text editor

    and is particularly popular among programmers.

Emacs is one of the oldest continuously developed software applications available, first written in 1976 by Richard Stallman, founder of the GNU free software movement. At the time of writing it was up to version 24, with a substantial number of forks and clones developed during its history.

The big feature of Emacs is its extremely high level of built-in commands, customisation, and extensions, so extensive that those explored here only begin to touch the extraordinarily diverse world that is Emacs. Indeed, Eric Raymond notes "[i]t is a common joke, both among fans and detractors of Emacs, to describe it as an operating system masquerading as an editor".

With extensions, Emacs includes LaTeX-formatted documents, syntax highlighting for major programming and scripting languages, a calculator, a calendar and planner, a text-based adventure game, a web browser, a newsreader and email client, and an ftp client. It provides file difference, merging, and version control, and even a Rogerian psychotherapist. Try doing that with Notepad!

This all said, Emacs is not easy for beginners to learn. The level of customisation and the detailed use of meta and control characters does serve as a barrier to immediate entry.

This tutorial will provide a useable introduction to Emacs.

    1.1.1 Starting Emacs

The default Emacs installation on contemporary Linux systems assumes the use of a graphical user interface. This is obviously not the case with an HPC system, but for those with a home installation you should be aware that 'emacs -nw' from the command line will launch the program without the GUI. If you wish to make this the default you should add it as an alias to .bashrc (e.g., alias emacs='emacs -nw').

Emacs is launched by simply typing 'emacs' on the command line. Commands are invoked by a combination of the Control (Ctrl) key and a character key (C-<chr>).

To "break" from a partially entered command, use C-g.

If an Emacs session crashed recently, M-x recover-session can recover the files that were being edited.

The menu bar can be activated with M-`.

The help files are accessed with C-h and the manual with C-h r.

1.1.3 Files, Buffers, and Windows

Emacs has three main data structures, files, buffers, and windows, which are essential to understand.

A file is the actual file on disk. Strictly, when using Emacs one does not actually edit a file. Rather, the file is copied into a buffer, then edited, and then saved. Buffers can be deleted without deleting the file on disk.

The buffer is a data space within Emacs for editing a copy of the file. Emacs can handle many buffers simultaneously, the effective limit being the maximum buffer size, determined by the integer capacity of the processor and memory (e.g., for 64-bit machines, this maximum buffer size is 2^61 - 2 bytes). A buffer has a name, usually taken from the file from which it has copied the data.

A window is the user's view of a buffer. Not all buffers may be visible to the user at once due to the limits of screen size. A user may split the screen into multiple windows. Windows can be created and deleted without deleting the buffer associated with the window.

Emacs also has a blank line below the mode line to display messages, and for input for prompts from Emacs. This is called the minibuffer, or echo area.

1.1.4 Exploring and Entering Text

Cursor keys can be used to move around the text, along with Page Up and Page Down, if the terminal supports them. However, Emacs aficionados will recommend the use of the control key for speed. Common commands include the following; you may notice a pattern in the command logic:

C-v (move page down), M-v (move page up)
C-p (move previous line), C-n (move next line)
C-f (move forward, one character), C-b (move backward, one character)
M-f (move forward, one word), M-b (move backward, one word)
C-a (move to beginning of a line), C-e (move to end of a line)
M-a (move backward, beginning of a sentence)
M-e (move forward, end of a sentence)
M-{ (move backward, beginning of a paragraph), M-} (move forward, end of a paragraph)
M-< (move to beginning of a text), M-> (move to end of a text)
<backspace> (delete the character just before the cursor)
C-d (delete the character on the cursor)
M-<backspace> (cut the word before the cursor)
M-d (cut the word after the cursor)
C-k (cut from the cursor position to end of line)
M-k (cut to the end of the current sentence)
C-q (prefix command; use when you want to enter a control key into the buffer, e.g., C-q ESC inserts an Escape)

Like the page-up and page-down keys on a standard keyboard, you will discover that Emacs also interprets the Backspace and Delete keys as expected.

A selection can be cut (or 'killed' in Emacs lingo) by marking the beginning of the selected text with C-SPC (space), moving to the end of the selection with standard cursor movements, and entering C-w. Text that has been cut can be pasted ('yanked') by moving the cursor to the appropriate location and entering C-y.

Emacs commands also accept a numeric input for repetition, in the form of C-u, the number of times the command is to be repeated, followed by the command (e.g., C-u 8 C-n moves eight lines down the screen).

1.1.5 File Management

There are only three main file manipulation commands that a user needs to know: how to find a file, how to save a file from a buffer, and how to save all.

The first command is C-x C-f, shorthand for "find-file". First this command prompts for the name of the file. If it is already copied into a buffer it will switch to that buffer. If it is not, it will create a new buffer with the name requested.

For the second command, to save a buffer to a file with the buffer name, use C-x C-s, shorthand for "save-buffer".

The third command is C-x s. This is shorthand for "save-some-buffers" and will cycle through each open buffer and prompt the user for their action (save, don't save, check and maybe save, etc.).

1.1.6 Buffer Management

There are four main commands relating to buffer management that a user needs to know: how to switch to a buffer, how to list existing buffers, how to kill a buffer, and how to read a buffer in read-only mode.

To switch to a buffer, use C-x b. This will prompt for a buffer name, and switch the buffer of the current window to that buffer. It does not change your existing windows. If you type a new name, it will create a new empty buffer.

To list current active buffers, use C-x C-b. This will provide a new window which lists current buffers: name, whether they have been modified, their size, and the file that they are associated with.

To kill a buffer, use C-x k. This will prompt for the buffer name, and then remove the data for that buffer from Emacs, with an opportunity to save it. This does not delete any associated files.

To toggle read-only mode on a buffer, use C-x C-q.

1.1.7 Window Management

Emacs has its own windowing system, consisting of several areas of framed text. The behaviour is similar to a tiling window manager: none of the windows overlap with each other.

Commonly used window commands include:

C-x 0 delete the current window
C-x 1 delete all windows except the selected window
C-x 2 split the current window horizontally
C-x 3 split the current window vertically
C-x ^ make selected window taller
C-x } make selected window wider
C-x { make selected window narrower
C-x + make all windows the same height

A common use is to bring up other documents or menus. For example, with the key sequence C-h one usually calls for help files. If this is followed by k, it will open a new vertical window, and with C-f, it will display the help information for the command C-f (i.e., C-h k C-f). This new window can be closed with C-x 1.

1.1.8 Kill and Yank, Search-Replace, Undo

Emacs is notable for having a very large undo sequence, limited by system resources rather than application resources. This undo sequence is invoked with C-_ (control underscore), or with C-x u. However, it has a special feature: by engaging in a simple navigation command (e.g., C-f) the undo action is pushed to the top of the stack, and therefore the user can undo an undo command.

1.1.9 Other Features

Emacs can make it easier to read C and C++ by colour-coding such files, through the ~/.emacs configuration file, by adding (global-font-lock-mode t).

Programmers also find the feature of being able to run the GNU debugger (GDB) from within Emacs useful as well. The command M-x gdb will start up gdb. If there's a breakpoint, Emacs automatically pulls up the appropriate source file, which gives a better context than the standard GDB.
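A one-line way to append this setting from the shell, assuming the default ~/.emacs location:

echo "(global-font-lock-mode t)" >> ~/.emacs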

1.2 Advanced Scripting

Good knowledge of scripting is required for any advanced Linux user, especially those who find that they have regular tasks, such as the processing of data through a program. Shell scripting is not terribly difficult, although sometimes some austere syntax bugs may prove frustrating; the machine is just doing what you asked it to. Despite their often underrated utility, shell scripts are not the answer to everything. They are not great at resource-intensive tasks (e.g., extensive file operations) where speed is important. They are not recommended for heavy-duty maths operations (use C, C++, or Fortran instead). They are not recommended in situations where data structures, multidimensional arrays (it's not a database!) and port/socket I/O are important.

In the Intermediate Course, we looked at scripting in reference to regular expression utilities, such as sed and the programming language awk, along with some simple examples of using Linux command invocations as variables in a backup script; some sample "for", "while", "do/done" and "until" loops along with simple, optional, ladder, and nested conditionals using "if", "then", "else", "elif" and "fi"; the use of "break" and "continue"; the "case" conditional; and "select" for user input. The implementation of these into PBS job submission scripts was also illustrated. In this Advanced course we will revisit these concepts but with more sophisticated and complex examples. In addition there will be a close look at internal commands and filters, process substitution, functions, arrays, and debugging.

1.2.1 Scripts With Variables

The simplest script is simply one that runs a list of system commands. At the least this saves the time of retyping the sequence each time it is used, and reduces the possibility of error. For example, in the Intermediate course, the following script was recommended to calculate the disk use in a directory. It's a good script, very handy, but how often would you want to type it? Instead, enter it once and keep it. You will recall, of course, that a script starts with an invocation of the shell, followed by commands.

emacs diskuse.sh

#!/bin/bash
du -sk * | sort -nr | cut -f2 | xargs -d "\n" du -sh > diskuse.txt

C-x C-c, y for save

chmod +x diskuse.sh

As described in the Intermediate course, the script runs a disk usage in summary, sorts in order of size, and exports to the file diskuse.txt. The -d "\n" is to ignore spaces in filenames.

Making the script a little more complex, variables are usually better than hard-coded values. There are two potential variables in this script: the wildcard '*' and the exported filename "diskuse.txt". In the former case, we'll keep the wildcard as it allows a certain portability of the script; it can run in any directory it is invoked from. For the latter case however, we'll use the date command so that a history of disk use can be created which can be reviewed for changes. It's also good practise to alert the user when the script is completed and, although it is not always necessary, it is also good practise to cleanly finish any script with 'exit'.

emacs diskuse.sh

#!/bin/bash
DU=diskuse$(date +%Y%m%d).txt
du -sk * | sort -nr | cut -f2 | xargs -d "\n" du -sh > $DU
echo "Disk summary completed and sorted."
exit

C-x C-c, y for save
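A quick usage sketch (the date-stamped filename shown is hypothetical; it follows the $DU pattern above and varies by day):

./diskuse.sh
less diskuse20130820.txt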

1.2.2 Variables and Conditionals

Another example is a script with conditionals as well as variables. A common conditional, and sadly often forgotten, is whether or not a script has the requisite files for input and output specified. If an input file is not specified, a script that performs an action on the file will simply go idle and never complete. If an output file is hard-coded, then the person running the script runs the risk of overwriting a file with the same name, which could be a disaster.

The following script searches through any specified text file for text before and after the ubiquitous email "@" symbol and outputs these as a csv file through use of grep, sed, and sort (for neatness). If the input or the output file are not specified, it exits after echoing the error.

emacs findemails.sh

#!/bin/bash
# Search for email addresses in file, extract, turn into csv with designated file name
INPUT=${1}
OUTPUT=${2}
{
if [ ! -f "$1" -o -z "$2" ]; then
echo "Input file not found, or output file not specified. Exiting script."
exit 0
fi
}
grep --only-matching -E '[.[:alnum:]]+@[.[:alnum:]]+' $INPUT > $OUTPUT
sed -i 's/$/,/g' $OUTPUT
sort -u $OUTPUT -o $OUTPUT
sed -i '{:q;N;s/\n/ /g;t q}' $OUTPUT
echo "Data file extracted to" $OUTPUT
exit

C-x C-c, y for save

chmod +x findemails.sh

Test this file with hidden.txt as the input text and found.csv as the output text. The output will include a final comma on the last line, but this is potentially useful if one wants to run the script with several input files and append to the same output file (simply change the single redirection in the grep statement to a double, appended, redirection).

A serious weakness of the script (so far) is that it will gather any string with the '@' symbol in it, regardless of whether it's a well-formed email address or not. So it's not quite suitable for screen-scraping usenet for email addresses to turn into a spammers list. But it's getting close.
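As a sketch of how the pattern could be tightened (an illustrative assumption, not part of the course script), requiring a dot-separated domain and an alphabetic top-level domain of at least two characters filters out most malformed matches:

grep --only-matching -E '[[:alnum:]._%+-]+@[[:alnum:].-]+\.[[:alpha:]]{2,}' $INPUT > $OUTPUT

This is still not a full validation of the address grammar, but it rejects strings such as "user@host" which lack a domain suffix.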

1.2.3 Reads

The read command simply reads a line from standard input. By applying the -n option it can read in a number of characters, rather than a whole line, so -n1 is "read a single character". The use of the -r option reads the input as raw input, so that the backslash key (for example) doesn't act like a newline escape character, and the -p option displays the prompt. Plus, a -t timeout-in-seconds option can also be added. Combined, these can be used to the effect of "press any key to continue", with a limited timeframe.

Add the following to findemails.sh at the end of the file.

emacs findemails.sh

#!/bin/bash
# Search for email addresses in file, extract, turn into csv with designated file name
..
..
read -t5 -n1 -r -p "Press any key to see the list, sorted and with unique records..."
if [ $? -eq 0 ]; then
echo A key was pressed.
else
echo No key was pressed.
exit 0
fi
less $OUTPUT | \
# Output file, piped through sort and uniq.
sort | uniq
exit

C-x C-c, y for save

1.2.4 Special Characters

Scripts essentially consist of commands, keywords, and special characters. Special characters have meaning beyond their literal meaning (a meta-meaning, if you like). Comments are the most common special meaning.

Any text following a # (with the exception of #!) is a comment and will not be executed. Comments may begin at the beginning of a line, following whitespace, following the end of a command, and even be embedded within a piped command (as above in section 1.2.3).

A comment ends at the end of the line, and as a result a command may not follow a comment on the same line. A quoted or an escaped # in an echo statement does not begin a comment.

Another special character is the command separator, a semicolon, which is used to permit two or more commands on the same line. This is already shown by the various tests in the script (e.g., if [ ! -f "$1" -o -z "$2" ]; then and if [ $? -eq 0 ]; then). Note the space after the semicolon. In contrast, a double semicolon (;;) represents a terminator in a case option, which was encountered in the extract script in the Intermediate course.

..
case $1 in
*.tar.bz2) tar xvjf $1 ;;
*.tar.gz) tar xvzf $1 ;;
*.bz2) bunzip2 $1 ;;
..
..
esac

In contrast, the colon acts as a null command. Whilst this obviously has a variety of uses (e.g., an alternative to the touch command), a really practical advantage is that it comes with a true exit status, and as such it can be used as a placeholder in if/then tests. An example from the Intermediate course:

for i in *.plot.dat; do
if [ -f $i.tmp ]; then
: # do nothing and exit if-then
else
touch $i.tmp
fi
done

The use of the null command as the test at the beginning of a loop will cause it to run endlessly (e.g., while :), as the null command always evaluates as true. Note that the colon is also used as a field separator in /etc/passwd and in the $PATH variable.
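A minimal sketch of that endless form, should you ever actually want it (interrupt with Ctrl-C):

while :
do
    echo "still looping..."
    sleep 1
done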

A dot (.) has multiple special character uses. As a command it sources a filename, importing the code into a script, rather like the #include directive in a C program. This is very useful in situations when multiple scripts use a common data file (e.g., . hidden.txt). As part of a filename, of course, as was shown in the Introductory course, the . represents the current working directory (e.g., cp -r /path/to/directory/ ., and of course .. for the parent directory). A third use for the dot is in regular expressions, matching one character per dot. A final use is multiple dots in sequence in a loop, e.g.,

for a in {1..10}
do
echo -n "$a "
done

Like the dot, the comma operator has multiple uses. Usually it is used to link multiple arithmetic calculations. This is typically used in for loops, with a C-like syntax, e.g.,

for ((a=1, b=1; a<=LIMIT; a++, b++))
do
echo -n "$a-$b "
done

A double-quote on a value does not change variable substitution. This is called partial quoting, sometimes referred to as weak quoting. Using single quotes, however, causes the variable name to be used literally, and no substitution will take place; this is full quoting, sometimes referred to as strong quoting. For example, a strict single-quoted directory listing of ls with a wildcard will only provide files that are expressed by the * symbol (which isn't a very good file name). Compare ls * with ls '*'. This example will also work with double quotes and indeed, double-quotes are generally preferable as they prevent reinterpretation of all special characters except $, `, and \. These are usually the symbols which are wanted in their interpreted mode. As the escape character has a literal interpretation with single quotes, enclosing a single quote within single quotes will not work as expected.

Related to quoting is the use of the backslash (\) to escape single characters. Do not confuse it with the forward slash (/), which has multiple uses: as the separator in pathnames (e.g., /home/train01), but also as the division operator.

In some scripts backticks (`) are used for command substitution, where the output of a command can be assigned to a variable. Whilst this is not a POSIX standard, it does exist for historical reasons. Nesting commands with backticks also requires escape characters; the deeper the nesting, the more escape characters are required (e.g., echo `echo \`echo \\\`pwd\\\`\``). The preferred and POSIX-standard method is to use the dollar sign and parentheses, e.g., echo "Hello, $(whoami)." rather than echo "Hello, `whoami`."
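A minimal sketch of the difference between the two quoting styles (the variable name is illustrative):

#!/bin/bash
name=$(whoami)
echo "Hello, $name"   # weak (partial) quoting: substitution occurs
echo 'Hello, $name'   # strong (full) quoting: prints Hello, $name literally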

2.0 Advanced HPC

2.1 Computer System Architectures

As explained in the first, introductory, course, "high-performance computing (HPC) is the use of supercomputers and clusters to solve advanced computation problems". All supercomputers ("a nebulous term for a computer that is at the frontline of current processing capacity") in contemporary times use parallel computing, "the submission of jobs or processes over one or more processors and by splitting up the task between them".

It is possible to illustrate the degree of parallelisation by using Flynn's Taxonomy of Computer Systems (1966), where each process is considered as the execution of a pool of instructions (instruction stream) on a pool of data (data stream). From this there are four basic possibilities:

Single Instruction Stream, Single Data Stream (SISD)
Single Instruction Stream, Multiple Data Streams (SIMD)
Multiple Instruction Streams, Single Data Stream (MISD)
Multiple Instruction Streams, Multiple Data Streams (MIMD)

2.1.1 Single Instruction Stream, Single Data Stream (SISD)

(Image from Oracle Essentials, 4th edition, O'Reilly Media, 2007)

This is the simplest and, until recently, the most common processor architecture on desktop computer systems. Also known as a uniprocessor system, it offers a single instruction stream and a single data stream. Uniprocessors could, however, simulate or include concurrency through a number of different methods:

a) It is possible for a uniprocessor system to run processes concurrently by switching between one and another.

b) Superscalar instruction-level parallelism can be used on uniprocessors. More than one instruction during a clock cycle is simultaneously dispatched to different functional units on the processor.

c) Instruction prefetch, where an instruction is requested from main memory before it is actually needed and placed in a cache. This often also includes a prediction algorithm of what the instruction will be.

d) Pipelines, on the instruction level or the graphics level, can also serve as an example of concurrent activity. An instruction pipeline (e.g., RISC) allows multiple instructions on the same circuitry by dividing the task into stages. A graphics pipeline implements different stages of rendering operations to different arithmetic units.

2.1.2 Single Instruction Stream, Multiple Data Streams (SIMD)

SIMD architecture represents a situation where a single processor performs the same instruction on multiple data streams. This commonly occurs in contemporary multimedia processors, for example the MMX instruction set from the 1990s, which led to Motorola's PowerPC AltiVec, and in more contemporary times the AVX (Advanced Vector Extensions) instruction set used in Intel Sandy Bridge processors and AMD's Bulldozer processor. These developments have primarily been orientated towards real-time graphics, using short vectors. Contemporary supercomputers are invariably MIMD clusters which can implement short-vector SIMD instructions.

SIMD was used especially in the 1970s, notably on the various Cray systems. For example, the Cray-1 (1976) had eight "vector registers," which held sixty-four 64-bit words each (long vectors), with instructions applied to the registers. Pipeline parallelism was used to implement vector instructions, with separate pipelines for different instructions, which themselves could be run in batch and pipelined (vector chaining). As a result the Cray-1 could have a peak performance of 240 mflops; extraordinary for the day, and even acceptable in the early 2000s.

SIMD is also known as vector processing or data parallelism, in comparison to a regular scalar (SISD) CPU. SIMD lines up a row of scalar data (of uniform type) as a vector and operates on it as a unit. For example, inverting an RGB picture to produce its negative, or altering its brightness, etc. Without SIMD each pixel would have to be fetched to memory, the instruction applied to it, and the result returned. With SIMD the same instruction is applied to all the data, depending on the availability of cores, i.e., get n pixels, apply instruction, return.

The main disadvantages of SIMD, within the limitations of the process itself, are that it requires additional registers, power consumption, and heat.

2.1.3 Multiple Instruction Streams, Single Data Stream (MISD)

Multiple Instruction, Single Data (MISD) occurs when different operations are performed on the same data. This is quite rare and indeed debateable, as it is reasonable to claim that once an instruction has been performed on the data, it's not the same data anymore. If one doesn't take this definition, and allows for a variety of instructions to be applied to the same data which can change, then various pipeline architectures can be considered MISD.

Systolic arrays are another form of MISD. They are different to pipelines because they have a non-linear array structure, they have multidirectional data flow, and each processing element may even have its own local memory. In this situation a matrix pipe network arrangement of processing units computes data and stores it independently of each other. Matrix multiplication is an example of such an array in algorithmic form, where one matrix is introduced one row at a time from the top of the array, whereas another matrix is introduced one column at a time.

MISD machines are rare; the Cisco PXF processor is an example. They can be fast and scalable, as they do operate in parallel, but they are really difficult to build.

2.1.4 Multiple Instruction Streams, Multiple Data Streams (MIMD)

Multiple Instruction, Multiple Data (MIMD) machines have independent and asynchronous processes that can operate on a number of different data streams. They are now the mainstream in contemporary computer systems and thus can be further differentiated between multiprocessor computers and their extension, multicomputer multiprocessors. As the name clearly indicates, the former refers to single machines which have multiple processors, and the latter to a cluster of these machines acting as a single entity.

Multiprocessor systems can be differentiated between shared memory and distributed memory. Shared memory systems have all processors connected to a single pool of global memory (whether by hardware or by software). This may be easier to program, but it's harder to achieve scalability. Such an architecture is quite common in single-system-unit multiprocessor machines.

With distributed memory systems, each processor has its own memory. Finally, another combination is distributed shared memory, where the (physically separate) memories can be addressed as one (logically shared) address space. A variant combined method is to have shared memory within each multiprocessor node, and distributed memory between them.

2.2 Processors and Cores

2.2.1 Uni- and Multi-Processors

A further distinction needs to be made between processors and cores. A processor is a physical device that accepts data as input and provides results as output. A uniprocessor system has one such device, although the definitions can become ambiguous. In some uniprocessor systems it is possible that there is more than one, but the entities engage in separate functions. For example, a computer system that has one central processing unit may also have a co-processor for mathematic functions and a graphics processor on a separate card. Is that system uniprocessor? Arguably not, as the co-processor will be seen as belonging to the same entity as the CPU, while the graphics processor will have different memory and system I/O, and will be dealing with different peripherals. In contrast, a multiprocessor system does share memory, system I/O, and peripherals. But then the debate becomes murky with the distinction between shared and distributed memory discussed above.

2.2.2 Uni- and Multi-Core

In addition to the distinction between uniprocessor and multiprocessor there is also the distinction between unicore and multicore processors. A unicore processor carries out the usual functions of a CPU, according to the instruction set: data handling instructions (set register values, move data, read and write), arithmetic and logic functions (add, subtract, multiply, divide, bitwise operations for conjunction and disjunction, negate, compare), and control-flow functions (conditionally branch to another section of a program, indirectly branch, and return). A multicore processor carries out the same functions, but with independent central processing units (note lower case) called 'cores'. Manufacturers integrate the multiple cores onto a single integrated circuit die, or onto multiple dies in a single chip package.

In terms of theoretical architecture, a uniprocessor system could be multicore, and a multiprocessor system could be unicore. In practise the most common contemporary architecture is multiprocessor and multicore. The number of cores is represented by a prefix. For example, a dual-core processor has two cores (e.g. AMD Phenom II X2, Intel Core Duo), a quad-core processor contains four cores (e.g. AMD Phenom II X4, Intel i3, i5, and i7), a hexa-core processor contains six cores (e.g. AMD Phenom II X6, Intel Core i7 Extreme Edition 980X), an octo-core or octa-core processor contains eight cores (e.g. Intel Xeon E7-2820, AMD FX-8350), etc.

2.2.3 Uni- and Multi-Threading

In addition to the distinctions between processors and cores, whether uni or multi, there is also the question of threads. An execution thread is the smallest processing unit in an operating system. A thread is typically contained inside a process. Multiple threads can exist within the same process and share resources. On a uniprocessor, multithreading generally occurs by switching between different threads, engaging in time-division multiplexing with the processor switching between the different threads, which may give the appearance that the task is happening at the same time. On a multiprocessor or multicore system, threads become truly concurrent, with every processor or core executing a separate thread simultaneously.

2.2.4 Why Is It A Multicore Future?

Ideally, don't we want clusters of multicore multiprocessors with multithreaded instructions? Of course we do; but think of the heat that this generates, and think of the potential for race conditions (e.g., deadlocks, data integrity issues, resource conflicts, interleaved execution issues). These are all fundamental problems with computer architecture.

One of the reasons that multicore multiprocessor clusters have become popular is that clock rate has pretty much stalled. Apart from the physical reasons, it is uneconomical: it's simply not worth the cost of increasing the frequency of the clock rate in terms of the power consumed and the heat dissipated. Intel calls the rate/heat trade-off a "fundamental theorem of multicore processors".

New multicore systems are being developed all the time. Using RISC CPUs, Tilera released 64-core processors in 2007 and, in 2009, a one-hundred-core processor. In 2012 Tilera founder Dr. Agarwal was leading a new MIT effort dubbed The Angstrom Project. It is one of four DARPA-funded efforts aimed at building exascale supercomputers. The goal is to design a chip with 1,000 cores.

2.3 Parallel Processing Performance

2.3.1 Speedup and Locks

Parallel programming and multicore systems should mean better performance. This can be expressed as a ratio called speedup:

Speedup (p) = Time (serial) / Time (parallel)

This is varied by the number of processors, S = T(1)/T(p), where T(p) represents the execution time taken by the program running on p processors, and T(1) represents the time taken by the best serial implementation of the application measured on one processor.

Linear, or ideal, speedup is when S(p) = p; for example, double the processors resulting in double the speedup.
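A quick worked example with hypothetical numbers: a job that takes 100 seconds in its best serial implementation and 30 seconds on four processors achieves

S(4) = T(1)/T(4) = 100/30 ≈ 3.3

which falls somewhat short of the linear ideal of S(4) = 4 (an efficiency of about 83%).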

However, parallel programming is hard. More complexity means more bugs. Correctness in parallelisation requires synchronisation (locking). Synchronisation and atomic operations cause loss of performance and communication latency. A probable issue in parallel computing is deadlocks, where two or more competing actions are each waiting for the other to finish, and thus neither ever does. An apocryphal story of a Kansas railroad statute vividly illustrates the problem of a deadlock:

"When two trains approach each other at a crossing, both shall come to a full stop and neither shall start up again until the other has gone."

(A similar example is a livelock; the states of the processes involved in the livelock constantly change with regard to one another, none progressing.)

Locks are currently manually inserted in typical programming languages; without locks, programs can be put in an inconsistent state. Multiple locks in different places and orders can lead to deadlocks. Manual lock insertion is error-prone, tedious, and difficult to maintain. Does the programmer know what parts of a program will benefit from parallelisation? To ensure that parallel execution is safe, a task's effects must not interfere with the execution of another task.

2.3.2 Amdahl's Law

An example of an embarrassingly parallel task is image rendering, where each pixel is rendered independently. Such tasks are often called "pleasingly parallel". To give an example using the R programming language, the SNOW (Simple Network of Workstations) package allows for embarrassingly parallel computations (yes, we have this installed).

Whilst originally expressed by Gene Amdahl in 1967, it wasn't until over twenty years later, in 1988, that an alternative by John L. Gustafson and Edwin H. Barsis was proposed. Gustafson noted that Amdahl's Law assumed a computation problem of fixed data set size. Gustafson and Barsis observed that programmers tend to set the size of their computational problems according to the available equipment; therefore, as faster and more parallel equipment becomes available, larger problems can be solved. Thus scaled speedup occurs: although Amdahl's law is correct in a fixed sense, it can be circumvented in practise by increasing the scale of the problem.

If the problem size is allowed to grow with P, then the sequential fraction of the workload becomes less and less important. A common metaphor is based on driving (computation), time, and distance (computational task). Under Amdahl's Law, if a car has been travelling 40 km/h for an hour and needs to reach a point 80 km from the point of origin, no matter how fast the vehicle travels it can only approach a maximum average of 80 km/h before reaching the 80 km point, even if it travelled at infinite speed, as the first hour has already passed. With the Gustafson-Barsis Law, it doesn't matter that the first hour was at a plodding 40 km/h; the average can be increased indefinitely, given enough time and distance. Just make the problem bigger!
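For reference, the standard textbook formulations of the two laws: with p processors and f the parallelisable fraction of the workload, Amdahl's Law gives

S(p) = 1 / ((1 - f) + f/p)

while the Gustafson-Barsis Law, with s the serial fraction of the scaled (parallel) execution time, gives the scaled speedup

S(p) = p - s(p - 1)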


(Image from Wikipedia)

3.0 Introductory MPI Programming

3.1 The Story of the Message Passing Interface (MPI) and OpenMPI

The Message Passing Interface (MPI) is a widely used standard, initially designed by academia and industry starting in 1991, to run on parallel computers. The goal of the group was to ensure source-code portability, and as a result they produced a standard that defines an interface and specific functionality. As a standard, syntax and semantics are defined for core library routines which allow programmers to write message-passing programs in Fortran or C.

Some implementations of these core library routine specifications are available as free and open-source software, such as Open MPI. Open MPI combined three previous well-known implementations, namely FT-MPI from the University of Tennessee, LA-MPI from Los Alamos National Laboratory, and LAM/MPI from Indiana University, each of which excelled in particular areas, with additional contributions from the PACX-MPI team at the University of Stuttgart. Open MPI combines the quality peer review of a scientific free and open-source software project, and has been used in many of the world's top-ranking supercomputers.

Major milestones in the development of MPI include the following:

* 1991: Decision to initiate Standards for Message Passing in a Distributed Memory Environment.
* 1992: Workshop on the above held.
* 1992: Preliminary draft specification released for MPI.
* 1994: MPI-1. A specification, not an implementation; a library, not a language. Designed for C and Fortran 77.
* 1997: MPI-2. Extends the message-passing model to include parallel I/O, includes C++/Fortran 90 bindings, interaction with threads, and more.
* 2007: MPI Forum reconvened; MPI-3 development begins.

The standard utilised in this course is MPI-2.

The message-passing paradigm, as it is called, is attractive as it is portable on a wide variety of distributed architectures, including distributed and shared-memory multiprocessor systems, networks of workstations, or even potentially a combination thereof. Although originally designed for distributed architectures (unicore workstations connected by a common network), which were popular at the time the standard was initiated, shared-memory symmetric multiprocessing systems connected over networks created hybrid distributed/shared memory systems; that is, each system has shared memory within each machine, but not memory distributed between machines, with data distributed over the network communications. The MPI library standards and implementations were modified to handle both types of memory architectures.

(Image from Lawrence Livermore National Laboratory, n.d.)

Using MPI is a matter of some common sense. It is the only message-passing library which can really be considered a standard. It is supported on virtually all HPC platforms, and has replaced all previous message-passing libraries, such as PVM, PARMACS, EUI, NX, and Chameleon, to name a few predecessors. Programmers like it because there is no need to modify their source code when it is ported to a different system, as long as that system also supports the MPI standard (there may be other reasons, however, to modify the code!). MPI has excellent performance, with vendors able to exploit hardware features for optimisation.

The core principle is that many processors should be able to cooperate to solve a problem by passing messages to each other through a common communications network. The flexible architecture does overcome serial bottlenecks, but it also does require explicit programmer effort (the "questing beast" of automatic parallelisation remains somewhat elusive). The programmer is responsible for identifying opportunities for parallelism and implementing algorithms for parallelisation using MPI.

MPI programming is best where there are not too many small communications, and where a coarse-level break-up of tasks or data is possible.

"In cases where the data layout is fairly simple, and the communications patterns are regular, this [data-parallel approach] is an excellent approach. However, when dealing with dynamic, irregular data structures, data parallel programming can be difficult, and the end result may be a program with sub-optimal performance."

(Warren, Michael S., and John K. Salmon. "A portable parallel particle program." Computer Physics Communications 87.1 (1995): 266-290.)

3.2 Unravelling A Sample MPI Program and OpenMPI Wrappers

For the purposes of this course, copy a number of files to the home directory:

cd ~
cp -r /common/advcourse .

In the Intermediate course, an example mpi-helloworld.c program was illustrated with an associated PBS script. Let's recall what that included, and the explanation in the C program and in the PBS script that launched it.

This is the text for mpi-helloworld.c:

#include <stdio.h>
    Standard include for C programs.
#include "mpi.h"
    Standard include for MPI programs.
int main( argc, argv )
    Beginning of the main function; establish arguments and vector. To incorporate input files, argc (argument count) is the number of arguments, and argv (argument vector) is an array of characters representing the arguments.
int argc;
    Argument count is an integer.
char **argv;
    Argument vector is a string of characters.
{
int rank, size;
    Set rank and size from the inputs.
MPI_Init( &argc, &argv );
    Initialises the MPI execution environment. The input parameter argc is a pointer to the number of arguments, and argv is a pointer to the argument vector.
MPI_Comm_size( MPI_COMM_WORLD, &size );
    Determines the size of the group associated with a communicator. The input parameter is simply a handle (containing all of the processes); the output parameter, size, is an integer of the number of processes in the group.
MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    As above, except rank is the rank of the calling process.
printf( "Hello world from process %d of %d\n", rank, size );
    Printing Hello World from each process.
MPI_Finalize();
    Terminates the MPI execution environment.
return 0;
    A successful program finishes!
}

It is compiled into an executable with the command:

mpicc -o mpi-helloworld mpi-helloworld.c

This is the text for the batch file pbs-helloworld, which is launched with qsub and can be reviewed with less:

qsub pbs-helloworld
less pbs-helloworld

The sample "hello world" program should be understandable to any C programmer (indeed, any programmer) and, with the MPI-specific annotations, it should be clear what is going on. It is the same as any other program, but with a few MPI-specific additions. For example, one can check the PGI mpi.h with the following:

less /usr/local/openmpi/1.6.3-pgi/include/mpi.h

MPI compiler wrappers are used to compile MPI programs; they perform basic error checking, integrate the MPI include files, link to the MPI libraries, and pass switches to the underlying compiler. The wrappers are as follows:

mpif77: Open MPI Fortran 77 wrapper compiler
mpif90: Open MPI Fortran 90 wrapper compiler
mpicc: Open MPI C wrapper compiler
mpicxx: Open MPI C++ wrapper compiler

Open MPI is comprised of three software layers: OPAL (Open Portable Access Layer), ORTE (Open Run-Time Environment), and OMPI (Open MPI). Each layer provides the following wrapper compilers:

OPAL: opalcc and opalc++
ORTE: ortecc and ortec++
OMPI: mpicc, mpic++, mpicxx, mpiCC (only on systems with case-sensitive file systems), mpif77, and mpif90. Note that mpic++, mpicxx, and mpiCC all invoke the same underlying C++ compiler with the same options. All are provided as compatibility with other MPI implementations.

The distinctions between Fortran and C routines in MPI are fairly minimal. All the names of MPI routines and constants in both C and Fortran begin with the same MPI_ prefix. The main differences are:

* The include files are slightly different: in C, mpi.h; in Fortran, mpif.h.
* Fortran MPI routine names are in uppercase (e.g., MPI_INIT), whereas C-compatible MPI routine names are upper and lowercase (e.g., MPI_Init).
* The arguments to MPI_Init are different; an MPI C program can take advantage of command-line arguments.
* The arguments in MPI C functions are more strongly typed than they are in Fortran, resulting in specific types in C (e.g., MPI_Comm, MPI_Datatype), whereas MPI Fortran uses integers.
* Error codes are returned in a separate argument for Fortran, as opposed to the return value for C functions.

Consider the mpi-helloworld program in Fortran (mpi-helloworld.f90):

! Fortran MPI Hello World
    Comment.
program hello
    Program name.
include 'mpif.h'
    Include file for MPI.
integer rank, size, ierror, tag, status(MPI_STATUS_SIZE)
    Variables.
call MPI_INIT(ierror)
    Start MPI.
call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)
    Number of processors.
call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
    Process IDs.
print*, 'node', rank, ': Hello world'
    Each processor prints "Hello World".
call MPI_FINALIZE(ierror)
    Finish MPI.
end

Compile this with mpif90 (the Fortran 90 wrapper) and submit with qsub:

mpif90 mpi-helloworld.f90 -o mpi-helloworld
qsub pbs-helloworld

The mpi-helloworld program is an example of using MPI in a manner that is similar to a Single Instruction Multiple Data architecture. The same instruction stream (print hello world) is used across multiple processors. It is perhaps best described as Single Program Multiple Data, as it obtains the effect of running the same program multiple times, or, if you like, different programs with the same instructions.

3.3 MPI's Introductory Routines

3.3.1 MPI_Init

This routine initialises the MPI execution environment; it is here that the world communicator, MPI_COMM_WORLD, is created. Communicators are considered analogous to the mail or telephone system; every message travels in the communicator, with every message-passing call having a communicator argument.

The input parameters are argc, a pointer to the number of arguments, and argv, the argument vector. These are for C and C++ only. The Fortran-only output parameter is IERROR, as integer.

The syntax for MPI_Init() is as follows for C, Fortran, and C++.

C Syntax
#include <mpi.h>
int MPI_Init(int *argc, char ***argv)

Fortran Syntax
INCLUDE 'mpif.h'
MPI_INIT(IERROR)
INTEGER IERROR

C++ Syntax
#include <mpi.h>
void MPI::Init(int& argc, char**& argv)
void MPI::Init()

3.3.2 MPI_Comm_size

This routine determines the size of the group associated with a communicator. The input parameter is comm, the handle for the communicator; the output parameters are size, the number of processes in the group of comm (integer), and the Fortran-only IERROR, providing the error status as integer.

A communicator is effectively a collection of processes that can send messages to each other. Within programs, many communications also depend on the number of processes executing the program.

The syntax for MPI_Comm_size() is as follows for C, Fortran, and C++.

C Syntax
#include <mpi.h>
int MPI_Comm_size(MPI_Comm comm, int *size)

Fortran Syntax
INCLUDE 'mpif.h'
MPI_COMM_SIZE(COMM, SIZE, IERROR)
INTEGER COMM, SIZE, IERROR

C++ Syntax
#include <mpi.h>
int Comm::Get_size() const

3.3.3 MPI_Comm_rank

This routine determines the rank of the calling process within the communicator. The input parameter is comm, the communicator handle; the output parameters are rank, the rank of the calling process (integer), and IERROR, the error status for Fortran. It is common for MPI programs to be written in a manager/worker model, where one process (typically rank 0) acts in a supervisory role, and the other processes act in a computational role.

The syntax for MPI_Comm_rank() is as follows for C, Fortran, and C++.

C Syntax
#include <mpi.h>
int MPI_Comm_rank(MPI_Comm comm, int *rank)

Fortran Syntax
INCLUDE 'mpif.h'
MPI_COMM_RANK(COMM, RANK, IERROR)
INTEGER COMM, RANK, IERROR

C++ Syntax
#include <mpi.h>
int Comm::Get_rank() const
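A minimal sketch of the manager/worker branching described above (illustrative only, not taken from the course materials):

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        /* manager: typically distributes work and collects results */
        printf("manager running on rank 0\n");
    } else {
        /* workers: perform their share of the computation */
        printf("worker running on rank %d\n", rank);
    }
    MPI_Finalize();
    return 0;
}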

3.3.4 MPI_Send

This routine performs a basic send of a message to a nominated destination, in a way appropriate to the standard communication mode of the implementation. The message-passing system takes care of delivery. However, this "appropriate way" means stating various characteristics of the message, just like the post or email: who is sending it, where it's being sent to, what it's about, and so forth.

The input parameters include buf, the initial address of the send buffer; count, an integer of the number of elements; datatype, a handle of the datatype of each send buffer element; dest, an integer rank of the destination; tag, an integer message tag; and comm, the communicator handle. The only output parameter is Fortran's IERROR.

If MPI_Comm represents a community of addressable space, then MPI_Send and MPI_Recv are the envelope, addressing information, and the data. In order for a message to be successfully communicated, the system must append some information to the data that the application program wishes to transmit. This includes the rank of the sender, the receiver, a tag, and the communicator. The source is used to differentiate messages received from different sources; the tag, to distinguish messages from a single process.

The syntax for MPI_Send() is as follows for C, Fortran, and C++.

C Syntax
#include <mpi.h>
int MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)

Fortran Syntax
INCLUDE 'mpif.h'
MPI_SEND(BUF, COUNT, DATATYPE, DEST, TAG, COMM, IERROR)
BUF(*)
INTEGER COUNT, DATATYPE, DEST, TAG, COMM, IERROR

C++ Syntax
#include <mpi.h>
void Comm::Send(const void* buf, int count, const Datatype& datatype, int dest, int tag) const

3.3.5 MPI_Recv

This routine performs a basic receive of a message. The output parameters are buf, the initial address of the receive buffer, and status, the status object; the input parameters include count, the maximum number of elements to receive; datatype, the datatype handle; source, the integer rank of the source; tag, the message tag; and comm, the communicator handle.

The syntax for MPI_Recv() is as follows for C, Fortran, and C++.

C Syntax
#include <mpi.h>
int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status)

Fortran Syntax
INCLUDE 'mpif.h'
MPI_RECV(BUF, COUNT, DATATYPE, SOURCE, TAG, COMM, STATUS, IERROR)
BUF(*)
INTEGER COUNT, DATATYPE, SOURCE, TAG, COMM, STATUS(MPI_STATUS_SIZE), IERROR

C++ Syntax
#include <mpi.h>
void Comm::Recv(void* buf, int count, const Datatype& datatype, int source, int tag) const

The importance of MPI_Send() and MPI_Recv() relates to the nature of process variables, which remain private unless passed by MPI in the communications world.

3.3.6 MPI_Finalize

This routine terminates the MPI execution environment; it should be the last MPI routine called in every MPI program.

The syntax for MPI_Finalize() is as follows for C, Fortran, and C++.

C Syntax
#include <mpi.h>
int MPI_Finalize()

Fortran Syntax
INCLUDE 'mpif.h'
MPI_FINALIZE(IERROR)
INTEGER IERROR

C++ Syntax
#include <mpi.h>
void Finalize()

Whilst the previous mpi-helloworld.c and mpi-helloworld.f90 examples illustrated the use of four of the six core routines of MPI, they did not illustrate the use of the MPI_Recv and MPI_Send routines. The following program, of no greater complexity, does this. There is no need to provide additional explanation of what is happening, as this should be discerned from the routine explanations given. Each program should be compiled with mpicc and mpif90 respectively, submitted with qsub, and the results checked.

Compile with mpicc -o mpi-sendrecv mpi-sendrecv.c, submit with qsub pbs-sendrecv.

#include <stdio.h>
#include "mpi.h"

int main(argc,argv)
int argc;
char *argv[];
{
int myid, numprocs;
int tag, source, destination, count;
int buffer;
MPI_Status status;

MPI_Init(&argc,&argv);
MPI_Comm_size(MPI_COMM_WORLD,&numprocs);
MPI_Comm_rank(MPI_COMM_WORLD,&myid);
tag=1;
source=0;
destination=1;
count=1;
if(myid == source){
buffer=1234;
MPI_Send(&buffer,count,MPI_INT,destination,tag,MPI_COMM_WORLD);
printf("processor %d sent %d\n",myid,buffer);
}
if(myid == destination){
MPI_Recv(&buffer,count,MPI_INT,source,tag,MPI_COMM_WORLD,&status);
printf("processor %d received %d\n",myid,buffer);
}
MPI_Finalize();
return 0;
}

The mpi-sendrecv.f90 program: compile with mpif90 mpi-sendrecv.f90 -o mpi-sendrecv, submit with qsub pbs-sendrecv.

program sendrecv
include "mpif.h"
integer myid, ierr, numprocs
integer tag, source, destination, count
integer buffer
integer status(MPI_STATUS_SIZE)
call MPI_INIT( ierr )
call MPI_COMM_RANK( MPI_COMM_WORLD, myid, ierr )
call MPI_COMM_SIZE( MPI_COMM_WORLD, numprocs, ierr )
tag=1
source=0
destination=1
count=1
if(myid .eq. source)then
buffer=1234
Call MPI_Send(buffer, count, MPI_INTEGER, destination,&
tag, MPI_COMM_WORLD, ierr)
write(*,*)"processor ",myid," sent ",buffer
endif
if(myid .eq. destination)then
Call MPI_Recv(buffer, count, MPI_INTEGER, source,&
tag, MPI_COMM_WORLD, status, ierr)
write(*,*)"processor ",myid," received ",buffer
endif
call MPI_FINALIZE(ierr)
stop
end

The following provides a summary of the use of the six core routines in C and Fortran.

Include header files:
C:       #include "mpi.h"
Fortran: INCLUDE 'mpif.h'

Initialise MPI:
C:       int MPI_Init(int *argc, char ***argv)
Fortran: INTEGER IERROR
         CALL MPI_INIT(IERROR)

Determine number of processes within a communicator:
C:       int MPI_Comm_size(MPI_Comm comm, int *size)
Fortran: INTEGER COMM, SIZE, IERROR
         CALL MPI_COMM_SIZE(COMM, SIZE, IERROR)

Determine processor rank within a communicator:
C:       int MPI_Comm_rank(MPI_Comm comm, int *rank)
Fortran: INTEGER COMM, RANK, IERROR
         CALL MPI_COMM_RANK(COMM, RANK, IERROR)

Send a message:
C:       int MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)
Fortran: BUF(*)
         INTEGER COUNT, DATATYPE, DEST, TAG, COMM, IERROR
         CALL MPI_SEND(BUF, COUNT, DATATYPE, DEST, TAG, COMM, IERROR)

Receive a message:
C:       int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status)
Fortran: BUF(*)
         INTEGER COUNT, DATATYPE, SOURCE, TAG, COMM, STATUS, IERROR
         CALL MPI_RECV(BUF, COUNT, DATATYPE, SOURCE, TAG, COMM, STATUS, IERROR)

Exit MPI:
C:       int MPI_Finalize()
Fortran: CALL MPI_FINALIZE(IERROR)

4.0 Intermediate MPI Programming

4.1 MPI Datatypes

Like C and Fortran (and indeed, almost every programming language that comes to mind), MPI has datatypes, a classification for identifying different types of data (such as real, int, float, char, etc.). In the introductory MPI program there wasn't really much complexity in these types; as one delves deeper, however, more will be encountered. Forewarned is forearmed, so the following provides a handy comparison chart between MPI, C, and Fortran.


MPI DATATYPE             FORTRAN DATATYPE

    MPI_INTEGER INTEGER

    MPI_REAL REAL

    MPI_DOUBLE_PRECISION DOUBLE PRECISION

    MPI_COMPLEX COMPLEX

    MPI_LOGICAL LOGICAL

    MPI_CHARACTER CHARACTER

    MPI_BYTE

    MPI_PACKED

MPI DATATYPE             C Datatype

    MPI_CHAR signed char

    MPI_SHORT signed short int

    MPI_LONG signed long int

    MPI_UNSIGNED_CHAR unsigned char

    MPI_UNSIGNED_SHORT unsigned short int

    MPI_UNSIGNED unsigned int

    MPI_UNSIGNED_LONG unsigned long int

    MPI_FLOAT float

    MPI_DOUBLE double

    MPI_LONG_DOUBLE long double

    MPI_BYTE

    MPI_PACKED

4.2 Intermediate Routines

In the Intermediate course one of the last exercises involved the submission of mpi-ping and mpi-pong. The first simply tested whether a connection existed between multiple processors. The second program tested different packet sizes, asynchronous and bidirectional. In this example there is ping_pong.c, from the University of Edinburgh Parallel Computing Centre, and a Fortran 90 version of the same from Colorado University. The usual methods can be used for compiling and submitting these programs, e.g.,

mpicc -o mpi-pingpong mpi-pingpong.c or mpif90 mpi-pingpong.f90 -o mpi-pingpong and
qsub pbs-pingpong

However for this course the interesting component is what is inside the code in terms of the MPI routines. As previously there are the mpi.h include files, the initialisation routines, the establishment of a communications world and so forth. In addition however there are some new routines, specifically MPI_Wtime, MPI_Abort, and MPI_Ssend.

4.2.1 MPI_Wtime()

MPI_Wtime() returns the elapsed wall-clock time in seconds on the calling processor. The syntax for MPI_Wtime() is as follows for C, Fortran, and C++.


C Syntax
#include <mpi.h>
double MPI_Wtime()

Fortran Syntax
INCLUDE 'mpif.h'
DOUBLE PRECISION MPI_WTIME()

C++ Syntax
#include <mpi.h>
double MPI::Wtime()
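As a hedged illustration (not from the course materials), the following complete program times a placeholder workload with two calls to MPI_Wtime() and prints the difference:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    double start, end;
    long i, sum = 0;

    MPI_Init(&argc, &argv);
    start = MPI_Wtime();            /* wall-clock time, in seconds */
    for (i = 0; i < 10000000; i++)  /* placeholder workload */
        sum += i;
    end = MPI_Wtime();
    printf("elapsed: %f seconds (sum %ld)\n", end - start, sum);
    MPI_Finalize();
    return 0;
}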

4.2.2 MPI_Abort()

MPI_Abort() terminates all processes associated with the given communicator. The syntax for MPI_Abort() is as follows for C, Fortran, and C++.


C Syntax
#include <mpi.h>
int MPI_Abort(MPI_Comm comm, int errorcode)

Fortran Syntax
INCLUDE 'mpif.h'
MPI_ABORT(COMM, ERRORCODE, IERROR)
INTEGER COMM, ERRORCODE, IERROR

C++ Syntax
#include <mpi.h>
void Comm::Abort(int errorcode)
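A minimal sketch of MPI_Abort() in use (assuming MPI_Init has already been called and stdio.h is included): if a requirement of at least two processes is not met, every process in the communicator is terminated with a non-zero error code.

int numprocs;
MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
if (numprocs < 2) {
    fprintf(stderr, "this example requires at least two processes\n");
    MPI_Abort(MPI_COMM_WORLD, 1);  /* tears down all processes in the communicator */
}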

4.2.3 MPI_Ssend()

MPI_Ssend() is a synchronous blocking send: it does not complete until the matching receive has started, so it is the safest choice when confirmation of receipt matters. Otherwise, MPI_Send is the more flexible option.

The available input parameters include buf, the initial address of the send buffer; count, a non-negative integer of the number of elements in the send buffer; datatype, the datatype of each send buffer element as a handle; dest, the integer rank of the destination; tag, a message tag represented as an integer; and comm, the communicator handle. The only output parameter is Fortran's IERROR.

The syntax for MPI_Ssend() is as follows for C, Fortran, and C++.

C Syntax
#include <mpi.h>
int MPI_Ssend(void *buf, int count, MPI_Datatype datatype, int dest,
int tag, MPI_Comm comm)

Fortran Syntax
INCLUDE 'mpif.h'
MPI_SSEND(BUF, COUNT, DATATYPE, DEST, TAG, COMM, IERROR)
BUF(*)
INTEGER COUNT, DATATYPE, DEST, TAG, COMM, IERROR

C++ Syntax
#include <mpi.h>
void Comm::Ssend(const void* buf, int count, const Datatype&
datatype, int dest, int tag) const

4.2.4 Other Send and Recv Routines

Although not used in the specific program just illustrated, there are actually a number of other send options for Open MPI. These include MPI_Bsend, MPI_Rsend, MPI_Isend, MPI_Ibsend, MPI_Issend, and MPI_Irsend. These are worth mentioning in summary as follows.

MPI_Isend begins a non-blocking send, which indicates to the system that it may start copying data out of the send buffer. The send request can be determined as being completed by calling MPI_Wait, MPI_Waitany, MPI_Test, or MPI_Testany with the request returned by this function. The send buffer cannot be reused until one of these calls succeeds, or an MPI_Request_free indicates that the buffer is available.

The syntax for MPI_Isend() is as follows for C, Fortran, and C++.

C Syntax
#include <mpi.h>
int MPI_Isend(void *buf, int count, MPI_Datatype datatype, int dest,
int tag, MPI_Comm comm, MPI_Request *request)

Fortran Syntax
INCLUDE 'mpif.h'
MPI_ISEND(BUF, COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERROR)
BUF(*)
INTEGER COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERROR

C++ Syntax
#include <mpi.h>
Request Comm::Isend(const void* buf, int count, const
Datatype& datatype, int dest, int tag) const

MPI_Irecv begins a non-blocking receive. The syntax for MPI_Irecv() is as follows for C and Fortran.

C Syntax
#include <mpi.h>
int MPI_Irecv(void *buf, int count, MPI_Datatype datatype, int source,
int tag, MPI_Comm comm, MPI_Request *request)

Fortran Syntax
INCLUDE 'mpif.h'
MPI_IRECV(BUF, COUNT, DATATYPE, SOURCE, TAG, COMM, REQUEST,
IERROR)
BUF(*)
INTEGER COUNT, DATATYPE, SOURCE, TAG, COMM, REQUEST, IERROR

MPI_Wait waits for a non-blocking send or receive to complete. The syntax for MPI_Wait() is as follows for C and C++.

C Syntax
#include <mpi.h>
int MPI_Wait(MPI_Request *request, MPI_Status *status)

C++ Syntax
#include <mpi.h>
void Request::Wait(Status& status)
void Request::Wait()
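Tying these routines together, here is a hedged sketch of the common exchange pattern: each of two processes posts a non-blocking receive before sending, then waits on the request, which avoids the deadlock that two simultaneous blocking sends can cause.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, other, sendval, recvval;
    MPI_Request request;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    other = 1 - rank;                /* assumes exactly two processes */
    sendval = (rank + 1) * 100;

    /* post the receive first so neither process can block the other */
    MPI_Irecv(&recvval, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &request);
    MPI_Send(&sendval, 1, MPI_INT, other, 0, MPI_COMM_WORLD);
    MPI_Wait(&request, &status);     /* recvval is safe to read after this */

    printf("rank %d received %d\n", rank, recvval);
    MPI_Finalize();
    return 0;
}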


A Summary of Some Other MPI Send/Receive Modes

MPI_Send()  Standard send. May be synchronous or buffering.
            Benefits: flexible trade-off; automatically uses a buffer if available, but falls back to synchronous if not.
            Problems: can hide deadlocks; uncertainty of type makes debugging harder.

MPI_Ssend() Synchronous send. Doesn't return until the receive has also completed.
            Benefits: safest mode; confident that the message has been received.
            Problems: lower performance, especially without non-blocking communication.

MPI_Bsend() Buffered send. Copies data to a buffer; the program is free to continue whilst the message is delivered later.
            Benefits: good performance, but need to be aware of buffer space.
            Problems: buffer management issues.

MPI_Rsend() Ready send. The matching receive must already be posted or the message is lost.
            Benefits: slight performance increase since there's no handshake.
            Problems: risky and difficult to design.

As described previously, the arguments dest and source in the various modes of send are the ranks of the receiving and the sending processes. MPI also allows source to be a "wildcard" through the predefined constants MPI_ANY_SOURCE (to receive from any source) and MPI_ANY_TAG (to receive with any tag). There is no wildcard for dest. Again using the postal analogy, a recipient may be ready to receive a message from anyone, but they can't send a message to just anywhere!
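A brief hedged sketch of the wildcards in use on the root process, which receives one message from each of the other processes in whatever order they arrive and inspects the status structure to see who sent what:

int value, i, numprocs;
MPI_Status status;

MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
for (i = 1; i < numprocs; i++) {
    MPI_Recv(&value, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
             MPI_COMM_WORLD, &status);
    /* the status structure records the actual source and tag */
    printf("received %d from rank %d (tag %d)\n",
           value, status.MPI_SOURCE, status.MPI_TAG);
}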

4.2.5 The Prisoner's Dilemma


The example of the Prisoner's Dilemma (cooperation vs competition) is provided to illustrate how non-blocking communications work. In this example, there are ten rounds between two players, with different payoffs for each. In this particular version the distinction is between cooperation and competition for financial rewards. If both players cooperate they receive $2 for the round. If they both compete, they receive $1 each for the round. But if one adopts a competitive stance and the other a cooperative stance, the competitor receives $3 and the cooperative player nothing.

A serial version of the code is provided (serial-gametheory.c, serial-gametheory.f90). Review it and then attempt a parallel version from the skeleton versions (mpi-skel-gametheory.c, mpi-skel-gametheory.f90). Each process must run one of the players' decision-making, then they both have to transmit their decision to the other, and then update their own tally of the result. Consider using MPI_Send(), MPI_Irecv(), and MPI_Wait(). On completion, review against the solutions provided (mpi-gametheory.c and mpi-gametheory.f90) and submit the tasks with qsub.

4.3 Collective Communications

MPI can also conduct collective communications. These include MPI_Bcast, MPI_Scatter, MPI_Gather, MPI_Reduce, and MPI_Allreduce. A brief summary of their syntax and a description of their effects is provided before a practical example. The basic principle and motivation is that whilst collective communications may provide a performance improvement, they will certainly provide clearer code. Consider the following C snippet of a root processor sending to all:

    if ( 0 == rank ) {

    unsigned int proc_I;

    for ( proc_I=1; proc_I < numProcs; proc_I++ ) {

    MPI_Ssend( &param, 1, MPI_UNSIGNED, proc_I, PARAM_TAG, MPI_COMM_WORLD );

    }

    }

else {
MPI_Recv( &param, 1, MPI_UNSIGNED, 0 /*ROOT*/, PARAM_TAG, MPI_COMM_WORLD, &status );
}

    "eplaced with

    MPI_Bcast( &param, 1, MPI_UNSIGNED, 0/*ROOT*/, MPI_COMM_WORLD );

4.3.1 MPI_Bcast()

MPI_Bcast broadcasts a message from the root process to all processes of the group. The syntax for MPI_Bcast() is as follows for C, Fortran, and C++.

C Syntax
#include <mpi.h>
int MPI_Bcast(void *buffer, int count, MPI_Datatype datatype,
int root, MPI_Comm comm)

Fortran Syntax
INCLUDE 'mpif.h'
MPI_BCAST(BUFFER, COUNT, DATATYPE, ROOT, COMM, IERROR)
BUFFER(*)
INTEGER COUNT, DATATYPE, ROOT, COMM, IERROR

C++ Syntax
#include <mpi.h>
void MPI::Comm::Bcast(void* buffer, int count,
const MPI::Datatype& datatype, int root) const = 0

4.3.2 MPI_Scatter()

MPI_Scatter distributes distinct blocks of data from the root process to each process in the group.


The syntax for MPI_Scatter() is as follows for C, Fortran, and C++.

C Syntax
#include <mpi.h>
int MPI_Scatter(void *sendbuf, int sendcount, MPI_Datatype sendtype,
void *recvbuf, int recvcount, MPI_Datatype recvtype, int root,
MPI_Comm comm)

Fortran Syntax
INCLUDE 'mpif.h'
MPI_SCATTER(SENDBUF, SENDCOUNT, SENDTYPE, RECVBUF, RECVCOUNT, RECVTYPE, ROOT, COMM, IERROR)
SENDBUF(*), RECVBUF(*)
INTEGER SENDCOUNT, SENDTYPE, RECVCOUNT, RECVTYPE, ROOT
INTEGER COMM, IERROR

C++ Syntax
#include <mpi.h>
void MPI::Comm::Scatter(const void* sendbuf, int sendcount,
const MPI::Datatype& sendtype, void* recvbuf,
int recvcount, const MPI::Datatype& recvtype,
int root) const
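As a hedged illustration of the C interface, this fragment scatters one integer from a root array to every process (array size and values are placeholders, and numprocs is assumed not to exceed the array length):

int globalData[8];   /* only meaningful on root; assumes numprocs <= 8 */
int localData, i, rank, numprocs;

MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
if (rank == 0)
    for (i = 0; i < numprocs; i++)
        globalData[i] = i * i;
/* each process, including root, receives one element */
MPI_Scatter(globalData, 1, MPI_INT, &localData, 1, MPI_INT, 0, MPI_COMM_WORLD);
printf("rank %d received %d\n", rank, localData);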

4.3.3 MPI_Gather()

MPI_Gather collects values from every process in the group to the root process.


The input parameters include sendbuf, the address of the send buffer; sendcount, an integer of the number of elements in the send buffer; sendtype, the datatype handle of send buffer elements; recvcount, an integer of the number of elements for any single receive (significant only at root); recvtype, the datatype handle of receive buffer elements (significant only at root); root, the integer rank of the receiving process; and comm, the communicator handle. The output parameters include recvbuf, the address of the receive buffer at root, and the ever-dependable IERROR for Fortran.

The syntax for MPI_Gather() is as follows for C, Fortran, and C++.

C Syntax
#include <mpi.h>
int MPI_Gather(void *sendbuf, int sendcount, MPI_Datatype sendtype,
void *recvbuf, int recvcount, MPI_Datatype recvtype, int root,
MPI_Comm comm)

Fortran Syntax
INCLUDE 'mpif.h'
MPI_GATHER(SENDBUF, SENDCOUNT, SENDTYPE, RECVBUF, RECVCOUNT,
RECVTYPE, ROOT, COMM, IERROR)
SENDBUF(*), RECVBUF(*)
INTEGER SENDCOUNT, SENDTYPE, RECVCOUNT, RECVTYPE, ROOT
INTEGER COMM, IERROR

C++ Syntax
#include <mpi.h>
void MPI::Comm::Gather(const void* sendbuf, int sendcount,
const MPI::Datatype& sendtype, void* recvbuf,
int recvcount, const MPI::Datatype& recvtype, int root)
const = 0
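And the inverse operation, again as a hedged sketch: every process contributes one value and the root collects them in rank order:

int gathered[8];     /* only meaningful on root; assumes numprocs <= 8 */
int localValue, rank, numprocs;

MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
localValue = rank * 10;   /* placeholder per-process result */
MPI_Gather(&localValue, 1, MPI_INT, gathered, 1, MPI_INT, 0, MPI_COMM_WORLD);
if (rank == 0) {
    int i;
    for (i = 0; i < numprocs; i++)
        printf("rank %d contributed %d\n", i, gathered[i]);
}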

4.3.4 MPI_Reduce()

MPI_Reduce combines values from all processes into a single result on the root process.


The input parameters include sendbuf, the address of the send buffer; count, an integer number of elements in the send buffer; datatype, a handle of the datatype of elements in the send buffers; op, a handle of the reduce operation; root, the integer rank of the root process; and comm, the communicator handle. The output parameters are recvbuf, the address of the receive buffer for root, and Fortran's IERROR.

    +he syntax for =6IO"educe8: is as follows for 7, Fortran, and 7LL.

    C S9ntax

    #include

    int MPI_Reduce(void *sendbuf, void *recvbuf, int count,

    MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm)

    Fortran S9ntax

    INCLUDE mpif.h

    MPI_REDUCE(SENDBUF, RECVBUF, COUNT, DATATYPE, OP, ROOT, COMM,

    IERROR)

    SENDBUF(*), RECVBUF(*)

    INTEGER COUNT, DATATYPE, OP, ROOT, COMM, IERROR

    C>> S9ntax

    #include

    void MPI::Intracomm::Reduce(const void* sendbuf, void* recvbuf,

    int count, const MPI::Datatype& datatype, const MPI::Op& op,

    int root) const

MPI reduction operations include the following:

MPI Name      Function
MPI_MAX       Maximum
MPI_MIN       Minimum
MPI_SUM       Sum
MPI_PROD      Product
MPI_LAND      Logical AND
MPI_BAND      Bitwise AND
MPI_LOR       Logical OR
MPI_BOR       Bitwise OR
MPI_LXOR      Logical exclusive OR
MPI_BXOR      Bitwise exclusive OR
MPI_MAXLOC    Maximum and location
MPI_MINLOC    Minimum and location
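For instance, a hedged sketch of a global sum: each process contributes its rank, and MPI_SUM combines the contributions into a single total on the root:

int rank, localValue, globalSum;

MPI_Comm_rank(MPI_COMM_WORLD, &rank);
localValue = rank;   /* placeholder per-process contribution */
MPI_Reduce(&localValue, &globalSum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
if (rank == 0)
    printf("sum of all ranks: %d\n", globalSum);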

4.3.5 Other Collective Communications

Other collective communications include MPI_Allreduce, which combines values from all processes and returns the result to every process rather than only to the root.


4.4 Derived Data Types

Often the data to be communicated is not a contiguous block of a single type; consider sending the first element of each row of a two-dimensional array. The program could send the data one element at a time, e.g.,

double results[5][5];
int i;
for ( i = 0; i < 5; i++ ) {
MPI_Send( &(results[i][0]), 1, MPI_DOUBLE, dest, tag, comm );
}

But this has overhead; message passing is always (relatively) expensive. So instead, a datatype can be created that informs MPI how the data is stored, so it can be sent in one routine.

To create a derived type there are two steps: first construct the datatype with MPI_Type_vector() or MPI_Type_struct(), and then commit the datatype with MPI_Type_commit().

When all the data to send is of the same type, use the vector method, e.g.,

int MPI_Type_vector( int count, int blocklen, int stride, MPI_Datatype old_type,
MPI_Datatype* newtype )

/* Send the first double of each of the 5 rows */
MPI_Datatype newType;
double results[5][5];

MPI_Type_vector( 5, 1, 5, MPI_DOUBLE, &newType );
MPI_Type_commit( &newType );
MPI_Ssend( &(results[0][0]), 1, newType, dest, tag, comm );

Note that when sending a vector, the data on the receiving processor may be of a different type, e.g.:

    double recvData[COUNT*BLOCKLEN];

    double sendData[COUNT][STRIDE];

    MPI_Datatype vecType;

    MPI_Status st;


    MPI_Type_vector( COUNT, BLOCKLEN, STRIDE, MPI_DOUBLE, &vecType );

    MPI_Type_commit( &vecType );

    if( rank == 0 )

    MPI_Send( &(sendData[0][0]), 1, vecType, 1, tag, comm );

    else

    MPI_Recv( recvData, COUNT*BLOCKLEN, MPI_DOUBLE, 0, tag, comm, &st );

If you have specific parts of a struct you wish to send and the members are of different types, use the struct datatype:

int MPI_Type_struct( int count, int blocklen[], MPI_Aint indices[],
MPI_Datatype old_types[], MPI_Datatype* newtype )

For example:

/* Send the Packet structure in a message */
struct Packet {
int a;
double array[3];
char b[10];
};
struct Packet dataToSend;

Another example:

    int blockLens[3] = { 1, 3, 10 };

    MPI_Aint intSize, doubleSize;

    MPI_Aint displacements[3];

    MPI_Datatype types[3] = { MPI_INT, MPI_DOUBLE, MPI_CHAR };

    MPI_Datatype myType;

    MPI_Type_extent( MPI_INT, &intSize ); //# of bytes in an int

    MPI_Type_extent( MPI_DOUBLE, &doubleSize ); // double

    displacements[0] = (MPI_Aint) 0;

    displacements[1] = intSize;

    displacements[2] = intSize + ((MPI_Aint) 3 * doubleSize);


    MPI_Type_struct( 3, blockLens, displacements, types, &myType );

    MPI_Type_commit( &myType );

    MPI_Ssend( &dataToSend, 1, myType, dest, tag, comm );

There are actually other functions for creating derived types:

    MPI_Type_contiguous

    MPI_Type_hvector

    MPI_Type_indexed

    MPI_Type_hindexed

In many applications, the size of a message to receive is unknown before it is received (e.g. the number of particles moving between domains). MPI has a way of dealing with this elegantly. First, the receive side calls MPI_Probe before actually receiving:

int MPI_Probe( int source, int tag, MPI_Comm comm, MPI_Status *status )

It can then examine the status, and find the length using:

int MPI_Get_count( MPI_Status *status, MPI_Datatype datatype, int *count )

Then the application can dynamically allocate the receive buffer, and call MPI_Recv.
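Putting the three steps together, a hedged sketch of the receiving side, with the sending rank and tag as placeholders (stdlib.h assumed included for malloc/free):

int count;
double *buffer;
MPI_Status status;

MPI_Probe(0, 0, MPI_COMM_WORLD, &status);           /* 1. wait for a message */
MPI_Get_count(&status, MPI_DOUBLE, &count);         /* 2. how many doubles? */
buffer = (double *) malloc(count * sizeof(double)); /* 3. allocate and receive */
MPI_Recv(buffer, count, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);
/* ... use buffer ... */
free(buffer);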

4.5 Particle Advector

The particle advector hands-on exercise consists of two parts.

The first example is designed to gain familiarity with the MPI_Scatter() routine as a means of distributing global arrays among multiple processors via collective communication. Use the skeleton code provided and determine the number of particles to assign to each processor. Then use the function MPI_Scatter() to spread the global particle coordinates, ids and tags among the processors.


For a more advanced test, on the root processor only, calculate the particle with the smallest distance from the origin (hint: MPI_Reduce()). If the particle with the smallest distance is < 1.0 from the origin, then flip the direction of movement of all the particles. Then modify your code to use the MPI_Scatterv() function, as sketched after the signature below, to allow the given number of particles to be properly distributed among a variable number of processors.

    int MPI_Scatterv (

    void *sendbuf,

    int *sendcnts,

    int *displs,

    MPI_Datatype sendtype,

    void *recvbuf,

    int recvcnt,

    MPI_Datatype recvtype,

    int root,

    MPI_Comm comm )
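As a hedged sketch (variable names are illustrative, not from the skeleton code), the sendcnts and displs arrays might be built on the root as follows, spreading any remainder over the first few ranks:

int *sendcnts = malloc(numprocs * sizeof(int));
int *displs   = malloc(numprocs * sizeof(int));
int base = nParticles / numprocs;   /* minimum particles per process */
int rem  = nParticles % numprocs;   /* leftover particles */
int i, offset = 0;

for (i = 0; i < numprocs; i++) {
    sendcnts[i] = base + (i < rem ? 1 : 0);
    displs[i]   = offset;           /* element offset into the global array */
    offset += sendcnts[i];
}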

The second example is designed to give a practical example of the use of MPI derived data types. Implement a data type storing the particle information from the previous exercise and use this data type for collective communications. Set up and commit a new MPI derived data type, based on the struct below:

typedef struct Particle {
unsigned int globalId;
unsigned int tag;
Coord coord;
} Particle;

Hint: MPI_Type_struct(), MPI_Type_commit()

Then seed the random number sequence on the root processor only, and determine how many particles are to be assigned among the respective processors (same as for the last exercise) and collectively assign their data using the MPI derived data type you have implemented.
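A hedged sketch of such a commit, assuming Coord is a struct of three doubles; the displacements follow the MPI_Type_extent pattern shown earlier, and in a real code they should be checked against compiler padding:

int blockLens[3] = { 1, 1, 3 };
MPI_Aint uintSize, displacements[3];
MPI_Datatype types[3] = { MPI_UNSIGNED, MPI_UNSIGNED, MPI_DOUBLE };
MPI_Datatype particleType;

MPI_Type_extent( MPI_UNSIGNED, &uintSize );  /* # of bytes in an unsigned int */
displacements[0] = (MPI_Aint) 0;             /* globalId */
displacements[1] = uintSize;                 /* tag */
displacements[2] = 2 * uintSize;             /* coord (assumed three doubles) */
MPI_Type_struct( 3, blockLens, displacements, types, &particleType );
MPI_Type_commit( &particleType );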


4.6 Creating A New Communicator

Each communicator has associated with it a group of ranked processes. Before creating a new communicator, we must first create a group for it. Create a new group by eliminating processes from an existing group:

    MPI_Group worldGroup, subGroup;

    MPI_Comm subComm;

    int *procsToExcl, numToExcl;

    MPI_Comm_group( MPI_COMM_WORLD, &worldGroup );

    MPI_Group_excl( worldGroup, numToExcl, procsToExcl, &subGroup );

    MPI_Comm_create( MPI_COMM_WORLD, subGroup, &subComm );
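A brief usage note, beyond the snippet above: processes that were excluded receive MPI_COMM_NULL rather than a valid communicator, so subsequent calls should be guarded, and the group and communicator handles can be freed when finished.

if ( subComm != MPI_COMM_NULL ) {
    int subRank;
    MPI_Comm_rank( subComm, &subRank );  /* rank within the new communicator */
    /* ... collective operations on subComm ... */
    MPI_Comm_free( &subComm );
}
MPI_Group_free( &subGroup );
MPI_Group_free( &worldGroup );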

4.7 Profiling Parallel Programs

Parallel performance issues include the following:

• Coverage: the percentage of the code that is parallel
• Granularity: the amount of work in each parallel section
• Load balancing
• Locality: the communication structure
• Synchronisation: locking latencies

Since the performance of parallel programs is dependent on so many issues, profiling parallel programs is an inherently difficult task.

TAU (Tuning and Analysis Utilities) is a portable profiling and tracing toolkit for performance analysis of parallel programs written in Java, C, C++ and Fortran.


The steps involved in profiling parallel code are outlined as follows:

• Instrument the source code with TAU macros
• Compile the instrumented code
• Run the program to produce profile.* files for each separate process

The instrumentation of source code can be done manually or with the help of another utility called PDT, which automatically parses source files and instruments them with TAU macros.

4.8 Debugging MPI Applications

It has taken many years for this essential truth to be realised, but software equals bugs. In parallel systems, the bugs are particularly difficult to diagnose, and the core principle of parallelisation invites race conditions and deadlocks. For example, what happens when two processors try to send a message to one another at the same time?

When debugging MPI programs it is usually a good idea to do this in one's own environment, i.e., install (from source) the compilers and version of Open MPI on your own system. The reason for this is that it is quite time-prohibitive to conduct debugging activities on a batch-processing high-performance computer. The HPC systems that we have may run tasks fairly quickly when launched, but they can take some time to begin whilst they are in the queue.

DO NOT RUN JOBS ON THE HEAD NODE
REALLY, DO NOT RUN MULTICORE JOBS ON THE HEAD NODE

It is possible, for small tests, to bypass this by running small jobs interactively (following the instructions given in the Intermediate course), e.g.,


    qsub -l walltime=0:30:0,nodes=1:ppn=2 -I

    module load vpac

    qsub pbs-sendrecv

In general, however, parallel programs are hard to program and hard to debug. Parallelism adds a whole new abstract layer. Although the program is being executed on multiple processors, it may be running in slightly different ways on different data.

Although time-consuming, it is usually appropriate to build the code in serial first to the point that it's working, and working well. As part of this process use version control systems, and engage in unit testing (check each functional component of the code independently) and integration testing (check the interfaces between components) as part of this development. Use standard methods for these tests, such as the use of mid-range, boundary, and out-of-bounds variables.

Because parallelism adds a new level of abstraction, producing a serial version of a code before producing a parallel version is not unlike producing pseudo-code for a serial program. Time and time again it has been shown that modelling significantly improves the quality of a program and reduces errors, thus saving time in the longer run. In the process of engaging in such modelling, developing a defensive style of programming is effective, for example engaging in techniques that prevent deadlocks, or keeping in consideration the state of a condition when running loops or if-else statements. When conducting actual tests on the code, tactically placed printf or write statements will assist.

For example, consider the following simple send-recv programs; compile these with openmpi-gcc as follows:

    module load openmpi-gcc


    mpicc -g mpi-debug.c -o mpi-debug or

mpif90 -g mpi-debug.f90 -o mpi-debug

    qsub -l walltime=0:20:0,nodes=1:ppn=2 -I

    module load vpac

    module load valgrind/3.8.1-openmpi-gcc

Note that an interactive job starts the user in their home directory, requiring a change of directories.

mpiexec with 2 processors is then launched with valgrind debugging the executable, with error output redirected to valgrind.out:

mpiexec -np 2 valgrind ./mpi-debug 2> valgrind.out

Valgrind is a debugging suite that automatically detects many memory management and threading bugs. Whilst typically built for serial applications, it can also be built with the mpicc wrappers, though currently only for GNU GCC or Intel's C++ compiler. It is important to use the same compiler in both the build and the Valgrind test.

The file valgrind.out in this case will contain quite a few errors, but none of these are critical to the operation of the program.

As with serial programs, gdb can also be used for thorough debugging. Execute as:

mpiexec -np [number of processors] gdb ./executable --command=gdb.cmd

Where gdb.cmd is a text file of the commands that you want to send to gdb, e.g.:

module load gdb
mpiexec -np 2 gdb --exec=mpi-debug --command=gdb.cmd

Which should generate a result something like the following:

[lev@trifid166 advancedhpc]$ mpiexec -np 2 gdb --command=gdb.cmd mpi-debug

(license information omitted)

    Reading symbols from /nfs/user2/lev/programming/advancedhpc/mpi-debug...(no

    debugging symbols found)...done.

    Reading symbols from /nfs/user2/lev/programming/advancedhpc/mpi-debug...(no

    debugging symbols found)...done.

    [Thread debugging using libthread_db enabled]

    Using host libthread_db library "/lib64/libthread_db.so.1".

    [Thread debugging using libthread_db enabled]

    Using host libthread_db library "/lib64/libthread_db.so.1".

    [New Thread 0x2aaaad875700 (LWP 19784)]

    [New Thread 0x2aaaad875700 (LWP 19785)]

    [New Thread 0x2aaaadc8b700 (LWP 19786)]

    [New Thread 0x2aaaadc8b700 (LWP 19787)]

    processor 0 final value: 324 with loop # 68

    processor 1 final value: 2346 with loop # 68

    [Thread 0x2aaaadc8b700 (LWP 19786) exited]

[Thread 0x2aaaad875700 (LWP 19784) exited]
[Thread 0x2aaaadc8b700 (LWP 19787) exited]

    [Thread 0x2aaaad875700 (LWP 19785) exited]

    [Inferior 1 (process 19776) exited normally]

    [Inferior 1 (process 19777) exited normally]


This, of course, simply indicates that the program completed successfully with the final values as listed (hooray!). Using a serial debugger with a program that is running in parallel is slightly more difficult. A common hack (and it is a hack) is to find out what process IDs the job is running as, then to log in to the appropriate node and run gdb -p PID. In order to discover these, the following code snippet is usually implemented:

    {

    int i = 0;

    char hostname[256];

    gethostname(hostname, sizeof(hostname));

    printf("PID %d on %s ready for attach\n", getpid(), hostname);

    fflush(stdout);

    while (0 == i)

    sleep(5);

    }

Then at job submission those PIDs will be displayed. For example,

    [lev@trifid166 advancedhpc]$ mpiexec -np 2 mpi-debug

    PID 23166 on trifid166 ready for attach

    PID 23167 on trifid166 ready for attach

Then log in to the appropriate nodes, run gdb -p 23166 and gdb -p 23167, and step through the function stack and set the variable to a non-zero value, e.g.,

    (gdb) set var i = 7

Then set a breakpoint after your block of code and continue execution until the breakpoint is hit (e.g., by adding break in the loops on lines 49 and 64), and use the gdb commands to display the values as they are being generated (e.g., print loop, print value, or info locals).


    mailto:[email protected]:[email protected]