Class 4 - MPI Tutorial



    Beginner MPI Tutorial

    Welcome to the MPI tutorial for beginners! In this tutorial, you will learn all of the

    basic concepts of MPI by going through various examples. The different parts of the

    tutorial are meant to build on top of one another. If you feel lost during any lesson, feel
    free to leave a comment on the post explaining your dilemma. Either I or another
    MPI expert will likely be able to get back to you soon.

    This beginning tutorial assumes that the reader has a general knowledge of how parallel

    programming works, has experience with working on a Linux system, and can also

    understand the C programming language.

    Introduction

    MPI Introduction
    Installing MPICH2
    Running an MPI Hello World Application

    Blocking Point-to-Point Communication

    MPI Send and Receive
    Dynamic Receiving with MPI_Probe (and MPI_Status)
    Point-to-Point Communication Application Example - Random Walk

    Collective Communication

    MPI Broadcast and Collective Communication
    MPI Scatter, Gather, and Allgather

    MPI Introduction

    The Message Passing Interface (MPI) first appeared as a standard in 1994 for

    performing distributed-memory parallel computing. Since then, it has become the

    dominant model for high-performance computing, and it is used widely in research,

    academia, and industry.

    The functionality of MPI is extremely rich, offering the programmer the ability to

    perform: point-to-point communication, collective communication, one-sided

    communication, parallel I/O, and even dynamic process management. These terms

    probably sound quite strange to a beginner, but by the end of all of the tutorials, the

    terminology will be commonplace.

    Before starting the tutorials, familiarize yourself with the basic concepts below. These

    are all related to MPI, and many of these concepts are referred to throughout the

    tutorials.

    The Message Passing Model


    The message passing model is a model of parallel programming in which processes can

    only share data by messages. MPI adheres to this model. If one process wishes to

    transfer data to another, it must initiate a message and explicitly send data to that

    process. The other process will also have to explicitly receive the message (except

    in the case of one-sided communication, but we will get to that later).

    Forcing communication to happen in this way offers several advantages for parallel

    programs. For example, the message passing model is portable across a wide range of

    architectures. An MPI program can run across computers that are spread across the

    globe and connected by the internet, or it can execute on tightly-coupled clusters. An

    MPI program can even run on the cores of a shared-memory processor and pass

    messages through the shared memory. All of these details are abstracted by the

    interface. The debugging of these programs is often easier too, since one does not need

    to worry about processes overwriting the address space of another.

    MPI's Design for the Message Passing Model

    MPI has a couple classic concepts that encourage clear parallel program design using

    the message passing model. The first is the notion of a communicator. A communicator

    defines a group of processes that have the ability to communicate with one another. In

    this group of processes, each is assigned a unique rank, and they explicitly

    communicate with one another by their ranks.

    The foundation of communication is built upon the simple send and receive operations.

    A process may send a message to another process by providing the rank of the process

    and a unique tag to identify the message. The receiver can then post a receive for a

    message with a given tag (or it may not even care about the tag), and then handle the

    data accordingly. Communications such as this, which involve one sender and one
    receiver, are known as point-to-point communications.

    There are many cases where processes may need to communicate with everyone else.
    For example, when a master process needs to broadcast information to all of its worker

    processes. In this case, it would be cumbersome to write code that does all of the sends

    and receives. In fact, it would often not use the network in an optimal manner. MPI can

    handle a wide variety of these types of collective communications that involve all

    processes.

    Mixtures of point-to-point and collective communications can be used to create highly
    complex parallel programs. In fact, this functionality is so powerful that it is not even

    necessary to start describing the advanced mechanisms of MPI. We will save that until a

    later lesson. For now, you should work on installing MPI on your machine. If you

    already have MPI installed, great! You can head over to the MPI Hello World lesson.

    Installing MPICH2

    MPI is simply a standard which others follow in their implementation. Because of this,

    there are a wide variety of MPI implementations out there. One of the most popular

    implementations, MPICH2, will be used for all of the examples provided through this
    site. Users are free to use any implementation they wish, but only instructions for


    installing MPICH2 will be provided. Furthermore, the scripts and code provided for the

    lessons are only guaranteed to execute and run with the latest version of MPICH2.

    MPICH2 is a widely-used implementation of MPI that is developed primarily by

    Argonne National Laboratory in the United States. The main reason for choosing

    MPICH2 over other implementations is simply because of my familiarity with the
    interface and because of my close relationship with Argonne National Laboratory. I also

    encourage others to check out OpenMPI, which is also a widely-used implementation.

    Installing MPICH2

    The latest version of MPICH2 is available here. The version that I will be using for all

    of the examples on the site is 1.4, which was released June 16, 2011. Go ahead and

    download the source code, uncompress the folder, and change into the MPICH2

    directory.

    Once doing this, you should be able to configure your installation by performing

    ./configure. I added a couple of parameters to my configuration to avoid building the
    MPI Fortran library. If you need to install MPICH2 to a local directory (for example, if
    you don't have root access to your machine), type
    ./configure --prefix=/installation/directory/path. For more information about possible
    configuration parameters, type ./configure --help.

    When configuration is done, it should say "Configuration completed." Once this is

    through, it is time to build and install MPICH2 with make; sudo make install.

    If your build was successful, you should be able to type mpich2version and see

    something similar to this.


    Hopefully your build finished successfully. If not, you may have issues with missing
    dependencies. For any issue, I highly recommend copying and pasting the error

    message directly into Google.

    Running an MPI Program

    Now that you have installed MPICH, whether it's on your local machine or cluster, it is
    time to run a simple application. The MPI Hello World lesson goes over the basics of an
    MPI program, along with a guide on how to run MPICH2 for the first time.

    MPI Hello World

    In this lesson, I will show you a basic MPI Hello World application and also discuss

    how to run an MPI program. The lesson will cover the basics of initializing MPI and

    running an MPI job across several processes. This lesson is intended to work with

    installations of MPICH2 (specifically 1.4). If you have not installed MPICH2, please

    refer back to the installing MPICH2 lesson.

    MPI Hello World

    First of all, the source code for this lesson can be downloaded here. Download it, extract

    it, and change to the example directory. The directory should contain three files:

    makefile, mpi_hello_world.c, and run.perl.

    Open the mpi_hello_world.c source code. Below are some excerpts from the code.


    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char** argv) {
        // Initialize the MPI environment
        MPI_Init(NULL, NULL);

        // Get the number of processes
        int world_size;
        MPI_Comm_size(MPI_COMM_WORLD, &world_size);

        // Get the rank of the process
        int world_rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

        // Get the name of the processor
        char processor_name[MPI_MAX_PROCESSOR_NAME];
        int name_len;
        MPI_Get_processor_name(processor_name, &name_len);

        // Print off a hello world message
        printf("Hello world from processor %s, rank %d"
               " out of %d processors\n",
               processor_name, world_rank, world_size);

        // Finalize the MPI environment.
        MPI_Finalize();
    }

    You will notice that the first step to building an MPI program is including the MPI

    header files with #include <mpi.h>. After this, the MPI environment must be
    initialized with MPI_Init(NULL, NULL). During MPI_Init, all of MPI's global and
    internal variables are constructed. For example, a communicator is formed around all of
    the processes that were spawned, and unique ranks are assigned to each process.
    Currently, MPI_Init takes two arguments that are not necessary, and the extra

    parameters are simply left as extra space in case future implementations might need

    them.

    After MPI_Init, there are two main functions that are called. These two functions are

    used in almost every single MPI program that you will write.

    MPI_Comm_size(MPI_Comm communicator, int* size) - Returns the size of a
    communicator. In our example, MPI_COMM_WORLD (which is constructed
    for us by MPI) encloses all of the processes in the job, so this call should return
    the number of processes that were requested for the job.

    MPI_Comm_rank(MPI_Comm communicator, int* rank) - Returns the rank of a
    process in a communicator. Each process inside of a communicator is assigned
    an incremental rank starting from zero. The ranks of the processes are primarily
    used for identification purposes when sending and receiving messages.

    A miscellaneous and less-used function in this program is

    MPI_Get_processor_name(char* name, int* name_length), which can obtain the

    actual name of the processor on which the process is executing. The final call in this

    program, MPI_Finalize(), is used to clean up the MPI environment. No more MPI

    calls can be made after this one.


    Running MPI Hello World

    Now compile the example by typing make. My makefile looks for the MPICC
    environment variable. If you installed MPICH2 to a local directory, set your MPICC
    environment variable to point to your mpicc binary. The mpicc program in your
    installation is really just a wrapper around gcc, and it makes compiling and linking all

    of the necessary MPI routines much easier.

    After your program is compiled, it is ready to be executed. Now comes the part where

    you might have to do some additional configuration. If you are running MPI programs

    on a cluster of nodes, you will have to set up a host file. If you are simply running MPI

    on a laptop or a single machine, disregard the next piece of information.

    The host file contains names of all of the computers on which your MPI job will

    execute. For ease of execution, you should be sure that all of these computers have SSH

    access, and you should also set up an authorized keys file to avoid a password prompt

    for SSH. My host file looks like this.
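    A hosts file is simply a list of machine names, one per line. As a purely illustrative
    example (these hostnames are placeholders rather than the actual machines from my
    cluster), a hosts file for four nodes might look like this:

    node1
    node2
    node3
    node4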

    For the run script that I have provided in the download, you should set an environment

    variable called MPI_HOSTS and have it point to your hosts file. My script will

    automatically include it in the command line when the MPI job is launched. If you do

    not need a hosts file, simply do not set the environment variable. Also, if you have a

    local installation of MPI, you should set the MPIRUN environment variable to point to

    the mpirun binary from the installation. After this, call ./run.perl mpi_hello_world

    to run the example application.

    As expected, the MPI program was launched across all of the hosts in my host file. Each

    process was assigned a unique rank, which was printed off along with the process name.

    As one can see from my example output, the output of the processes is in an arbitrary

    order since there is no synchronization involved before printing.


    Notice how the script called mpirun. This is the program that the MPI implementation uses

    to launch the job. Processes are spawned across all the hosts in the host file and the MPI

    program executes across each process. My script automatically supplies the -n flag to

    set the number of MPI processes to four. Try changing the run script and launching

    more processes! Don't accidentally crash your system though.

    Now you might be asking, "My hosts are actually dual-core machines. How can I get
    MPI to spawn processes across the individual cores first before individual machines?"

    The solution is pretty simple. Just modify your hosts file and place a colon and the

    number of cores per processor after the host name. For example, I specified that each of

    my hosts has two cores.
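    Using the same placeholder hostnames as before, such a hosts file might look like this:

    node1:2
    node2:2
    node3:2
    node4:2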

    When I execute the run script again, voila, the MPI job spawns two processes on only

    two of my hosts.

    Up Next

    Now that you have a basic understanding of how an MPI program is executed, it is now

    time to learn fundamental point-to-point communication routines. In the next lesson, I

    cover basic sending and receiving routines in MPI. Feel free to also examine the

    beginner MPI tutorial for a complete reference of all of the beginning MPI lessons.

    MPI Send and Receive

    Sending and receiving are the two foundational concepts of MPI. Almost every single

    function in MPI can be implemented with basic send and receive calls. In this lesson, I

    will discuss how to use MPI's blocking sending and receiving functions, and I will also

    overview other basic concepts associated with transmitting data using MPI. The code

    for this tutorial is available here.

    Overview of Sending and Receiving with MPI

    MPI's send and receive calls operate in the following manner. First, process A decides a
    message needs to be sent to process B. Process A then packs up all of its necessary data
    into a buffer for process B. These buffers are often referred to as envelopes since the
    data is being packed into a single message before transmission (similar to how letters


    are packed into envelopes before transmission to the post office). After the data is

    packed into a buffer, the communication device (which is often a network) is

    responsible for routing the message to the proper location. The location of the message

    is defined by the process's rank.

    Even though the message is routed to B, process B still has to acknowledge that it wants
    to receive A's data. Once it does this, the data has been transmitted. Process A is then
    acknowledged that the data has been transmitted and may go back to work.

    Sometimes there are cases when A might have to send many different types of messages

    to B. Instead of B having to go through extra measures to differentiate all these

    messages, MPI allows senders and receivers to also specify message IDs with the

    message (known as tags). When process B only requests a message with a certain tag

    number, messages with different tags will be buffered by the network until B is ready

    for them.
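    To make the idea of tags concrete, below is a minimal hypothetical sketch (it is not part
    of this lesson's code download, and the variable names are placeholders). Process zero
    sends two messages with different tags, and process one asks for the tag 1 message first;
    the tag 0 message simply waits until it is requested. Note that the out-of-order receive
    relies on the small tag 0 message being buffered, a point we will revisit when discussing
    deadlock in a later lesson.

    #include <mpi.h>
    #include <stdio.h>

    // Run with exactly two processes.
    int main(int argc, char** argv) {
        MPI_Init(NULL, NULL);
        int world_rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
        int data = 7, config = 42;
        if (world_rank == 0) {
            MPI_Send(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);    // tag 0
            MPI_Send(&config, 1, MPI_INT, 1, 1, MPI_COMM_WORLD);  // tag 1
        } else if (world_rank == 1) {
            // Ask for the tag 1 message first; the tag 0 message is held until requested.
            MPI_Recv(&config, 1, MPI_INT, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Recv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("Received config %d (tag 1) before data %d (tag 0)\n", config, data);
        }
        MPI_Finalize();
        return 0;
    }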

    With these concepts in mind, let's look at the prototypes for the MPI sending and
    receiving functions.

    MPI_Send(void* data, int count, MPI_Datatype datatype, int destination,
             int tag, MPI_Comm communicator)

    MPI_Recv(void* data, int count, MPI_Datatype datatype, int source,
             int tag, MPI_Comm communicator, MPI_Status* status)

    Although this might seem like a mouthful when reading all of the arguments, they

    become easier to remember since almost every MPI call uses similar syntax. The first

    argument is the data buffer. The second and third arguments describe the count and type

    of elements that reside in the buffer. MPI_Send sends the exact count of elements, and
    MPI_Recv will receive at most the count of elements (more on this in the next lesson).

    The fourth and fifth arguments specify the rank of the sending/receiving process and the

    tag of the message. The sixth argument specifies the communicator and the last

    argument (for MPI_Recv only) provides information about the received message.

    Elementary MPI Datatypes

    The MPI_Send and MPI_Recv functions utilize MPI Datatypes as a means to specify

    the structure of a message at a higher level. For example, if the process wishes to send

    one integer to another, it would use a count of one and a datatype of MPI_INT. The

    other elementary MPI datatypes are listed below with their equivalent C datatypes.

    MPI datatype             C equivalent

    MPI_CHAR                 char
    MPI_SHORT                short int
    MPI_INT                  int
    MPI_LONG                 long int
    MPI_LONG_LONG            long long int
    MPI_UNSIGNED_CHAR        unsigned char
    MPI_UNSIGNED_SHORT       unsigned short int
    MPI_UNSIGNED              unsigned int
    MPI_UNSIGNED_LONG        unsigned long int
    MPI_UNSIGNED_LONG_LONG   unsigned long long int
    MPI_FLOAT                float
    MPI_DOUBLE               double
    MPI_LONG_DOUBLE          long double
    MPI_BYTE

    For now, we will only make use of these datatypes in the beginner MPI tutorial. Once

    we have covered enough basics, you will learn how to create your own MPI datatypes

    for characterizing more complex types of messages.

    MPI Send / Recv Program

    The code for this tutorial is available here. Go ahead and download and extract the code.

    I refer the reader back to the MPI Hello World Lesson for instructions on how to use my

    code packages.

    The first example is in send_recv.c. Some of the major parts of the program are shown

    below.

    // Find out rank, size
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    int world_size;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    int number;
    if (world_rank == 0) {
        number = -1;
        MPI_Send(&number, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (world_rank == 1) {
        MPI_Recv(&number, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("Process 1 received number %d from process 0\n",
               number);
    }

    MPI_Comm_rank and MPI_Comm_size are first used to determine the world size along
    with the rank of the process. Then process zero initializes a number to the value of
    negative one and sends this value to process one. As you can see in the else if statement,
    process one is calling MPI_Recv to receive the number. It also prints off the received
    value.

    Since we are sending and receiving exactly one integer, each process requests that one

    MPI_INT be sent/received. Each process also uses a tag number of zero to identify the

    message. The processes could have also used the predefined constant MPI_ANY_TAG

    for the tag number since only one type of message was being transmitted.

    Running the example program looks like this.


    As expected, process one receives negative one from process zero.

    MPI Ping Pong Program

    The next example is a ping pong program. In this example, processes use MPI_Send

    and MPI_Recv to continually bounce messages off of each other until they decide to

    stop. Take a look at ping_pong.c in the example code download. The major portions of

    the code look like this.

    int ping_pong_count = 0;
    int partner_rank = (world_rank + 1) % 2;
    while (ping_pong_count < PING_PONG_LIMIT) {
        if (world_rank == ping_pong_count % 2) {
            // Increment the ping pong count before you send it
            ping_pong_count++;
            MPI_Send(&ping_pong_count, 1, MPI_INT, partner_rank, 0,
                     MPI_COMM_WORLD);
            printf("%d sent and incremented ping_pong_count "
                   "%d to %d\n", world_rank, ping_pong_count,
                   partner_rank);
        } else {
            MPI_Recv(&ping_pong_count, 1, MPI_INT, partner_rank, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("%d received ping_pong_count %d from %d\n",
                   world_rank, ping_pong_count, partner_rank);
        }
    }

    This example is meant to be executed with only two processes. The processes first
    determine their partner with some simple arithmetic. A ping_pong_count is initialized to
    zero and it is incremented at each ping pong step by the sending process. As the
    ping_pong_count is incremented, the processes take turns being the sender and receiver.
    Finally, after the limit is reached (ten in my code), the processes stop sending and
    receiving. The output of the example code will look something like this.


    The output of your program will likely be different from mine. However, as you can see,
    processes zero and one take turns sending and receiving the ping pong counter
    to each other.

    Ring Program

    I have included one more example of MPI_Send and MPI_Recv using more than two

    processes. In this example, a value is passed around by all processes in a ring-like
    fashion. Take a look at ring.c in the example code download. The major portion of the

    code looks like this.

    int token;
    if (world_rank != 0) {
        MPI_Recv(&token, 1, MPI_INT, world_rank - 1, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Process %d received token %d from process %d\n",
               world_rank, token, world_rank - 1);
    } else {
        // Set the token's value if you are process 0
        token = -1;
    }
    MPI_Send(&token, 1, MPI_INT, (world_rank + 1) % world_size,
             0, MPI_COMM_WORLD);
    // Now process 0 can receive from the last process.
    if (world_rank == 0) {
        MPI_Recv(&token, 1, MPI_INT, world_size - 1, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Process %d received token %d from process %d\n",
               world_rank, token, world_size - 1);
    }

    The ring program initializes a value from process zero, and the value is passed around
    every single process. The program terminates when process zero receives the value


    from the last process. As you can see from the program, extra care is taken to assure that

    it doesn't deadlock. In other words, process zero makes sure that it has completed its

    first send before it tries to receive the value from the last process. All of the other

    processes simply call MPI_Recv (receiving from their neighboring lower process) and

    then MPI_Send (sending the value to their neighboring higher process) to pass the value

    along the ring.

    MPI_Send and MPI_Recv will block until the message has been transmitted. Because of

    this, the printfs should occur in the order in which the value is passed. Using five

    processes, the output should look like this.

    As we can see, process zero first sends a value of negative one to process one. This

    value is passed around the ring until it gets back to process zero.

    Up Next

    Now that you have a basic understanding of MPI_Send and MPI_Recv, it is now time to

    go a little bit deeper into these functions. In the next lesson, I cover how to probe and

    dynamically receive messages. Feel free to also examine the beginner MPI tutorial for a

    complete reference of all of the beginning MPI lessons.

    Dynamic Receiving with MPI_Probe (and
    MPI_Status)

    In the previous lesson, I discussed how to use MPI_Send and MPI_Recv to perform

    standard point-to-point communication. I only covered how to send messages in which

    the length of the message was known beforehand. Although it is possible to send the

    length of the message as a separate send / recv operation, MPI natively supports

    dynamic messages with just a few additional function calls. I will be going over how to
    use these functions in this lesson. The code for this tutorial is located here.

    The MPI_Status Structure

    As covered in the previous lesson, the MPI_Recv operation takes the address of an

    MPI_Status structure as an argument (which can be ignored with

    MPI_STATUS_IGNORE). If we pass an MPI_Status structure to the MPI_Recv

    function, it will be populated with additional information about the receive operation

    after it completes. The three primary pieces of information include:


    1. The rank of the sender. The rank of the sender is stored in the MPI_SOURCE
       element of the structure. That is, if we declare an MPI_Status stat variable,
       the rank can be accessed with stat.MPI_SOURCE.
    2. The tag of the message. The tag of the message can be accessed by the
       MPI_TAG element of the structure (similar to MPI_SOURCE).
    3. The length of the message. The length of the message does not have a
       predefined element in the status structure. Instead, we have to find out the
       length of the message with

       MPI_Get_count(MPI_Status* status, MPI_Datatype datatype, int* count)

       where count is the total number of datatype elements that were received.

    Why would any of this information be necessary? It turns out that MPI_Recv can take

    MPI_ANY_SOURCE for the rank of the sender and MPI_ANY_TAG for the tag of the

    message. For this case, the MPI_Status structure is the only way to find out the actual

    sender and tag of the message. Furthermore, MPI_Recv is not guaranteed to receive the
    entire amount of elements passed as the argument to the function call. Instead, it

    receives the amount of elements that were sent to it (and returns an error if more

    elements were sent than the desired receive amount). The MPI_Get_count function is

    used to determine the actual receive amount.
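    For instance, a receiver that does not know in advance who will message it might post a
    wildcard receive like the following hypothetical fragment (the buffer size and variable
    names are placeholders, and MPI is assumed to already be initialized) and then inspect
    the status afterwards:

    int buf[100];
    MPI_Status status;
    MPI_Recv(buf, 100, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
             MPI_COMM_WORLD, &status);
    // The wildcards are resolved by looking inside the status structure.
    int received;
    MPI_Get_count(&status, MPI_INT, &received);
    printf("Received %d ints from rank %d with tag %d\n",
           received, status.MPI_SOURCE, status.MPI_TAG);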

    An Example of Querying the MPI_Status Structure

    The program that queries the MPI_Status structure, check_status.c, is provided in the

    example code. The program sends a random amount of numbers to a receiver, and the

    receiver then finds out how many numbers were sent. The main part of the code looks

    like this.

    const int MAX_NUMBERS = 100;
    int numbers[MAX_NUMBERS];
    int number_amount;
    if (world_rank == 0) {
        // Pick a random amount of integers to send to process one
        srand(time(NULL));
        number_amount = (rand() / (float)RAND_MAX) * MAX_NUMBERS;
        // Send the amount of integers to process one
        MPI_Send(numbers, number_amount, MPI_INT, 1, 0, MPI_COMM_WORLD);
        printf("0 sent %d numbers to 1\n", number_amount);
    } else if (world_rank == 1) {
        MPI_Status status;
        // Receive at most MAX_NUMBERS from process zero
        MPI_Recv(numbers, MAX_NUMBERS, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 &status);
        // After receiving the message, check the status to determine
        // how many numbers were actually received
        MPI_Get_count(&status, MPI_INT, &number_amount);
        // Print off the amount of numbers, and also print additional
        // information in the status object
        printf("1 received %d numbers from 0. Message source = %d, "
               "tag = %d\n",
               number_amount, status.MPI_SOURCE, status.MPI_TAG);
    }

    As we can see, process zero randomly sends up to MAX_NUMBERS integers to

    process one. Process one then calls MPI_Recv for a total of MAX_NUMBERS integers.

    Although process one is passing MAX_NUMBERS as the argument to MPI_Recv,

    process one will receive at most this amount of numbers. In the code, process one calls
    MPI_Get_count with MPI_INT as the datatype to find out how many integers were

    actually received. Along with printing off the size of the received message, process one

    also prints off the source and tag of the message by accessing the MPI_SOURCE and

    MPI_TAG elements of the status structure.

    As a clarification, the return value from MPI_Get_count is relative to the datatype

    which is passed. If the user were to use MPI_CHAR as the datatype, the returned

    amount would be four times as large (assuming an integer is four bytes and a char is one

    byte).
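    In other words, using the status object from the example above, something like the
    following hypothetical fragment would report both views of the same message:

    // Count the same received message two different ways.
    int count_as_ints, count_as_chars;
    MPI_Get_count(&status, MPI_INT, &count_as_ints);
    MPI_Get_count(&status, MPI_CHAR, &count_as_chars);
    // With four-byte integers, count_as_chars should be 4 * count_as_ints.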

    If you run the check_status program, the output should look similar to this.

    As expected, process zero sends a random amount of integers to process one, which

    prints off information about the received message.

    Using MPI_Probe to Find Out the Message Size

    Now that you understand how the MPI_Status object works, we can now use it to our

    advantage a little bit more. Instead of posting a receive and simply providing a really

    large buffer to handle all possible sizes of messages (as we did in the last example), you

    can use MPI_Probe to query the message size before actually receiving it. The function

    prototype looks like this.

    MPI_Probe(int source, int tag, MPI_Comm comm, MPI_Status* status)

    MPI_Probe looks quite similar to MPI_Recv. In fact, you can think of MPI_Probe as an

    MPI_Recv that does everything but receive the message. Similar to MPI_Recv,

    MPI_Probe will block for a message with a matching tag and sender. When the message

    is available, it will fill the status structure with information. The user can then use

    MPI_Recv to receive the actual message.

    The provided code has an example of this in probe.c. Here's what the main source code

    looks like.

    int number_amount;
    if (world_rank == 0) {
        const int MAX_NUMBERS = 100;
        int numbers[MAX_NUMBERS];
        // Pick a random amount of integers to send to process one
        srand(time(NULL));
        number_amount = (rand() / (float)RAND_MAX) * MAX_NUMBERS;
        // Send the random amount of integers to process one
        MPI_Send(numbers, number_amount, MPI_INT, 1, 0, MPI_COMM_WORLD);
        printf("0 sent %d numbers to 1\n", number_amount);
    } else if (world_rank == 1) {
        MPI_Status status;
        // Probe for an incoming message from process zero
        MPI_Probe(0, 0, MPI_COMM_WORLD, &status);
        // When probe returns, the status object has the size and other
        // attributes of the incoming message. Get the size of the message
        MPI_Get_count(&status, MPI_INT, &number_amount);
        // Allocate a buffer just big enough to hold the incoming numbers
        int* number_buf = (int*)malloc(sizeof(int) * number_amount);
        // Now receive the message with the allocated buffer
        MPI_Recv(number_buf, number_amount, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("1 dynamically received %d numbers from 0.\n",
               number_amount);
        free(number_buf);
    }

    Similar to the last example, process zero picks a random amount of numbers to send to

    process one. What is different in this example is that process one now calls MPI_Probe

    to find out how many elements process zero is trying to send (using MPI_Get_count).
    Process one then allocates a buffer of the proper size and receives the numbers. Running

    the code will look similar to this.

    Although this example is trivial, MPI_Probe forms the basis of many dynamic MPI

    applications. For example, master / slave programs will often make heavy use of

    MPI_Probe when exchanging variable-sized worker messages. As an exercise, make a
    wrapper around MPI_Recv that uses MPI_Probe for any dynamic applications you
    might write. It makes the code look much nicer.
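    As a starting point for that exercise, here is one possible shape for such a wrapper. This
    is only a sketch of my own (it is not in the lesson's code download, and the function name
    recv_dynamic is a placeholder): it probes first, allocates exactly enough space, and hands
    the buffer and element count back to the caller. MPI_Type_size is a standard MPI call
    that returns the size in bytes of a datatype.

    #include <mpi.h>
    #include <stdlib.h>

    // Receive a message whose length is not known in advance.
    // The caller is responsible for freeing *data afterwards.
    void recv_dynamic(void** data, int* count, MPI_Datatype datatype,
                      int source, int tag, MPI_Comm comm) {
        MPI_Status status;
        // Block until a matching message is available and inspect its size
        MPI_Probe(source, tag, comm, &status);
        MPI_Get_count(&status, datatype, count);
        // Allocate a buffer of exactly the right size
        int type_size;
        MPI_Type_size(datatype, &type_size);
        *data = malloc((size_t)(*count) * (size_t)type_size);
        // Receive the probed message (use the probed source and tag so that
        // wildcards such as MPI_ANY_SOURCE also work)
        MPI_Recv(*data, *count, datatype, status.MPI_SOURCE, status.MPI_TAG,
                 comm, MPI_STATUS_IGNORE);
    }

    A caller would then write something like int* numbers; int n;
    recv_dynamic((void**)&numbers, &n, MPI_INT, 0, 0, MPI_COMM_WORLD); and free the
    buffer when it is done with the data.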

    Next

    Do you feel comfortable using the standard blocking point-to-point communication

    routines? If so, then you already have the ability to write endless amounts of parallel

    applications! Let's look at a more advanced example of using the routines you have

    learned. Check out the application example using MPI_Send, MPI_Recv, and

    MPI_Probe.


    Point-to-Point Communication
    Application - Random Walk

    It's time to go through an application example using some of the concepts introduced in

    the sending and receiving tutorial and the MPI_Probe and MPI_Status lesson. The code

    for the application can be downloaded here. The application simulates a process which I

    refer to as random walking. The basic problem definition of a random walk is as

    follows. Given a Min, Max, and random walker W, make walker W take S random walks
    of arbitrary length to the right. If the process goes out of bounds, it wraps back around.
    W can only move one unit to the right or left at a time.

    Although the application in itself is very basic, the parallelization of random walking

    can simulate the behavior of a wide variety of parallel applications. More on that later.

    For now, let's overview how to parallelize the random walk problem.

    Parallelization of the Random Walking Problem

    Our first task, which is pertinent to many parallel programs, is splitting the domain

    across processes. The random walk problem has a one-dimensional domain of size Max -
    Min + 1 (since Max and Min are inclusive to the walker). Assuming that walkers can
    only take integer-sized steps, we can easily partition the domain into near-equal-sized
    chunks across processes. For example, if Min is 0 and Max is 20 and we have four

    processes, the domain would be split like this.

    The first three processes own five units of the domain while the last process takes the
    last five units plus the one remaining unit.

    Once the domain has been partitioned, the application will initialize walkers. As

    explained earlier, a walker will take S walks with a random total walk size. For

    example, if the walker takes a walk of size six on process zero (using the previous

    domain decomposition), the execution of the walker will go like this:

    1. The walker starts taking incremental steps. When it hits value four, however, it
       has reached the end of the bounds of process zero. Process zero now has to

    communicate the walker to process one.


    2. Process one receives the walker and continues walking until it has reached its
       total walk size of six. The walker can then proceed on a new random walk.

    In this example, W only had to be communicated one time from process zero to process
    one. If W had to take a longer walk, however, it may have needed to be passed through

    more processes along its path through the domain.

    Coding the Application using MPI_Send and MPI_Recv

    This application can be coded using MPI_Send and MPI_Recv. Before we begin

    looking at code, let's establish some preliminary characteristics and functions of the

    program:

    - Each process determines their part of the domain.
    - Each process initializes exactly N walkers, all of which start at the first value of
      their local domain.
    - Each walker has two associated integer values: the current position of the walker
      and the number of steps left to take.
    - Walkers start traversing through the domain and are passed to other processes
      until they have completed their walk.
    - The processes terminate when all walkers have finished.

    Let's begin by writing code for the domain decomposition. The function will take in the

    total domain size and find the appropriate subdomain for the MPI process. It will also

    give any remainder of the domain to the final process. For simplicity, I just call

    MPI_Abort for any errors that are found. The function, called decompose_domain,

    looks like this:

    void decompose_domain(int domain_size, int world_rank,
                          int world_size, int* subdomain_start,
                          int* subdomain_size) {
        if (world_size > domain_size) {
            // Don't worry about this special case. Assume the domain size
            // is greater than the world size.
            MPI_Abort(MPI_COMM_WORLD, 1);
        }
        *subdomain_start = domain_size / world_size * world_rank;
        *subdomain_size = domain_size / world_size;
        if (world_rank == world_size - 1) {
            // Give remainder to last process
            *subdomain_size += domain_size % world_size;
        }
    }


    As you can see, the function splits the domain into even chunks, taking care of the case

    when a remainder is present. The function returns a subdomain start and a subdomain

    size.
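    As a quick sanity check of the arithmetic, here is what the function computes for the
    earlier example of Min = 0 and Max = 20 (domain_size = 21) with four processes:

    // decompose_domain with domain_size = 21 and world_size = 4:
    //   rank 0: subdomain_start = 0,  subdomain_size = 5
    //   rank 1: subdomain_start = 5,  subdomain_size = 5
    //   rank 2: subdomain_start = 10, subdomain_size = 5
    //   rank 3: subdomain_start = 15, subdomain_size = 5 + (21 % 4) = 6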

    Next, we need to create a function that initializes walkers. We first define a walker

    structure that looks like this:

    typedef struct {
        int location;
        int num_steps_left_in_walk;
    } Walker;

    Our initialization function, called initialize_walkers, takes the subdomain bounds and

    adds walkers to an incoming_walkers vector (by the way, this application is in C++).

    void initialize_walkers(int num_walkers_per_proc, int max_walk_size,
                            int subdomain_start, int subdomain_size,
                            vector<Walker>* incoming_walkers) {
        Walker walker;
        for (int i = 0; i < num_walkers_per_proc; i++) {
            // Initialize walkers at the start of the subdomain
            walker.location = subdomain_start;
            walker.num_steps_left_in_walk =
                (rand() / (float)RAND_MAX) * max_walk_size;
            incoming_walkers->push_back(walker);
        }
    }

    After initialization, it is time to progress the walkers. Let's start off by making a
    walking function. This function is responsible for progressing the walker until it has
    finished its walk. If it goes out of local bounds, it is added to the outgoing_walkers

    vector.

    void walk(Walker* walker, int subdomain_start, int subdomain_size,
              int domain_size, vector<Walker>* outgoing_walkers) {
        while (walker->num_steps_left_in_walk > 0) {
            if (walker->location == subdomain_start + subdomain_size) {
                // Take care of the case when the walker is at the end
                // of the domain by wrapping it around to the beginning
                if (walker->location == domain_size) {
                    walker->location = 0;
                }
                outgoing_walkers->push_back(*walker);
                break;
            } else {
                walker->num_steps_left_in_walk--;
                walker->location++;
            }
        }
    }

    Now that we have established an initialization function (that populates an incoming

    walker list) and a walking function (that populates an outgoing walker list), we only

    need two more functions: a function that sends outgoing walkers and a function that
    receives incoming walkers. The sending function looks like this:


    void send_outgoing_walkers(vector<Walker>* outgoing_walkers,
                               int world_rank, int world_size) {
        // Send the data as an array of MPI_BYTEs to the next process.
        // The last process sends to process zero.
        MPI_Send((void*)outgoing_walkers->data(),
                 outgoing_walkers->size() * sizeof(Walker), MPI_BYTE,
                 (world_rank + 1) % world_size, 0, MPI_COMM_WORLD);
        // Clear the outgoing walkers
        outgoing_walkers->clear();
    }

    The function that receives incoming walkers should use MPI_Probe since it does not

    know beforehand how many walkers it will receive. This is what it looks like:

    void receive_incoming_walkers(vector<Walker>* incoming_walkers,
                                  int world_rank, int world_size) {
        // Probe for new incoming walkers
        MPI_Status status;
        // Receive from the process before you. If you are process zero,
        // receive from the last process
        int incoming_rank =
            (world_rank == 0) ? world_size - 1 : world_rank - 1;
        MPI_Probe(incoming_rank, 0, MPI_COMM_WORLD, &status);
        // Resize your incoming walker buffer based on how much data is
        // being received
        int incoming_walkers_size;
        MPI_Get_count(&status, MPI_BYTE, &incoming_walkers_size);
        incoming_walkers->resize(incoming_walkers_size / sizeof(Walker));
        MPI_Recv((void*)incoming_walkers->data(), incoming_walkers_size,
                 MPI_BYTE, incoming_rank, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    Now we have established the main functions of the program. We have to tie all of these
    functions together as follows:

    1. Initialize the walkers.
    2. Progress the walkers with the walk function.
    3. Send out any walkers in the outgoing_walkers vector.
    4. Receive new walkers and put them in the incoming_walkers vector.
    5. Repeat steps two through four until all walkers have finished.

    The first attempt at writing this program is below. For now, we will not worry about

    how to determine when all walkers have finished. Before you look at the code, I must

    warn you this code is incorrect! With this in mind, let's look at my code and hopefully

    you can see what might be wrong with it.

    // Find your part of the domain
    decompose_domain(domain_size, world_rank, world_size,
                     &subdomain_start, &subdomain_size);
    // Initialize walkers in your subdomain
    initialize_walkers(num_walkers_per_proc, max_walk_size,
                       subdomain_start, subdomain_size,
                       &incoming_walkers);
    while (!all_walkers_finished) {  // Determine walker completion later
        // Process all incoming walkers
        for (int i = 0; i < incoming_walkers.size(); i++) {
            walk(&incoming_walkers[i], subdomain_start, subdomain_size,
                 domain_size, &outgoing_walkers);
        }
        // Send all outgoing walkers to the next process.
        send_outgoing_walkers(&outgoing_walkers, world_rank,
                              world_size);
        // Receive all the new incoming walkers
        receive_incoming_walkers(&incoming_walkers, world_rank,
                                 world_size);
    }

    Everything looks normal, but the order of function calls has introduced a very likely
    scenario: deadlock.

    Deadlock and Prevention

    According to Wikipedia, deadlock "refers to a specific condition when two or more
    processes are each waiting for the other to release a resource, or more than two
    processes are waiting for resources in a circular chain." In our case, the above code

    will result in a circular chain of MPI_Send calls.

    It is worth noting that the above code will actually not deadlock most of the time.

    Although MPI_Send is a blocking call, the MPI specification says that MPI_Send

    blocks until the send buffer can be reclaimed. This means that MPI_Send will return

    when the network can buffer the message. If the sends eventually can't be buffered by

    the network, they will block until a matching receive is posted. In our case, there are

    enough small sends and frequent matching receives to not worry about deadlock;

    however, a big enough network buffer should never be assumed.

    Since we are only focusing on MPI_Send and MPI_Recv in this lesson, the best way to

    avoid the possible sending and receiving deadlock is to order the messaging such that
    sends will have matching receives and vice versa. One easy way to do this is to change

    our loop around such that even-numbered processes send outgoing walkers before

    receiving walkers and odd-numbered processes do the opposite. Given two stages of

    execution, the sending and receiving will now look like this:


    Note - Executing this with one process can still deadlock. To avoid this, simply
    don't perform sends and receives when using one process. You may be asking, "Does
    this still work with an odd number of processes?" We can go through a similar diagram

    again with three processes:

    As you can see, at all three stages, there is at least one posted MPI_Send that matches a
    posted MPI_Recv, so we don't have to worry about the occurrence of deadlock.

    Determining Completion of All Walkers

    Now comes the final step of the program: determining when every single walker has

    finished. Since walkers can walk for a random length, they can finish their journey on

    any process. Because of this, it is difficult for all processes to know when all walkers

    have finished without some sort of additional communication. One possible solution is

    to have process zero keep track of all of the walkers that have finished and then tell all

    the other processes when to terminate. This solution, however, is quite cumbersome
    since each process would have to report any completed walkers to process zero and then

    also handle different types of incoming messages.

    For this lesson, we will keep things simple. Since we know the maximum distance that

    any walker can travel and the smallest total size it can travel for each pair of sends and

    receives (the subdomain size), we can figure out the amount of sends and receives each

    process should do before termination. Using this characteristic of the program along

    with our strategy to avoid deadlock, the final main part of the program looks like this:

    // Find your part of the domain
    decompose_domain(domain_size, world_rank, world_size,
                     &subdomain_start, &subdomain_size);
    // Initialize walkers in your subdomain
    initialize_walkers(num_walkers_per_proc, max_walk_size,
                       subdomain_start, subdomain_size,
                       &incoming_walkers);
    // Determine the maximum amount of sends and receives needed to
    // complete all walkers
    int maximum_sends_recvs =
        max_walk_size / (domain_size / world_size) + 1;
    for (int m = 0; m < maximum_sends_recvs; m++) {
        // Process all incoming walkers
        for (int i = 0; i < incoming_walkers.size(); i++) {
            walk(&incoming_walkers[i], subdomain_start, subdomain_size,
                 domain_size, &outgoing_walkers);
        }
        // Send and receive if you are even and vice versa for odd
        if (world_rank % 2 == 0) {
            send_outgoing_walkers(&outgoing_walkers, world_rank,
                                  world_size);
            receive_incoming_walkers(&incoming_walkers, world_rank,
                                     world_size);
        } else {
            receive_incoming_walkers(&incoming_walkers, world_rank,
                                     world_size);
            send_outgoing_walkers(&outgoing_walkers, world_rank,
                                  world_size);
        }
    }

    Running the Application

    The code for the application can be downloaded here. In contrast to the other lessons,
    this code uses C++. When installing MPICH2, you also installed the C++ MPI compiler

    (unless you explicitly configured it otherwise). If you installed MPICH2 in a local

    directory, make sure that you have set your MPICXX environment variable to point to

    the correct mpicxx compiler in order to use my makefile.

    In my code, I have set up the run script to provide default values for the program: 100

    for the domain size, 500 for the maximum walk size, and 20 for the number of walkers

    per process. The run script should spawn five MPI processes, and the output should

    look similar to this:


    The output continues until processes finish all sending and receiving of all walkers.

    So What's Next?

    If you have made it through this entire application and feel comfortable, then good! This

    application is quite advanced for a first real application. If you still don't feel
    comfortable with MPI_Send, MPI_Recv, and MPI_Probe, I'd recommend going

    through some of the examples in my recommended books for more practice.

    Next, we will start learning about collective communication in MPI, so stay tuned!

    Also, at the beginning, I told you that the concepts of this program are applicable to

    many parallel programs. I don't want to leave you hanging, so I have included some
    additional reading material below for anyone that wishes to learn more. Enjoy!

    ADDITIONAL READING

    Random Walking and Its Similarity to Parallel Particle Tracing

    The random walk problem that we just coded, although

    seemingly trivial, can actually form the basis of simulating

    many types of parallel applications. Some parallel applications

    in the scientific domain require many types of randomized sends

    and receives. One example application is parallel particle

    tracing.

    Parallel particle tracing is one of the primary methods that are

    used to visualize flow fields. Particles are inserted into the flow

    field and then traced along the flow using numerical integration

    techniques (such as Runge-Kutta). The traced paths can then be

    rendered for visualization purposes. One example rendering is of the tornado image at

    the top left.


    Performing efficient parallel particle tracing can be very difficult. The main reason for

    this is because the direction in which particles travel can only be determined after each

    incremental step of the integration. Therefore, it is hard for processes to coordinate and

    balance all communication and computation. To understand this better, lets look at a

    typical parallelization of particle tracing.


    In this illustration, we see that the domain is split among six processes. Particles

    (sometimes referred to as seeds) are then placed in the subdomains (similar to how we

    placed walkers in subdomains), and then they begin tracing. When particles go out of

    bounds, they have to be exchanged with processes which have the proper subdomain.

    This process is repeated until the particles have either left the entire domain or have

    reached a maximum trace length.

    The parallel particle tracing problem can be solved with MPI_Send, MPI_Recv, and

    MPI_Probe in a similar manner to our application that we just coded. There are,

    however, much more sophisticated MPI routines that can get the job done more

    efficiently. We will talk about these in the coming lessons.

    I hope you can now see at least one example of how the random walk problem is similar

    to other parallel applications. Stay tuned for more lessons and applications!

    MPI Broadcast and Collective Communication

    So far in the beginner MPI tutorial, we have examined point-to-point communication,

    which is communication between two processes. This lesson is the start of the collective

    communication section. Collective communication is a method of communication

    which involves participation of all processes in a communicator. In this lesson, we will

    discuss the implications of collective communication and go over a standard collective

    routine: broadcasting. The code for the lesson can be downloaded here.

    Collective Communication and Synchronization Points

    One of the things to remember about collective communication is that it implies a

    synchronization point among processes. This means that all processes must reach a

    point in their code before they can all begin executing again.

    Before going into detail about collective communication routines, let's examine

    synchronization in more detail. As it turns out, MPI has a special function that is

    dedicated to synchronizing processes:

    MPI_Barrier(MPI_Comm communicator)

    The name of the function is quite descriptive: the function forms a barrier, and no

    processes in the communicator can pass the barrier until all of them call the function.

    Here's an illustration. Imagine the horizontal axis represents execution of the program

    and the circles represent different processes:


    Process zero first calls MPI_Barrier at the first time snapshot (T1). While process zero
    is hung up at the barrier, processes one and three eventually make it (T2). When process
    two finally makes it to the barrier (T3), all of the processes then begin execution again
    (T4).

    MPI_Barrier can be useful for many things. One of the primary uses of MPI_Barrier is

    to synchronize a program so that portions of the parallel code can be timed accurately.

    Want to know how MPI_Barrier is implemented? Sure you do! Do you remember the

    ring program from the MPI_Send and MPI_Recv tutorial? To refresh your memory, we

    wrote a program that passed a token around all processes in a ring-like fashion. This

    type of program is one of the simplest methods to implement a barrier since a token

    can't be passed around completely until all processes work together.
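    To make that idea concrete, here is a hedged sketch of a ring-style barrier built only from
    the send and receive calls we already know. This is my own illustration rather than the
    actual MPI_Barrier implementation; the token travels around the ring twice so that no
    process can leave before every process has arrived.

    void ring_barrier(MPI_Comm comm) {
        int rank, size, token = 0;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);
        if (size == 1) return;  // nothing to synchronize
        // Lap one proves that every process has entered the barrier.
        // Lap two releases every process once process zero knows this.
        for (int lap = 0; lap < 2; lap++) {
            if (rank == 0) {
                MPI_Send(&token, 1, MPI_INT, 1, 0, comm);
                MPI_Recv(&token, 1, MPI_INT, size - 1, 0, comm,
                         MPI_STATUS_IGNORE);
            } else {
                MPI_Recv(&token, 1, MPI_INT, rank - 1, 0, comm,
                         MPI_STATUS_IGNORE);
                MPI_Send(&token, 1, MPI_INT, (rank + 1) % size, 0, comm);
            }
        }
    }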

    One final note about synchronization - Always remember that every collective call you
    make is synchronized. In other words, if you can't successfully complete an
    MPI_Barrier, then you also can't successfully complete any collective call. If you try to
    call MPI_Barrier or other collective routines without ensuring all processes in the
    communicator will also call it, your program will idle. This can be very confusing for
    beginners, so be careful!

    Broadcasting with MPI_Bcast

    A broadcast is one of the standard collective communication techniques. During a

    broadcast, one process sends the same data to all processes in a communicator. One of

    the main uses of broadcasting is to send out user input to a parallel program, or send out

    configuration parameters to all processes.

    The communication pattern of a broadcast looks like this:


    In this example, process zero is the root process, and it has the initial copy of data. All

    of the other processes receive the copy of data.

    In MPI, broadcasting can be accomplished by using MPI_Bcast. The function prototype

    looks like this:

    MPI_Bcast(void* data, int count, MPI_Datatype datatype, int root,
              MPI_Comm communicator)

    Although the root process and receiver processes do different jobs, they all call the same

    MPI_Bcast function. When the root process (in our example, it was process zero) calls

    MPI_Bcast, the data variable will be sent to all other processes. When all of the receiver

    processes call MPI_Bcast, the data variable will be filled in with the data from the root

    process.
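    As a minimal usage sketch (a hypothetical fragment of my own, assuming MPI has
    already been initialized), every process executes the very same broadcast line, and
    afterwards every process holds the value that the root started with:

    int value = 0;
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    if (world_rank == 0) {
        // Only the root has meaningful data before the broadcast
        value = 100;
    }
    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);
    // Every process in MPI_COMM_WORLD now has value == 100.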

    Broadcasting with MPI_Send and MPI_Recv

    At first, it might seem that MPI_Bcast is just a simple wrapper around MPI_Send and

    MPI_Recv. In fact, we can make this wrapper function right now. Our function, called

    my_bcast, can be downloaded in the example code for this lesson (my_bcast.c). It takes
    the same arguments as MPI_Bcast and looks like this:

    void my_bcast(void* data, int count, MPI_Datatype datatype, int root,
                  MPI_Comm communicator) {
        int world_rank;
        MPI_Comm_rank(communicator, &world_rank);
        int world_size;
        MPI_Comm_size(communicator, &world_size);

        if (world_rank == root) {
            // If we are the root process, send our data to everyone
            int i;
            for (i = 0; i < world_size; i++) {
                if (i != world_rank) {
                    MPI_Send(data, count, datatype, i, 0, communicator);
                }
            }
        } else {
            // If we are a receiver process, receive the data from the root
            MPI_Recv(data, count, datatype, root, 0, communicator,
                     MPI_STATUS_IGNORE);
        }
    }


    The root process sends the data to everyone else while the others receive from the root

    process. Easy, right? If you download the code and run the program, the program will

    print output like this:

    Believe it or not, our function is actually very inefficient! Imagine that each process has

    only one outgoing/incoming network link. Our function is only using one network link

    from process zero to send all the data. A smarter implementation is a tree-based

    communication algorithm that can use more of the available network links at once. For

    example:

    In this illustration, process zero starts off with the data and sends it to process one.
    Similar to our previous example, process zero also sends the data to process two in the

    second stage. The difference with this example is that process one is now helping out

    the root process by forwarding the data to process three. During the second stage, two

    network connections are being utilized at a time. The network utilization doubles at

    every subsequent stage of the tree communication until all processes have received the

    data.

Do you think you can code this? Writing a full implementation is a bit outside the scope of this lesson, but a rough sketch follows. If you are feeling brave, Parallel Programming with MPI is an excellent book with a complete, worked example of the problem.
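Here is a minimal sketch of one possible binomial-tree broadcast. This is my own illustrative version, not MPI's internal implementation or the book's code; for simplicity it assumes the root is rank zero, so it omits the root argument from the my_bcast signature:

// Tree broadcast sketch: assumes the root is rank 0.
void tree_bcast(void* data, int count, MPI_Datatype datatype,
                MPI_Comm communicator) {
  int world_rank, world_size;
  MPI_Comm_rank(communicator, &world_rank);
  MPI_Comm_size(communicator, &world_size);

  // In stage s (mask = 1, 2, 4, ...), every rank below mask already holds
  // the data and forwards it to the partner rank (world_rank + mask).
  int mask;
  for (mask = 1; mask < world_size; mask <<= 1) {
    if (world_rank < mask) {
      int partner = world_rank + mask;
      if (partner < world_size) {
        MPI_Send(data, count, datatype, partner, 0, communicator);
      }
    } else if (world_rank < (mask << 1)) {
      // Ranks in [mask, 2*mask) receive the data during this stage.
      MPI_Recv(data, count, datatype, world_rank - mask, 0, communicator,
               MPI_STATUS_IGNORE);
    }
  }
}

Each stage doubles the number of processes holding the data, which is exactly the doubling of network utilization described above.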

    Comparison of MPI_Bcast with MPI_Send and MPI_Recv

    The MPI_Bcast implementation utilizes a similar tree broadcast algorithm for good

    network utilization. How does our broadcast function compare to MPI_Bcast? We can

    run compare_bcast, an example program included in the lesson code. Before looking at

the code, let's first go over one of MPI's timing functions, MPI_Wtime. MPI_Wtime takes no arguments and simply returns a floating-point number of seconds since a fixed time in the past. Similar to C's time function, you can call MPI_Wtime at multiple points in your program and subtract the returned values to obtain the timing of code segments.
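For instance, a minimal timing pattern looks like the fragment below (assumed to run inside an already-initialized MPI program); compute_something is just a placeholder for whatever code segment you want to measure:

double start = MPI_Wtime();
compute_something();                  // hypothetical code segment to measure
double elapsed = MPI_Wtime() - start;
printf("Segment took %f seconds\n", elapsed);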


Let's take a look at the code that compares my_bcast to MPI_Bcast:

for (i = 0; i < num_trials; i++) {
  // Time my_bcast
  // Synchronize before starting timing
  MPI_Barrier(MPI_COMM_WORLD);
  total_my_bcast_time -= MPI_Wtime();
  my_bcast(data, num_elements, MPI_INT, 0, MPI_COMM_WORLD);
  // Synchronize again before obtaining final time
  MPI_Barrier(MPI_COMM_WORLD);
  total_my_bcast_time += MPI_Wtime();

  // Time MPI_Bcast
  MPI_Barrier(MPI_COMM_WORLD);
  total_mpi_bcast_time -= MPI_Wtime();
  MPI_Bcast(data, num_elements, MPI_INT, 0, MPI_COMM_WORLD);
  MPI_Barrier(MPI_COMM_WORLD);
  total_mpi_bcast_time += MPI_Wtime();
}

    In this code, num_trials is a variable stating how many timing experiments should be

    executed. We keep track of the accumulated time of both functions in two different

    variables. The average times are printed at the end of the program. To see the entire

    code, just download the lesson code and look at compare_bcast.c.

When you use the run script to execute the code, it runs with 16 processors, 100,000 integers per broadcast, and 10 trial runs for the timing results. In my experiment, with 16 processors connected via Ethernet, there are significant timing differences between our naive implementation and MPI's implementation. Here is what the timing results (in seconds) look like at various scales.

Processors    my_bcast    MPI_Bcast
 2            0.0344      0.0344
 4            0.1025      0.0817
 8            0.2385      0.1084
16            0.5109      0.1296

As you can see, there is no difference between the two implementations with two processors. This is because MPI_Bcast's tree implementation does not provide any additional network utilization when only two processors are used. However, the differences can clearly be observed when scaling up to as few as 16 processors.


    Try running the code yourself and experiment at larger scales!

    Conclusions / Up Next

Feel a little better about collective routines? In the next MPI tutorial, I go over other essential collective communication routines: gathering and scattering.

For all beginner lessons, go to the beginner MPI tutorial.

    MPI Scatter, Gather, and Allgather

    In the previous lesson, we went over the essentials of collective communication. We

covered the most basic collective communication routine, MPI_Bcast. In this lesson,

    we are going to expand on collective communication routines by going over two very

important routines: MPI_Scatter and MPI_Gather. We will also cover a variant of

    MPI_Gather, known as MPI_Allgather. The code for this tutorial is available here.

    An Introduction to MPI_Scatter

MPI_Scatter is a collective routine that is very similar to MPI_Bcast (if you are

    unfamiliar with these terms, please read the previous lesson). MPI_Scatter involves a

    designated root process sending data to all processes in a communicator. The primary

    difference between MPI_Bcast and MPI_Scatter is small but important. MPI_Bcast

    sends the same piece of data to all processes while MPI_Scatter sends chunks of an

    array to different processes. Check out the illustration below for further clarification.

In the illustration, MPI_Bcast takes a single data element at the root process (the red box) and copies it to all other processes. MPI_Scatter takes an array of elements and

    distributes the elements in the order of process rank. The first element (in red) goes to

    process zero, the second element (in green) goes to process one, and so on. Although


    the root process (process zero) contains the entire array of data, MPI_Scatter will copy

    the appropriate element into the receiving buffer of the process. Here is what the

    function prototype of MPI_Scatter looks like.

MPI_Scatter(void* send_data, int send_count, MPI_Datatype send_datatype,
            void* recv_data, int recv_count, MPI_Datatype recv_datatype,
            int root, MPI_Comm communicator)

Yes, the function looks big and scary, but let's examine it in more detail. The first parameter, send_data, is an array of data that resides on the root process. The second and third parameters, send_count and send_datatype, dictate how many elements of a specific MPI datatype will be sent to each process. If send_count is one and send_datatype is MPI_INT, then process zero gets the first integer of the array, process one gets the second integer, and so on. If send_count is two, then process zero gets the first and second integers, process one gets the third and fourth, and so on. In practice, send_count is often equal to the number of elements in the array divided by the number of processes. What's that you say? The number of elements isn't divisible by the number of processes? Don't worry, we will cover that in a later lesson.

The receiving parameters of the function prototype are nearly identical to the sending parameters. The recv_data parameter is a buffer of data that can hold recv_count elements that have a datatype of recv_datatype. The last parameters, root and communicator, indicate the root process that is scattering the array of data and the communicator in which the processes reside.
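To make the send_count semantics concrete, here is a small hypothetical fragment (assuming MPI_Init has already been called and exactly four processes are running) that scatters two integers to each process:

// Assumes exactly 4 processes; each rank receives 2 of the 8 integers.
int send_values[8] = {10, 11, 20, 21, 30, 31, 40, 41};  // only meaningful on root
int recv_values[2];
MPI_Scatter(send_values, 2, MPI_INT,   // root sends 2 ints to each rank
            recv_values, 2, MPI_INT,   // each rank receives 2 ints
            0, MPI_COMM_WORLD);
// Rank 0 now holds {10, 11}, rank 1 holds {20, 21}, and so on.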

    An Introduction to MPI_Gather

MPI_Gather is the inverse of MPI_Scatter. Instead of spreading elements from one process to many processes, MPI_Gather takes elements from many processes and gathers them to a single process. This routine is highly useful to many parallel

    algorithms, such as parallel sorting and searching. Below is a simple illustration of this

    algorithm.

    Similar to MPI_Scatter, MPI_Gather takes elements from each process and gathers

    them to the root process. The elements are ordered by the rank of the process from

    which they were received. The function prototype for MPI_Gather is identical to that of

    MPI_Scatter.

MPI_Gather(void* send_data, int send_count, MPI_Datatype send_datatype,
           void* recv_data, int recv_count, MPI_Datatype recv_datatype,
           int root, MPI_Comm communicator)


In MPI_Gather, only the root process needs to have a valid receive buffer. All other calling processes can pass NULL for recv_data. Also, don't forget that the recv_count parameter is the count of elements received per process, not the total sum of counts from all processes. This can often confuse beginning MPI programmers.
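As a concrete illustration, here is a small hypothetical fragment (assuming MPI_Init has been called and world_rank and world_size have already been queried) in which every rank contributes one integer and only the root allocates a receive buffer. Note that recv_count is 1, the count per process, even though the root ends up with world_size integers:

int my_value = world_rank * 10;        // one value contributed by each rank
int* gathered = NULL;
if (world_rank == 0) {
  // Only the root needs a buffer large enough for one int per process.
  gathered = (int*)malloc(sizeof(int) * world_size);
}
// recv_count is 1 because each process contributes one element, not world_size.
MPI_Gather(&my_value, 1, MPI_INT, gathered, 1, MPI_INT, 0, MPI_COMM_WORLD);
if (world_rank == 0) {
  // gathered now holds {0, 10, 20, ...} in rank order.
  free(gathered);
}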

    Computing Average of Numbers with MPI_Scatter and MPI_Gather

    In the code for this lesson, I have provided an example program that computes the

    average across all numbers in an array. The program is in avg.c. Although the program

    is quite simple, it demonstrates how one can use MPI to divide work across processes,

    perform computation on subsets of data, and then aggregate the smaller pieces into the

    final answer. The program takes the following steps:

1. Generate a random array of numbers on the root process (process 0).
2. Scatter the numbers to all processes, giving each process an equal amount of numbers.
3. Each process computes the average of its subset of the numbers.
4. Gather all the averages to the root process. The root process then computes the average of these numbers to get the final average.

    The main part of the code with the MPI calls looks like this:

if (world_rank == 0) {
  rand_nums = create_rand_nums(elements_per_proc * world_size);
}

// Create a buffer that will hold a subset of the random numbers
float* sub_rand_nums = malloc(sizeof(float) * elements_per_proc);

// Scatter the random numbers to all processes
MPI_Scatter(rand_nums, elements_per_proc, MPI_FLOAT, sub_rand_nums,
            elements_per_proc, MPI_FLOAT, 0, MPI_COMM_WORLD);

// Compute the average of your subset
float sub_avg = compute_avg(sub_rand_nums, elements_per_proc);

// Gather all partial averages down to the root process
float* sub_avgs = NULL;
if (world_rank == 0) {
  sub_avgs = malloc(sizeof(float) * world_size);
}
MPI_Gather(&sub_avg, 1, MPI_FLOAT, sub_avgs, 1, MPI_FLOAT, 0,
           MPI_COMM_WORLD);

// Compute the total average of all numbers.
if (world_rank == 0) {
  float avg = compute_avg(sub_avgs, world_size);
}

At the beginning of the code, the root process creates an array of random numbers. When MPI_Scatter is called, each process receives elements_per_proc elements of the original data. Each process computes the average of its subset of the data, and then the root process gathers each individual average. The total average is computed on this much smaller array of numbers.
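For reference, avg.c relies on two small helper functions, create_rand_nums and compute_avg. If you are writing the program from scratch, a minimal sketch along these lines would work; this is my approximation and may not match the lesson's exact versions:

#include <stdlib.h>

// Creates an array of random floats in [0, 1).
float* create_rand_nums(int num_elements) {
  float* rand_nums = (float*)malloc(sizeof(float) * num_elements);
  int i;
  for (i = 0; i < num_elements; i++) {
    rand_nums[i] = (float)rand() / (float)RAND_MAX;
  }
  return rand_nums;
}

// Computes the average of an array of floats.
float compute_avg(float* array, int num_elements) {
  float sum = 0.0f;
  int i;
  for (i = 0; i < num_elements; i++) {
    sum += array[i];
  }
  return sum / num_elements;
}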


Using the run script included in the code for this lesson, you can run the program and check the printed averages. Note that the numbers are randomly generated, so your final result will likely differ from mine.

    MPI_Allgather and Modification of Average Program

    So far, we have covered two MPI routines that perform many-to-one or one-to-many

    communication patterns, which simply means that many processes send/receive to one

    process. Oftentimes it is useful to be able to send many elements to many processes (i.e.

    a many-to-many communication pattern). MPI_Allgather has this characteristic.

    Given a set of elements distributed across all processes, MPI_Allgather will gather all

    of the elements to all the processes. In the most basic sense, MPI_Allgather is an

    MPI_Gather followed by an MPI_Bcast. The illustration below shows how data is

    distributed after a call to MPI_Allgather.

    Just like MPI_Gather, the elements from each process are gathered in order of their

    rank, except this time the elements are gathered to all processes. Pretty easy, right? The

    function declaration for MPI_Allgather is almost identical to MPI_Gather with the

    difference that there is no root process in MPI_Allgather.

MPI_Allgather(void* send_data, int send_count, MPI_Datatype send_datatype,
              void* recv_data, int recv_count, MPI_Datatype recv_datatype,
              MPI_Comm communicator)

    I have modified the average computation code to use MPI_Allgather. You can view the

    source in all_avg.c from the lesson code. The main difference in the code is shown

    below.

// Gather all partial averages down to all the processes
float* sub_avgs = (float*)malloc(sizeof(float) * world_size);
MPI_Allgather(&sub_avg, 1, MPI_FLOAT, sub_avgs, 1, MPI_FLOAT,
              MPI_COMM_WORLD);


// Compute the total average of all numbers.
float avg = compute_avg(sub_avgs, world_size);

The partial averages are now gathered to everyone using MPI_Allgather, and each process prints the final average. As you may have noticed, the only difference between all_avg.c and avg.c is that all_avg.c prints the average from every process after gathering the partial results with MPI_Allgather.

    Up Next

    In the next lesson, I will cover some of the more complex collective communication

    algorithms. Stay tuned! Feel free to leave any comments or questions about the lesson.

For all beginner lessons, go to the beginner MPI tutorial.