Class 4 - MPI Tutorial



    Beginner MPI Tutorial

    Welcome to the MPI tutorial for beginners! In this tutorial, you will learn all of the

    basic concepts of MPI by going through various examples. The different parts of the

    tutorial are meant to build on top of one another. If you feel lost during any lesson, feel
    free to leave a comment on the post explaining your dilemma. Either I or another
    MPI expert will likely be able to get back to you soon.

    This beginning tutorial assumes that the reader has a general knowledge of how parallel

    programming works, has experience with working on a Linux system, and can also

    understand the C programming language.

    Introduction

    MPI Introduction
    Installing MPICH2
    Running an MPI Hello World Application

    Blocking Point-to-Point Communication

    MPI Send and Receive
    Dynamic Receiving with MPI_Probe (and MPI_Status)
    Point-to-Point Communication Application Example - Random Walk

    Collective Communication

    MPI Broadcast and Collective Communication
    MPI Scatter, Gather, and Allgather

    MPI Introduction

    The Message Passing Interface (MPI) first appeared as a standard in 1994 for

    performing distributed-memory parallel computing. Since then, it has become the

    dominant model for high-performance computing, and it is used widely in research,

    academia, and industry.

    The functionality of MPI is extremely rich, offering the programmer the ability to

    perform: point-to-point communication, collective communication, one-sided

    communication, parallel I/O, and even dynamic process management. These terms

    probably sound quite strange to a beginner, but by the end of all of the tutorials, the

    terminology will be commonplace.

    Before starting the tutorials, familiarize yourself with the basic concepts below. These

    are all related to MPI, and many of these concepts are referred to throughout the

    tutorials.

    The Message Passing Model


    The message passing model is a model of parallel programming in which processes can

    only share data by messages. MPI adheres to this model. If one process wishes to

    transfer data to another, it must initiate a message and explicitly send data to that

    process. The other process will also have to explicitly receive the message (except

    in the case of one-sided communication, but we will get to that later).

    Forcing communication to happen in this way offers several advantages for parallel

    programs. For example, the message passing model is portable across a wide range of

    architectures. An MPI program can run across computers that are spread across the

    globe and connected by the internet, or it can execute on tightly-coupled clusters. An

    MPI program can even run on the cores of a shared-memory processor and pass

    messages through the shared memory. All of these details are abstracted by the

    interface. The debugging of these programs is often easier too, since one does not need

    to worry about processes overwriting the address space of another.

    MPI's Design for the Message Passing Model

    MPI has a couple classic concepts that encourage clear parallel program design using

    the message passing model. The first is the notion of a communicator. A communicator

    defines a group of processes that have the ability to communicate with one another. In

    this group of processes, each is assigned a unique rank, and they explicitly

    communicate with one another by their ranks.

    The foundation of communication is built upon the simple send and receive operations.

    A process may send a message to another process by providing the rank of the process

    and a unique tag to identify the message. The receiver can then post a receive for a

    message with a given tag (or it may not even care about the tag), and then handle the

    data accordingly. Communications such as this, which involve one sender and one
    receiver, are known as point-to-point communications.

    There are many cases where processes may need to communicate with everyone else.
    For example, when a master process needs to broadcast information to all of its worker

    processes. In this case, it would be cumbersome to write code that does all of the sends

    and receives. In fact, it would often not use the network in an optimal manner. MPI can

    handle a wide variety of these types of collective communications that involve all

    processes.

    Mixtures of point-to-point and collective communications can be used to create highly
    complex parallel programs. In fact, this functionality is so powerful that it is not even

    necessary to start describing the advanced mechanisms of MPI. We will save that until a

    later lesson. For now, you should work on installing MPI on your machine. If you

    already have MPI installed, great! You can head over to the MPI Hello World lesson.

    Installing MPICH2

    MPI is simply a standard which others follow in their implementation. Because of this,

    there are a wide variety of MPI implementations out there. One of the most popular

    implementations, MPICH2, will be used for all of the examples provided through this
    site. Users are free to use any implementation they wish, but only instructions for


    installing MPICH2 will be provided. Furthermore, the scripts and code provided for the

    lessons are only guaranteed to execute and run with the latest version of MPICH2.

    MPICH2 is a widely-used implementation of MPI that is developed primarily by

    Argonne National Laboratory in the United States. The main reason for choosing

    MPICH2 over other implementations is simply because of my familiarity with the
    interface and because of my close relationship with Argonne National Laboratory. I also

    encourage others to check out OpenMPI, which is also a widely-used implementation.

    Installing MPICH2

    The latest version of MPICH2 is available here. The version that I will be using for all

    of the examples on the site is 1.4, which was released June 16, 2011. Go ahead and

    download the source code, uncompress the folder, and change into the MPICH2

    directory.

    Once doing this, you should be able to configure your installation by performing

    ./configure. I added a couple of parameters to my configuration to avoid building the
    MPI Fortran library. If you need to install MPICH2 to a local directory (for example, if
    you don't have root access to your machine), type
    ./configure --prefix=/installation/directory/path. For more information about possible
    configuration parameters, type ./configure --help.

    When configuration is done, it should say "Configuration completed." Once this is

    through, it is time to build and install MPICH2 with make; sudo make install.

    If your build was successful, you should be able to type mpich2version and see

    something similar to this.


    Hopefully your build finished successfully. If not, you may have issues with missing
    dependencies. For any issue, I highly recommend copying and pasting the error

    message directly into Google.

    Running an MPI Program

    Now that you have installed MPICH, whether it's on your local machine or cluster, it is
    time to run a simple application. The MPI Hello World lesson goes over the basics of an
    MPI program, along with a guide on how to run MPICH2 for the first time.

    MPI Hello World

    In this lesson, I will show you a basic MPI Hello World application and also discuss

    how to run an MPI program. The lesson will cover the basics of initializing MPI and

    running an MPI job across several processes. This lesson is intended to work with

    installations of MPICH2 (specifically 1.4). If you have not installed MPICH2, please

    refer back to the installing MPICH2 lesson.

    MPI Hello World

    First of all, the source code for this lesson can be downloaded here. Download it, extract

    it, and change to the example directory. The directory should contain three files:

    makefile, mpi_hello_world.c, and run.perl.

    Open the mpi_hello_world.c source code. Below are some excerpts from the code.


    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char** argv) {
        // Initialize the MPI environment
        MPI_Init(NULL, NULL);

        // Get the number of processes
        int world_size;
        MPI_Comm_size(MPI_COMM_WORLD, &world_size);

        // Get the rank of the process
        int world_rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

        // Get the name of the processor
        char processor_name[MPI_MAX_PROCESSOR_NAME];
        int name_len;
        MPI_Get_processor_name(processor_name, &name_len);

        // Print off a hello world message
        printf("Hello world from processor %s, rank %d"
               " out of %d processors\n",
               processor_name, world_rank, world_size);

        // Finalize the MPI environment.
        MPI_Finalize();
    }

    You will notice that the first step to building an MPI program is including the MPI

    header files with #include <mpi.h>. After this, the MPI environment must be
    initialized with MPI_Init(NULL, NULL). During MPI_Init, all of MPI's global and
    internal variables are constructed. For example, a communicator is formed around all of
    the processes that were spawned, and unique ranks are assigned to each process.
    Currently, MPI_Init takes two arguments that are not necessary, and the extra

    parameters are simply left as extra space in case future implementations might need

    them.

    After MPI_Init, there are two main functions that are called. These two functions are

    used in almost every single MPI program that you will write.

    MPI_Comm_size(MPI_Comm communicator, int* size) - Returns the size of a
    communicator. In our example, MPI_COMM_WORLD (which is constructed
    for us by MPI) encloses all of the processes in the job, so this call should return
    the number of processes that were requested for the job.

    MPI_Comm_rank(MPI_Comm communicator, int* rank) - Returns the rank of a
    process in a communicator. Each process inside of a communicator is assigned
    an incremental rank starting from zero. The ranks of the processes are primarily
    used for identification purposes when sending and receiving messages.

    A miscellaneous and less-used function in this program is

    MPI_Get_processor_name(char* name, int* name_length), which can obtain the

    actual name of the processor on which the process is executing. The final call in this

    program, MPI_Finalize(), is used to clean up the MPI environment. No more MPI

    calls can be made after this one.


    Running MPI Hello World

    Now compile the example by typing make. My makefile looks for the MPICC
    environment variable. If you installed MPICH2 to a local directory, set your MPICC
    environment variable to point to your mpicc binary. The mpicc program in your
    installation is really just a wrapper around gcc, and it makes compiling and linking all

    of the necessary MPI routines much easier.

    After your program is compiled, it is ready to be executed. Now comes the part where

    you might have to do some additional configuration. If you are running MPI programs

    on a cluster of nodes, you will have to set up a host file. If you are simply running MPI

    on a laptop or a single machine, disregard the next piece of information.

    The host file contains names of all of the computers on which your MPI job will

    execute. For ease of execution, you should be sure that all of these computers have SSH

    access, and you should also set up an authorized keys file to avoid a password prompt

    for SSH. My host file looks like this.
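    A hosts file is simply a list of machine names, one per line. As a purely illustrative
    example (these hostnames are placeholders rather than the actual machines from my
    cluster), a hosts file for four nodes might look like this:

    node1
    node2
    node3
    node4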

    For the run script that I have provided in the download, you should set an environment

    variable called MPI_HOSTS and have it point to your hosts file. My script will

    automatically include it in the command line when the MPI job is launched. If you do

    not need a hosts file, simply do not set the environment variable. Also, if you have a

    local installation of MPI, you should set the MPIRUN environment variable to point to

    the mpirun binary from the installation. After this, call ./run.perl mpi_hello_world

    to run the example application.

    As expected, the MPI program was launched across all of the hosts in my host file. Each

    process was assigned a unique rank, which was printed off along with the process name.

    As one can see from my example output, the output of the processes is in an arbitrary

    order since there is no synchronization involved before printing.


    Notice how the script called mpirun. This is the program that the MPI implementation uses

    to launch the job. Processes are spawned across all the hosts in the host file and the MPI

    program executes across each process. My script automatically supplies the -n flag to

    set the number of MPI processes to four. Try changing the run script and launching

    more processes! Don't accidentally crash your system though.

    Now you might be asking, "My hosts are actually dual-core machines. How can I get
    MPI to spawn processes across the individual cores first before individual machines?"

    The solution is pretty simple. Just modify your hosts file and place a colon and the

    number of cores per processor after the host name. For example, I specified that each of

    my hosts has two cores.
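    Using the same placeholder hostnames as before, such a hosts file might look like this:

    node1:2
    node2:2
    node3:2
    node4:2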

    When I execute the run script again, voila, the MPI job spawns two processes on only

    two of my hosts.

    Up Next

    Now that you have a basic understanding of how an MPI program is executed, it is now

    time to learn fundamental point-to-point communication routines. In the next lesson, I

    cover basic sending and receiving routines in MPI. Feel free to also examine the

    beginner MPI tutorial for a complete reference of all of the beginning MPI lessons.

    MPI Send and Receive

    Sending and receiving are the two foundational concepts of MPI. Almost every single

    function in MPI can be implemented with basic send and receive calls. In this lesson, I

    will discuss how to use MPI's blocking sending and receiving functions, and I will also

    overview other basic concepts associated with transmitting data using MPI. The code

    for this tutorial is available here.

    Overview of Sending and Receiving with MPI

    MPI's send and receive calls operate in the following manner. First, process A decides a
    message needs to be sent to process B. Process A then packs up all of its necessary data
    into a buffer for process B. These buffers are often referred to as envelopes since the
    data is being packed into a single message before transmission (similar to how letters


    are packed into envelopes before transmission to the post office). After the data is

    packed into a buffer, the communication device (which is often a network) is

    responsible for routing the message to the proper location. The location of the message

    is defined by the process's rank.

    Even though the message is routed to B, process B still has to acknowledge that it wants
    to receive A's data. Once it does this, the data has been transmitted. Process A is then
    acknowledged that the data has been transmitted and may go back to work.

    Sometimes there are cases when A might have to send many different types of messages

    to B. Instead of B having to go through extra measures to differentiate all these

    messages, MPI allows senders and receivers to also specify message IDs with the

    message (known as tags). When process B only requests a message with a certain tag

    number, messages with different tags will be buffered by the network until B is ready

    for them.
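    To make the idea of tags concrete, below is a minimal hypothetical sketch (it is not part
    of this lesson's code download, and the variable names are placeholders). Process zero
    sends two messages with different tags, and process one asks for the tag 1 message first;
    the tag 0 message simply waits until it is requested. Note that the out-of-order receive
    relies on the small tag 0 message being buffered, a point we will revisit when discussing
    deadlock in a later lesson.

    #include <mpi.h>
    #include <stdio.h>

    // Run with exactly two processes.
    int main(int argc, char** argv) {
        MPI_Init(NULL, NULL);
        int world_rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
        int data = 7, config = 42;
        if (world_rank == 0) {
            MPI_Send(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);    // tag 0
            MPI_Send(&config, 1, MPI_INT, 1, 1, MPI_COMM_WORLD);  // tag 1
        } else if (world_rank == 1) {
            // Ask for the tag 1 message first; the tag 0 message is held until requested.
            MPI_Recv(&config, 1, MPI_INT, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Recv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("Received config %d (tag 1) before data %d (tag 0)\n", config, data);
        }
        MPI_Finalize();
        return 0;
    }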

    With these concepts in mind, let's look at the prototypes for the MPI sending and
    receiving functions.

    MPI_Send(void* data, int count, MPI_Datatype datatype, int destination,
             int tag, MPI_Comm communicator)

    MPI_Recv(void* data, int count, MPI_Datatype datatype, int source,
             int tag, MPI_Comm communicator, MPI_Status* status)

    Although this might seem like a mouthful when reading all of the arguments, they

    become easier to remember since almost every MPI call uses similar syntax. The first

    argument is the data buffer. The second and third arguments describe the count and type

    of elements that reside in the buffer. MPI_Send sends the exact count of elements, and
    MPI_Recv will receive at most the count of elements (more on this in the next lesson).

    The fourth and fifth arguments specify the rank of the sending/receiving process and the

    tag of the message. The sixth argument specifies the communicator and the last

    argument (for MPI_Recv only) provides information about the received message.

    Elementary MPI Datatypes

    The MPI_Send and MPI_Recv functions utilize MPI Datatypes as a means to specify

    the structure of a message at a higher level. For example, if the process wishes to send

    one integer to another, it would use a count of one and a datatype of MPI_INT. The

    other elementary MPI datatypes are listed below with their equivalent C datatypes.

    MPI datatype             C equivalent

    MPI_CHAR                 char
    MPI_SHORT                short int
    MPI_INT                  int
    MPI_LONG                 long int
    MPI_LONG_LONG            long long int
    MPI_UNSIGNED_CHAR        unsigned char
    MPI_UNSIGNED_SHORT       unsigned short int
    MPI_UNSIGNED              unsigned int
    MPI_UNSIGNED_LONG        unsigned long int
    MPI_UNSIGNED_LONG_LONG   unsigned long long int
    MPI_FLOAT                float
    MPI_DOUBLE               double
    MPI_LONG_DOUBLE          long double
    MPI_BYTE

    For now, we will only make use of these datatypes in the beginner MPI tutorial. Once

    we have covered enough basics, you will learn how to create your own MPI datatypes

    for characterizing more complex types of messages.

    MPI Send / Recv Program

    The code for this tutorial is available here. Go ahead and download and extract the code.

    I refer the reader back to the MPI Hello World Lesson for instructions on how to use my

    code packages.

    The first example is in send_recv.c. Some of the major parts of the program are shown

    below.

    // Find out rank, size
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    int world_size;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    int number;
    if (world_rank == 0) {
        number = -1;
        MPI_Send(&number, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (world_rank == 1) {
        MPI_Recv(&number, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("Process 1 received number %d from process 0\n",
               number);
    }

    MPI_Comm_rank and MPI_Comm_size are first used to determine the world size along
    with the rank of the process. Then process zero initializes a number to the value of
    negative one and sends this value to process one. As you can see in the else if statement,
    process one is calling MPI_Recv to receive the number. It also prints off the received
    value.

    Since we are sending and receiving exactly one integer, each process requests that one

    MPI_INT be sent/received. Each process also uses a tag number of zero to identify the

    message. The processes could have also used the predefined constant MPI_ANY_TAG

    for the tag number since only one type of message was being transmitted.

    Running the example program looks like this.


    As expected, process one receives negative one from process zero.

    MPI Ping Pong Program

    The next example is a ping pong program. In this example, processes use MPI_Send

    and MPI_Recv to continually bounce messages off of each other until they decide to

    stop. Take a look at ping_pong.c in the example code download. The major portions of

    the code look like this.

    int ping_pong_count = 0;
    int partner_rank = (world_rank + 1) % 2;
    while (ping_pong_count < PING_PONG_LIMIT) {
        if (world_rank == ping_pong_count % 2) {
            // Increment the ping pong count before you send it
            ping_pong_count++;
            MPI_Send(&ping_pong_count, 1, MPI_INT, partner_rank, 0,
                     MPI_COMM_WORLD);
            printf("%d sent and incremented ping_pong_count "
                   "%d to %d\n", world_rank, ping_pong_count,
                   partner_rank);
        } else {
            MPI_Recv(&ping_pong_count, 1, MPI_INT, partner_rank, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("%d received ping_pong_count %d from %d\n",
                   world_rank, ping_pong_count, partner_rank);
        }
    }

    This example is meant to be executed with only two processes. The processes first
    determine their partner with some simple arithmetic. A ping_pong_count is initialized to
    zero and it is incremented at each ping pong step by the sending process. As the
    ping_pong_count is incremented, the processes take turns being the sender and receiver.
    Finally, after the limit is reached (ten in my code), the processes stop sending and
    receiving. The output of the example code will look something like this.


    The output of your program will likely be different from mine. However, as you can see,
    processes zero and one take turns sending and receiving the ping pong counter
    to each other.

    Ring Program

    I have included one more example of MPI_Send and MPI_Recv using more than two

    processes. In this example, a value is passed around by all processes in a ring-like
    fashion. Take a look at ring.c in the example code download. The major portion of the

    code looks like this.

    int token;
    if (world_rank != 0) {
        MPI_Recv(&token, 1, MPI_INT, world_rank - 1, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Process %d received token %d from process %d\n",
               world_rank, token, world_rank - 1);
    } else {
        // Set the token's value if you are process 0
        token = -1;
    }
    MPI_Send(&token, 1, MPI_INT, (world_rank + 1) % world_size,
             0, MPI_COMM_WORLD);
    // Now process 0 can receive from the last process.
    if (world_rank == 0) {
        MPI_Recv(&token, 1, MPI_INT, world_size - 1, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Process %d received token %d from process %d\n",
               world_rank, token, world_size - 1);
    }

    The ring program initializes a value from process zero, and the value is passed around
    every single process. The program terminates when process zero receives the value


    from the last process. As you can see from the program, extra care is taken to assure that

    it doesn't deadlock. In other words, process zero makes sure that it has completed its

    first send before it tries to receive the value from the last process. All of the other

    processes simply call MPI_Recv (receiving from their neighboring lower process) and

    then MPI_Send (sending the value to their neighboring higher process) to pass the value

    along the ring.

    MPI_Send and MPI_Recv will block until the message has been transmitted. Because of

    this, the printfs should occur in the order in which the value is passed. Using five

    processes, the output should look like this.

    As we can see, process zero first sends a value of negative one to process one. This

    value is passed around the ring until it gets back to process zero.

    Up Next

    Now that you have a basic understanding of MPI_Send and MPI_Recv, it is now time to

    go a little bit deeper into these functions. In the next lesson, I cover how to probe and

    dynamically receive messages. Feel free to also examine the beginner MPI tutorial for a

    complete reference of all of the beginning MPI lessons.

    Dynamic Receiving with MPI_Probe (and
    MPI_Status)

    In the previous lesson, I discussed how to use MPI_Send and MPI_Recv to perform

    standard point-to-point communication. I only covered how to send messages in which

    the length of the message was known beforehand. Although it is possible to send the

    length of the message as a separate send / recv operation, MPI natively supports

    dynamic messages with just a few additional function calls. I will be going over how to
    use these functions in this lesson. The code for this tutorial is located here.

    The MPI_Status Structure

    As covered in the previous lesson, the MPI_Recv operation takes the address of an

    MPI_Status structure as an argument (which can be ignored with

    MPI_STATUS_IGNORE). If we pass an MPI_Status structure to the MPI_Recv

    function, it will be populated with additional information about the receive operation

    after it completes. The three primary pieces of information include:


    1. The rank of the sender. The rank of the sender is stored in the MPI_SOURCE
       element of the structure. That is, if we declare an MPI_Status stat variable,
       the rank can be accessed with stat.MPI_SOURCE.
    2. The tag of the message. The tag of the message can be accessed by the
       MPI_TAG element of the structure (similar to MPI_SOURCE).
    3. The length of the message. The length of the message does not have a
       predefined element in the status structure. Instead, we have to find out the
       length of the message with

       MPI_Get_count(MPI_Status* status, MPI_Datatype datatype, int* count)

       where count is the total number of datatype elements that were received.

    Why would any of this information be necessary? It turns out that MPI_Recv can take

    MPI_ANY_SOURCE for the rank of the sender and MPI_ANY_TAG for the tag of the

    message. For this case, the MPI_Status structure is the only way to find out the actual

    sender and tag of the message. Furthermore, MPI_Recv is not guaranteed to receive the
    entire amount of elements passed as the argument to the function call. Instead, it

    receives the amount of elements that were sent to it (and returns an error if more

    elements were sent than the desired receive amount). The MPI_Get_count function is

    used to determine the actual receive amount.
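    For instance, a receiver that does not know in advance who will message it might post a
    wildcard receive like the following hypothetical fragment (the buffer size and variable
    names are placeholders, and MPI is assumed to already be initialized) and then inspect
    the status afterwards:

    int buf[100];
    MPI_Status status;
    MPI_Recv(buf, 100, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
             MPI_COMM_WORLD, &status);
    // The wildcards are resolved by looking inside the status structure.
    int received;
    MPI_Get_count(&status, MPI_INT, &received);
    printf("Received %d ints from rank %d with tag %d\n",
           received, status.MPI_SOURCE, status.MPI_TAG);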

    An Example of Querying the MPI_Status Structure

    The program that queries the MPI_Status structure, check_status.c, is provided in the

    example code. The program sends a random amount of numbers to a receiver, and the

    receiver then finds out how many numbers were sent. The main part of the code looks

    like this.

    const int MAX_NUMBERS = 100;
    int numbers[MAX_NUMBERS];
    int number_amount;
    if (world_rank == 0) {
        // Pick a random amount of integers to send to process one
        srand(time(NULL));
        number_amount = (rand() / (float)RAND_MAX) * MAX_NUMBERS;
        // Send the amount of integers to process one
        MPI_Send(numbers, number_amount, MPI_INT, 1, 0, MPI_COMM_WORLD);
        printf("0 sent %d numbers to 1\n", number_amount);
    } else if (world_rank == 1) {
        MPI_Status status;
        // Receive at most MAX_NUMBERS from process zero
        MPI_Recv(numbers, MAX_NUMBERS, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 &status);
        // After receiving the message, check the status to determine
        // how many numbers were actually received
        MPI_Get_count(&status, MPI_INT, &number_amount);
        // Print off the amount of numbers, and also print additional
        // information in the status object
        printf("1 received %d numbers from 0. Message source = %d, "
               "tag = %d\n",
               number_amount, status.MPI_SOURCE, status.MPI_TAG);
    }

    As we can see, process zero randomly sends up to MAX_NUMBERS integers to

    process one. Process one then calls MPI_Recv for a total of MAX_NUMBERS integers.

    Although process one is passing MAX_NUMBERS as the argument to MPI_Recv,

    process one will receive at most this amount of numbers. In the code, process one calls
    MPI_Get_count with MPI_INT as the datatype to find out how many integers were

    actually received. Along with printing off the size of the received message, process one

    also prints off the source and tag of the message by accessing the MPI_SOURCE and

    MPI_TAG elements of the status structure.

    As a clarification, the return value from MPI_Get_count is relative to the datatype

    which is passed. If the user were to use MPI_CHAR as the datatype, the returned

    amount would be four times as large (assuming an integer is four bytes and a char is one

    byte).
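    In other words, using the status object from the example above, something like the
    following hypothetical fragment would report both views of the same message:

    // Count the same received message two different ways.
    int count_as_ints, count_as_chars;
    MPI_Get_count(&status, MPI_INT, &count_as_ints);
    MPI_Get_count(&status, MPI_CHAR, &count_as_chars);
    // With four-byte integers, count_as_chars should be 4 * count_as_ints.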

    If you run the check_status program, the output should look similar to this.

    As expected, process zero sends a random amount of integers to process one, which

    prints off information about the received message.

    Using MPI_Probe to Find Out the Message Size

    Now that you understand how the MPI_Status object works, we can now use it to our

    advantage a little bit more. Instead of posting a receive and simply providing a really

    large buffer to handle all possible sizes of messages (as we did in the last example), you

    can use MPI_Probe to query the message size before actually receiving it. The function

    prototype looks like this.

    MPI_Probe(int source, int tag, MPI_Comm comm, MPI_Status* status)

    MPI_Probe looks quite similar to MPI_Recv. In fact, you can think of MPI_Probe as an

    MPI_Recv that does everything but receive the message. Similar to MPI_Recv,

    MPI_Probe will block for a message with a matching tag and sender. When the message

    is available, it will fill the status structure with information. The user can then use

    MPI_Recv to receive the actual message.

    The provided code has an example of this in probe.c. Here's what the main source code

    looks like.

    int number_amount;
    if (world_rank == 0) {
        const int MAX_NUMBERS = 100;
        int numbers[MAX_NUMBERS];
        // Pick a random amount of integers to send to process one
        srand(time(NULL));
        number_amount = (rand() / (float)RAND_MAX) * MAX_NUMBERS;
        // Send the random amount of integers to process one
        MPI_Send(numbers, number_amount, MPI_INT, 1, 0, MPI_COMM_WORLD);
        printf("0 sent %d numbers to 1\n", number_amount);
    } else if (world_rank == 1) {
        MPI_Status status;
        // Probe for an incoming message from process zero
        MPI_Probe(0, 0, MPI_COMM_WORLD, &status);
        // When probe returns, the status object has the size and other
        // attributes of the incoming message. Get the size of the message
        MPI_Get_count(&status, MPI_INT, &number_amount);
        // Allocate a buffer just big enough to hold the incoming numbers
        int* number_buf = (int*)malloc(sizeof(int) * number_amount);
        // Now receive the message with the allocated buffer
        MPI_Recv(number_buf, number_amount, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("1 dynamically received %d numbers from 0.\n",
               number_amount);
        free(number_buf);
    }

    Similar to the last example, process zero picks a random amount of numbers to send to

    process one. What is different in this example is that process one now calls MPI_Probe

    to find out how many elements process zero is trying to send (using MPI_Get_count).
    Process one then allocates a buffer of the proper size and receives the numbers. Running

    the code will look similar to this.

    Although this example is trivial, MPI_Probe forms the basis of many dynamic MPI

    applications. For example, master / slave programs will often make heavy use of

    MPI_Probe when exchanging variable-sized worker messages. As an exercise, make a
    wrapper around MPI_Recv that uses MPI_Probe for any dynamic applications you
    might write. It makes the code look much nicer.
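    As a starting point for that exercise, here is one possible shape for such a wrapper. This
    is only a sketch of my own (it is not in the lesson's code download, and the function name
    recv_dynamic is a placeholder): it probes first, allocates exactly enough space, and hands
    the buffer and element count back to the caller. MPI_Type_size is a standard MPI call
    that returns the size in bytes of a datatype.

    #include <mpi.h>
    #include <stdlib.h>

    // Receive a message whose length is not known in advance.
    // The caller is responsible for freeing *data afterwards.
    void recv_dynamic(void** data, int* count, MPI_Datatype datatype,
                      int source, int tag, MPI_Comm comm) {
        MPI_Status status;
        // Block until a matching message is available and inspect its size
        MPI_Probe(source, tag, comm, &status);
        MPI_Get_count(&status, datatype, count);
        // Allocate a buffer of exactly the right size
        int type_size;
        MPI_Type_size(datatype, &type_size);
        *data = malloc((size_t)(*count) * (size_t)type_size);
        // Receive the probed message (use the probed source and tag so that
        // wildcards such as MPI_ANY_SOURCE also work)
        MPI_Recv(*data, *count, datatype, status.MPI_SOURCE, status.MPI_TAG,
                 comm, MPI_STATUS_IGNORE);
    }

    A caller would then write something like int* numbers; int n;
    recv_dynamic((void**)&numbers, &n, MPI_INT, 0, 0, MPI_COMM_WORLD); and free the
    buffer when it is done with the data.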

    Next

    Do you feel comfortable using the standard blocking point-to-point communication

    routines? If so, then you already have the ability to write endless amounts of parallel

    applications! Let's look at a more advanced example of using the routines you have

    learned. Check out the application example using MPI_Send, MPI_Recv, and

    MPI_Probe.


    Point-to-Point Communication
    Application - Random Walk

    It's time to go through an application example using some of the concepts introduced in

    the sending and receiving tutorial and the MPI_Probe and MPI_Status lesson. The code

    for the application can be downloaded here. The application simulates a process which I

    refer to as random walking. The basic problem definition of a random walk is as

    follows. Given a Min, Max, and random walker W, make walker W take S random walks
    of arbitrary length to the right. If the process goes out of bounds, it wraps back around.
    W can only move one unit to the right or left at a time.

    Although the application in itself is very basic, the parallelization of random walking

    can simulate the behavior of a wide variety of parallel applications. More on that later.

    For now, let's overview how to parallelize the random walk problem.

    Parallelization of the Random Walking Problem

    Our first task, which is pertinent to many parallel programs, is splitting the domain

    across processes. The random walk problem has a one-dimensional domain of size Max -
    Min + 1 (since Max and Min are inclusive to the walker). Assuming that walkers can
    only take integer-sized steps, we can easily partition the domain into near-equal-sized
    chunks across processes. For example, if Min is 0 and Max is 20 and we have four

    processes, the domain would be split like this.

    The first three processes own five units of the domain while the last process takes the
    last five units plus the one remaining unit.

    Once the domain has been partitioned, the application will initialize walkers. As

    explained earlier, a walker will take S walks with a random total walk size. For

    example, if the walker takes a walk of size six on process zero (using the previous

    domain decomposition), the execution of the walker will go like this:

    1. The walker starts taking incremental steps. When it hits value four, however, it
       has reached the end of the bounds of process zero. Process zero now has to

    communicate the walker to process one.


    2. Process one receives the walker and continues walking until it has reached its
       total walk size of six. The walker can then proceed on a new random walk.

    In this example, W only had to be communicated one time from process zero to process
    one. If W had to take a longer walk, however, it may have needed to be passed through

    more processes along its path through the domain.

    Coding the Application using MPI_Send and MPI_Recv

    This application can be coded using MPI_Send and MPI_Recv. Before we begin

    looking at code, let's establish some preliminary characteristics and functions of the

    program:

    - Each process determines their part of the domain.
    - Each process initializes exactly N walkers, all of which start at the first value of
      their local domain.
    - Each walker has two associated integer values: the current position of the walker
      and the number of steps left to take.
    - Walkers start traversing through the domain and are passed to other processes
      until they have completed their walk.
    - The processes terminate when all walkers have finished.

    Let's begin by writing code for the domain decomposition. The function will take in the

    total domain size and find the appropriate subdomain for the MPI process. It will also

    give any remainder of the domain to the final process. For simplicity, I just call

    MPI_Abort for any errors that are found. The function, called decompose_domain,

    looks like this:

    void decompose_domain(int domain_size, int world_rank,
                          int world_size, int* subdomain_start,
                          int* subdomain_size) {
        if (world_size > domain_size) {
            // Don't worry about this special case. Assume the domain size
            // is greater than the world size.
            MPI_Abort(MPI_COMM_WORLD, 1);
        }
        *subdomain_start = domain_size / world_size * world_rank;
        *subdomain_size = domain_size / world_size;
        if (world_rank == world_size - 1) {
            // Give remainder to last process
            *subdomain_size += domain_size % world_size;
        }
    }


    As you can see, the function splits the domain into even chunks, taking care of the case

    when a remainder is present. The function returns a subdomain start and a subdomain

    size.
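    As a quick sanity check of the arithmetic, here is what the function computes for the
    earlier example of Min = 0 and Max = 20 (domain_size = 21) with four processes:

    // decompose_domain with domain_size = 21 and world_size = 4:
    //   rank 0: subdomain_start = 0,  subdomain_size = 5
    //   rank 1: subdomain_start = 5,  subdomain_size = 5
    //   rank 2: subdomain_start = 10, subdomain_size = 5
    //   rank 3: subdomain_start = 15, subdomain_size = 5 + (21 % 4) = 6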

    Next, we need to create a function that initializes walkers. We first define a walker

    structure that looks like this:

    typedef struct {
        int location;
        int num_steps_left_in_walk;
    } Walker;

    Our initialization function, called initialize_walkers, takes the subdomain bounds and

    adds walkers to an incoming_walkers vector (by the way, this application is in C++).

    void initialize_walkers(int num_walkers_per_proc, int max_walk_size,
                            int subdomain_start, int subdomain_size,
                            vector<Walker>* incoming_walkers) {
        Walker walker;
        for (int i = 0; i < num_walkers_per_proc; i++) {
            // Initialize walkers at the start of the subdomain
            walker.location = subdomain_start;
            walker.num_steps_left_in_walk =
                (rand() / (float)RAND_MAX) * max_walk_size;
            incoming_walkers->push_back(walker);
        }
    }

    After initialization, it is time to progress the walkers. Let's start off by making a
    walking function. This function is responsible for progressing the walker until it has
    finished its walk. If it goes out of local bounds, it is added to the outgoing_walkers

    vector.

    void walk(Walker* walker, int subdomain_start, int subdomain_size,
              int domain_size, vector<Walker>* outgoing_walkers) {
        while (walker->num_steps_left_in_walk > 0) {
            if (walker->location == subdomain_start + subdomain_size) {
                // Take care of the case when the walker is at the end
                // of the domain by wrapping it around to the beginning
                if (walker->location == domain_size) {
                    walker->location = 0;
                }
                outgoing_walkers->push_back(*walker);
                break;
            } else {
                walker->num_steps_left_in_walk--;
                walker->location++;
            }
        }
    }

    Now that we have established an initialization function (that populates an incoming

    walker list) and a walking function (that populates an outgoing walker list), we only

    need two more functions: a function that sends outgoing walkers and a function that
    receives incoming walkers. The sending function looks like this:


    void send_outgoing_walkers(vector<Walker>* outgoing_walkers,
                               int world_rank, int world_size) {
        // Send the data as an array of MPI_BYTEs to the next process.
        // The last process sends to process zero.
        MPI_Send((void*)outgoing_walkers->data(),
                 outgoing_walkers->size() * sizeof(Walker), MPI_BYTE,
                 (world_rank + 1) % world_size, 0, MPI_COMM_WORLD);
        // Clear the outgoing walkers
        outgoing_walkers->clear();
    }

    The function that receives incoming walkers should use MPI_Probe since it does not

    know beforehand how many walkers it will receive. This is what it looks like:

    void receive_incoming_walkers(vector<Walker>* incoming_walkers,
                                  int world_rank, int world_size) {
        // Probe for new incoming walkers
        MPI_Status status;
        // Receive from the process before you. If you are process zero,
        // receive from the last process
        int incoming_rank =
            (world_rank == 0) ? world_size - 1 : world_rank - 1;
        MPI_Probe(incoming_rank, 0, MPI_COMM_WORLD, &status);
        // Resize your incoming walker buffer based on how much data is
        // being received
        int incoming_walkers_size;
        MPI_Get_count(&status, MPI_BYTE, &incoming_walkers_size);
        incoming_walkers->resize(incoming_walkers_size / sizeof(Walker));
        MPI_Recv((void*)incoming_walkers->data(), incoming_walkers_size,
                 MPI_BYTE, incoming_rank, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    Now we have established the main functions of the program. We have to tie all of these
    functions together as follows:

    1. Initialize the walkers.
    2. Progress the walkers with the walk function.
    3. Send out any walkers in the outgoing_walkers vector.
    4. Receive new walkers and put them in the incoming_walkers vector.
    5. Repeat steps two through four until all walkers have finished.

    The first attempt at writing this program is below. For now, we will not worry about

    how to determine when all walkers have finished. Before you look at the code, I must

    warn you this code is incorrect! With this in mind, let's look at my code and hopefully

    you can see what might be wrong with it.

    // Find your part of the domain
    decompose_domain(domain_size, world_rank, world_size,
                     &subdomain_start, &subdomain_size);
    // Initialize walkers in your subdomain
    initialize_walkers(num_walkers_per_proc, max_walk_size,
                       subdomain_start, subdomain_size,
                       &incoming_walkers);
    while (!all_walkers_finished) {  // Determine walker completion later
        // Process all incoming walkers
        for (int i = 0; i < incoming_walkers.size(); i++) {
            walk(&incoming_walkers[i], subdomain_start, subdomain_size,
                 domain_size, &outgoing_walkers);
        }
        // Send all outgoing walkers to the next process.
        send_outgoing_walkers(&outgoing_walkers, world_rank,
                              world_size);
        // Receive all the new incoming walkers
        receive_incoming_walkers(&incoming_walkers, world_rank,
                                 world_size);
    }

    Everything looks normal, but the order of function calls has introduced a very likely
    scenario: deadlock.

    Deadlock and Prevention

    According to Wikipedia, deadlock "refers to a specific condition when two or more
    processes are each waiting for the other to release a resource, or more than two
    processes are waiting for resources in a circular chain." In our case, the above code

    will result in a circular chain of MPI_Send calls.

    It is worth noting that the above code will actually not deadlock most of the time.

    Although MPI_Send is a blocking call, the MPI specification says that MPI_Send

    blocks until the send buffer can be reclaimed. This means that MPI_Send will return

    when the network can buffer the message. If the sends eventually can't be buffered by

    the network, they will block until a matching receive is posted. In our case, there are

    enough small sends and frequent matching receives to not worry about deadlock;

    however, a big enough network buffer should never be assumed.

    Since we are only focusing on MPI_Send and MPI_Recv in this lesson, the best way to

    avoid the possible sending and receiving deadlock is to order the messaging such that
    sends will have matching receives and vice versa. One easy way to do this is to change

    our loop around such that even-numbered processes send outgoing walkers before

    receiving walkers and odd-numbered processes do the opposite. Given two stages of

    execution, the sending and receiving will now look like this:


    Note - Executing this with one process can still deadlock. To avoid this, simply
    don't perform sends and receives when using one process. You may be asking, "Does
    this still work with an odd number of processes?" We can go through a similar diagram

    again with three processes:

    As you can see, at all three stages, there is at least one posted MPI_Send that matches a
    posted MPI_Recv, so we don't have to worry about the occurrence of deadlock.

    Determining Completion of All Walkers

    Now comes the final step of the program: determining when every single walker has

    finished. Since walkers can walk for a random length, they can finish their journey on

    any process. Because of this, it is difficult for all processes to know when all walkers

    have finished without some sort of additional communication. One possible solution is

    to have process zero keep track of all of the walkers that have finished and then tell all

    the other processes when to terminate. This solution, however, is quite cumbersome
    since each process would have to report any completed walkers to process zero and then

    also handle different types of incoming messages.

    For this lesson, we will keep things simple. Since we know the maximum distance that

    any walker can travel and the smallest total size it can travel for each pair of sends and

    receives (the subdomain size), we can figure out the amount of sends and receives each

    process should do before termination. Using this characteristic of the program along

    with our strategy to avoid deadlock, the final main part of the program looks like this:

    // Find your part of the domain
    decompose_domain(domain_size, world_rank, world_size,
                     &subdomain_start, &subdomain_size);
    // Initialize walkers in your subdomain
    initialize_walkers(num_walkers_per_proc, max_walk_size,
                       subdomain_start, subdomain_size,
                       &incoming_walkers);
    // Determine the maximum amount of sends and receives needed to
    // complete all walkers
    int maximum_sends_recvs =
        max_walk_size / (domain_size / world_size) + 1;
    for (int m = 0; m < maximum_sends_recvs; m++) {
        // Process all incoming walkers
        for (int i = 0; i < incoming_walkers.size(); i++) {
            walk(&incoming_walkers[i], subdomain_start, subdomain_size,
                 domain_size, &outgoing_walkers);
        }
        // Send and receive if you are even and vice versa for odd
        if (world_rank % 2 == 0) {
            send_outgoing_walkers(&outgoing_walkers, world_rank,
                                  world_size);
            receive_incoming_walkers(&incoming_walkers, world_rank,
                                     world_size);
        } else {
            receive_incoming_walkers(&incoming_walkers, world_rank,
                                     world_size);
            send_outgoing_walkers(&outgoing_walkers, world_rank,
                                  world_size);
        }
    }

    Running the Application

    The code for the application can be downloaded here. In contrast to the other lessons,
    this code uses C++. When installing MPICH2, you also installed the C++ MPI compiler

    (unless you explicitly configured it otherwise). If you installed MPICH2 in a local

    directory, make sure that you have set your MPICXX environment variable to point to

    the correct mpicxx compiler in order to use my makefile.

    In my code, I have set up the run script to provide default values for the program: 100

    for the domain size, 500 for the maximum walk size, and 20 for the number of walkers

    per process. The run script should spawn five MPI processes, and the output should

    look similar to this:


    The output continues until processes finish all sending and receiving of all walkers.

    So What's Next?

    If you have made it through this entire application and feel comfortable, then good! This

    application is quite advanced for a first real application. If you still don't feel
    comfortable with MPI_Send, MPI_Recv, and MPI_Probe, I'd recommend going

    through some of the examples in my recommended books for more practice.

    Next, we will start learning about collective communication in MPI, so stay tuned!

    Also, at the beginning, I told you that the concepts of this program are applicable to

    many parallel programs. I don't want to leave you hanging, so I have included some
    additional reading material below for anyone that wishes to learn more. Enjoy!

    ADDITIONAL READING

    Random Walking and Its Similarity to Parallel Particle Tracing

    The random walk problem that we just coded, although

    seemingly trivial, can actually form the basis of simulating

    many types of parallel applications. Some parallel applications

    in the scientific domain require many types of randomized sends

    and receives. One example application is parallel particle

    tracing.

    Parallel particle tracing is one of the primary methods that are

    used to visualize flow fields. Particles are inserted into the flow

    field and then traced along the flow using numerical integration

    techniques (such as Runge-Kutta). The traced paths can then be

    rendered for visualization purposes. One example rendering is of the tornado image at

    the top left.


    Performing efficient parallel particle tracing can be very difficult. The main reason for

    this is because the direction in which particles travel can only be determined after each

    incremental step of the integration. Therefore, it is hard for processes to coordinate and

    balance all communication and computation. To understand this better, lets look at a

    typical parallelization of particle tracing.


    In this illustration, we see that the domain is split among six processes. Particles

    (sometimes referred to as seeds) are then placed in the subdomains (similar to how we

    placed walkers in subdomains), and then they begin tracing. When particles go out of

    bounds, they have to be exchanged with processes which have the proper subdomain.

    This process is repeated until the particles have either left the entire domain or have

    reached a maximum trace length.

    The parallel particle tracing problem can be solved with MPI_Send, MPI_Recv, and

    MPI_Probe in a similar manner to our application that we just coded. There are,

    however, much more sophisticated MPI routines that can get the job done more

    efficiently. We will talk about these in the coming lessons.

    I hope you can now see at least one example of how the random walk problem is similar

    to other parallel applications. Stay tuned for more lessons and applications!

    MPI Broadcast and Collective Communication

    So far in the beginner MPI tutorial, we have examined point-to-point communication,

    which is communication between two processes. This lesson is the start of the collective

    communication section. Collective communication is a method of communication

    which involves participation of all processes in a communicator. In this lesson, we will

    discuss the implications of collective communication and go over a standard collective

    routine: broadcasting. The code for the lesson can be downloaded here.

    Collective Communication and Synchronization Points

    One of the things to remember about collective communication is that it implies a

    synchronization point among processes. This means that all processes must reach a

    point in their code before they can all begin executing again.

    Before going into detail about collective communication routines, let's examine

    synchronization in more detail. As it turns out, MPI has a special function that is

    dedicated to synchronizing processes:

    MPI_Barrier(MPI_Comm communicator)

    The name of the function is quite descriptive: the function forms a barrier, and no

    processes in the communicator can pass the barrier until all of them call the function.

    Here's an illustration. Imagine the horizontal axis represents execution of the program

    and the circles represent different processes:


    Process zero first calls MPI_Barrier at the first time snapshot (T1). While process zero
    is hung up at the barrier, processes one and three eventually make it (T2). When process
    two finally makes it to the barrier (T3), all of the processes then begin execution again
    (T4).

    MPI_Barrier can be useful for many things. One of the primary uses of MPI_Barrier is

    to synchronize a program so that portions of the parallel code can be timed accurately.

    Want to know how MPI_Barrier is implemented? Sure you do! Do you remember the

    ring program from the MPI_Send and MPI_Recv tutorial? To refresh your memory, we

    wrote a program that passed a token around all processes in a ring-like fashion. This

    type of program is one of the simplest methods to implement a barrier since a token

    can't be passed around completely until all processes work together.
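    To make that idea concrete, here is a hedged sketch of a ring-style barrier built only from
    the send and receive calls we already know. This is my own illustration rather than the
    actual MPI_Barrier implementation; the token travels around the ring twice so that no
    process can leave before every process has arrived.

    void ring_barrier(MPI_Comm comm) {
        int rank, size, token = 0;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);
        if (size == 1) return;  // nothing to synchronize
        // Lap one proves that every process has entered the barrier.
        // Lap two releases every process once process zero knows this.
        for (int lap = 0; lap < 2; lap++) {
            if (rank == 0) {
                MPI_Send(&token, 1, MPI_INT, 1, 0, comm);
                MPI_Recv(&token, 1, MPI_INT, size - 1, 0, comm,
                         MPI_STATUS_IGNORE);
            } else {
                MPI_Recv(&token, 1, MPI_INT, rank - 1, 0, comm,
                         MPI_STATUS_IGNORE);
                MPI_Send(&token, 1, MPI_INT, (rank + 1) % size, 0, comm);
            }
        }
    }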

    One final note about synchronization - Always remember that every collective call you
    make is synchronized. In other words, if you can't successfully complete an
    MPI_Barrier, then you also can't successfully complete any collective call. If you try to
    call MPI_Barrier or other collective routines without ensuring all processes in the
    communicator will also call it, your program will idle. This can be very confusing for
    beginners, so be careful!

    Broadcasting with MPI_Bcast

    A broadcast is one of the standard collective communication techniques. During a

    broadcast, one process sends the same data to all processes in a communicator. One of

    the main uses of broadcasting is to send out user input to a parallel program, or send out

    configuration parameters to all processes.

    The communication pattern of a broadcast looks like this:


    In this example, process zero is the root process, and it has the initial copy of data. All

    of the other processes receive the copy of data.

    In MPI, broadcasting can be accomplished by using MPI_Bcast. The function prototype

    looks like this:

    MPI_Bcast(void* data, int count, MPI_Datatype datatype, int root,
              MPI_Comm communicator)

    Although the root process and receiver processes do different jobs, they all call the same

    MPI_Bcast function. When the root process (in our example, it was process zero) calls

    MPI_Bcast, the data variable will be sent to all other processes. When all of the receiver

    processes call MPI_Bcast, the data variable will be filled in with the data from the root

    process.
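    As a minimal usage sketch (a hypothetical fragment of my own, assuming MPI has
    already been initialized), every process executes the very same broadcast line, and
    afterwards every process holds the value that the root started with:

    int value = 0;
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    if (world_rank == 0) {
        // Only the root has meaningful data before the broadcast
        value = 100;
    }
    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);
    // Every process in MPI_COMM_WORLD now has value == 100.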

    Broadcasting with MPI_Send and MPI_Recv

    At first, it might seem that MPI_Bcast is just a simple wrapper around MPI_Send and

    MPI_Recv. In fact, we can make this wrapper function right now. Our function, called

    my_bcast, can be downloaded in the example code for this lesson (my_bcast.c). It takes
    the same arguments as MPI_Bcast and looks like this:

    void my_bcast(void* data, int count, MPI_Datatype datatype, int root,
                  MPI_Comm communicator) {
        int world_rank;
        MPI_Comm_rank(communicator, &world_rank);
        int world_size;
        MPI_Comm_size(communicator, &world_size);

        if (world_rank == root) {
            // If we are the root process, send our data to everyone
            int i;
            for (i = 0; i < world_size; i++) {
                if (i != world_rank) {
                    MPI_Send(data, count, datatype, i, 0, communicator);
                }
            }
        } else {
            // If we are a receiver process, receive the data from the root
            MPI_Recv(data, count, datatype, root, 0, communicator,
                     MPI_STATUS_IGNORE);
        }
    }


    The root process sends the data to everyone else while the others receive from the root

    process. Easy, right? If you download the code and run the program, the program will

    print output like this:

    Believe it or not, our function is actually very inefficient! Imagine that each process has

    only one outgoing/incoming network link. Our function is only using one network link

    from process zero to send all the data. A smarter implementation is a tree-based

    communication algorithm that can use more of the available network links at once. For

    example:

    In this illustration, process zero starts off with the data and sends it to process one.
    Similar to our previous example, process zero also sends the data to process two in the

    second stage. The difference with this example is that process one is now helping out

    the root process by forwarding the data to process three. During the second stage, two

    network connections are being utilized at a time. The network utilization doubles at

    every subsequent stage of the tree communication until all processes have received the

    data.

Do you think you can code this? Writing a full implementation is a bit outside the scope of this lesson, but a rough sketch follows. If you are feeling brave, Parallel Programming with MPI is an excellent book with a complete, worked example of the problem.
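Here is a minimal sketch of one possible binomial-tree broadcast. This is my own illustrative version, not MPI's internal implementation or the book's code; for simplicity it assumes the root is rank zero, so it omits the root argument from the my_bcast signature:

// Tree broadcast sketch: assumes the root is rank 0.
void tree_bcast(void* data, int count, MPI_Datatype datatype,
                MPI_Comm communicator) {
  int world_rank, world_size;
  MPI_Comm_rank(communicator, &world_rank);
  MPI_Comm_size(communicator, &world_size);

  // In stage s (mask = 1, 2, 4, ...), every rank below mask already holds
  // the data and forwards it to the partner rank (world_rank + mask).
  int mask;
  for (mask = 1; mask < world_size; mask <<= 1) {
    if (world_rank < mask) {
      int partner = world_rank + mask;
      if (partner < world_size) {
        MPI_Send(data, count, datatype, partner, 0, communicator);
      }
    } else if (world_rank < (mask << 1)) {
      // Ranks in [mask, 2*mask) receive the data during this stage.
      MPI_Recv(data, count, datatype, world_rank - mask, 0, communicator,
               MPI_STATUS_IGNORE);
    }
  }
}

Each stage doubles the number of processes holding the data, which is exactly the doubling of network utilization described above.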

    Comparison of MPI_Bcast with MPI_Send and MPI_Recv

    The MPI_Bcast implementation utilizes a similar tree broadcast algorithm for good

    network utilization. How does our broadcast function compare to MPI_Bcast? We can

    run compare_bcast, an example program included in the lesson code. Before looking at

the code, let's first go over one of MPI's timing functions, MPI_Wtime. MPI_Wtime takes no arguments and simply returns a floating-point number of seconds since a fixed time in the past. Similar to C's time function, you can call MPI_Wtime at multiple points in your program and subtract the returned values to obtain the timing of code segments.
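For instance, a minimal timing pattern looks like the fragment below (assumed to run inside an already-initialized MPI program); compute_something is just a placeholder for whatever code segment you want to measure:

double start = MPI_Wtime();
compute_something();                  // hypothetical code segment to measure
double elapsed = MPI_Wtime() - start;
printf("Segment took %f seconds\n", elapsed);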


Let's take a look at the code that compares my_bcast to MPI_Bcast:

for (i = 0; i < num_trials; i++) {
  // Time my_bcast
  // Synchronize before starting timing
  MPI_Barrier(MPI_COMM_WORLD);
  total_my_bcast_time -= MPI_Wtime();
  my_bcast(data, num_elements, MPI_INT, 0, MPI_COMM_WORLD);
  // Synchronize again before obtaining final time
  MPI_Barrier(MPI_COMM_WORLD);
  total_my_bcast_time += MPI_Wtime();

  // Time MPI_Bcast
  MPI_Barrier(MPI_COMM_WORLD);
  total_mpi_bcast_time -= MPI_Wtime();
  MPI_Bcast(data, num_elements, MPI_INT, 0, MPI_COMM_WORLD);
  MPI_Barrier(MPI_COMM_WORLD);
  total_mpi_bcast_time += MPI_Wtime();
}

    In this code, num_trials is a variable stating how many timing experiments should be

    executed. We keep track of the accumulated time of both functions in two different

    variables. The average times are printed at the end of the program. To see the entire

    code, just download the lesson code and look at compare_bcast.c.

When you use the run script to execute the code, it runs with 16 processors, 100,000 integers per broadcast, and 10 trial runs for the timing results. In my experiment, with 16 processors connected via Ethernet, there are significant timing differences between our naive implementation and MPI's implementation. Here is what the timing results (in seconds) look like at various scales.

Processors    my_bcast    MPI_Bcast
 2            0.0344      0.0344
 4            0.1025      0.0817
 8            0.2385      0.1084
16            0.5109      0.1296

As you can see, there is no difference between the two implementations with two processors. This is because MPI_Bcast's tree implementation does not provide any additional network utilization when only two processors are used. However, the differences can clearly be observed when scaling up to as few as 16 processors.


    Try running the code yourself and experiment at larger scales!

    Conclusions / Up Next

Feel a little better about collective routines? In the next MPI tutorial, I go over other essential collective communication routines: gathering and scattering.

For all beginner lessons, go to the beginner MPI tutorial.

    MPI Scatter, Gather, and Allgather

    In the previous lesson, we went over the essentials of collective communication. We

covered the most basic collective communication routine, MPI_Bcast. In this lesson,

    we are going to expand on collective communication routines by going over two very

important routines: MPI_Scatter and MPI_Gather. We will also cover a variant of

    MPI_Gather, known as MPI_Allgather. The code for this tutorial is available here.

    An Introduction to MPI_Scatter

MPI_Scatter is a collective routine that is very similar to MPI_Bcast (if you are

    unfamiliar with these terms, please read the previous lesson). MPI_Scatter involves a

    designated root process sending data to all processes in a communicator. The primary

    difference between MPI_Bcast and MPI_Scatter is small but important. MPI_Bcast

    sends the same piece of data to all processes while MPI_Scatter sends chunks of an

    array to different processes. Check out the illustration below for further clarification.

In the illustration, MPI_Bcast takes a single data element at the root process (the red box) and copies it to all other processes. MPI_Scatter takes an array of elements and

    distributes the elements in the order of process rank. The first element (in red) goes to

    process zero, the second element (in green) goes to process one, and so on. Although


    the root process (process zero) contains the entire array of data, MPI_Scatter will copy

    the appropriate element into the receiving buffer of the process. Here is what the

    function prototype of MPI_Scatter looks like.

MPI_Scatter(void* send_data, int send_count, MPI_Datatype send_datatype,
            void* recv_data, int recv_count, MPI_Datatype recv_datatype,
            int root, MPI_Comm communicator)

Yes, the function looks big and scary, but let's examine it in more detail. The first parameter, send_data, is an array of data that resides on the root process. The second and third parameters, send_count and send_datatype, dictate how many elements of a specific MPI datatype will be sent to each process. If send_count is one and send_datatype is MPI_INT, then process zero gets the first integer of the array, process one gets the second integer, and so on. If send_count is two, then process zero gets the first and second integers, process one gets the third and fourth, and so on. In practice, send_count is often equal to the number of elements in the array divided by the number of processes. What's that you say? The number of elements isn't divisible by the number of processes? Don't worry, we will cover that in a later lesson.

The receiving parameters of the function prototype are nearly identical to the sending parameters. The recv_data parameter is a buffer of data that can hold recv_count elements that have a datatype of recv_datatype. The last parameters, root and communicator, indicate the root process that is scattering the array of data and the communicator in which the processes reside.
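To make the send_count semantics concrete, here is a small hypothetical fragment (assuming MPI_Init has already been called and exactly four processes are running) that scatters two integers to each process:

// Assumes exactly 4 processes; each rank receives 2 of the 8 integers.
int send_values[8] = {10, 11, 20, 21, 30, 31, 40, 41};  // only meaningful on root
int recv_values[2];
MPI_Scatter(send_values, 2, MPI_INT,   // root sends 2 ints to each rank
            recv_values, 2, MPI_INT,   // each rank receives 2 ints
            0, MPI_COMM_WORLD);
// Rank 0 now holds {10, 11}, rank 1 holds {20, 21}, and so on.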

    An Introduction to MPI_Gather

MPI_Gather is the inverse of MPI_Scatter. Instead of spreading elements from one process to many processes, MPI_Gather takes elements from many processes and gathers them to a single process. This routine is highly useful to many parallel

    algorithms, such as parallel sorting and searching. Below is a simple illustration of this

    algorithm.

    Similar to MPI_Scatter, MPI_Gather takes elements from each process and gathers

    them to the root process. The elements are ordered by the rank of the process from

    which they were received. The function prototype for MPI_Gather is identical to that of

    MPI_Scatter.

MPI_Gather(void* send_data, int send_count, MPI_Datatype send_datatype,
           void* recv_data, int recv_count, MPI_Datatype recv_datatype,
           int root, MPI_Comm communicator)


In MPI_Gather, only the root process needs to have a valid receive buffer. All other calling processes can pass NULL for recv_data. Also, don't forget that the recv_count parameter is the count of elements received per process, not the total sum of counts from all processes. This can often confuse beginning MPI programmers.
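As a concrete illustration, here is a small hypothetical fragment (assuming MPI_Init has been called and world_rank and world_size have already been queried) in which every rank contributes one integer and only the root allocates a receive buffer. Note that recv_count is 1, the count per process, even though the root ends up with world_size integers:

int my_value = world_rank * 10;        // one value contributed by each rank
int* gathered = NULL;
if (world_rank == 0) {
  // Only the root needs a buffer large enough for one int per process.
  gathered = (int*)malloc(sizeof(int) * world_size);
}
// recv_count is 1 because each process contributes one element, not world_size.
MPI_Gather(&my_value, 1, MPI_INT, gathered, 1, MPI_INT, 0, MPI_COMM_WORLD);
if (world_rank == 0) {
  // gathered now holds {0, 10, 20, ...} in rank order.
  free(gathered);
}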

    Computing Average of Numbers with MPI_Scatter and MPI_Gather

    In the code for this lesson, I have provided an example program that computes the

    average across all numbers in an array. The program is in avg.c. Although the program

    is quite simple, it demonstrates how one can use MPI to divide work across processes,

    perform computation on subsets of data, and then aggregate the smaller pieces into the

    final answer. The program takes the following steps:

1. Generate a random array of numbers on the root process (process 0).
2. Scatter the numbers to all processes, giving each process an equal amount of numbers.
3. Each process computes the average of its subset of the numbers.
4. Gather all the averages to the root process. The root process then computes the average of these numbers to get the final average.

    The main part of the code with the MPI calls looks like this:

if (world_rank == 0) {
  rand_nums = create_rand_nums(elements_per_proc * world_size);
}

// Create a buffer that will hold a subset of the random numbers
float* sub_rand_nums = malloc(sizeof(float) * elements_per_proc);

// Scatter the random numbers to all processes
MPI_Scatter(rand_nums, elements_per_proc, MPI_FLOAT, sub_rand_nums,
            elements_per_proc, MPI_FLOAT, 0, MPI_COMM_WORLD);

// Compute the average of your subset
float sub_avg = compute_avg(sub_rand_nums, elements_per_proc);

// Gather all partial averages down to the root process
float* sub_avgs = NULL;
if (world_rank == 0) {
  sub_avgs = malloc(sizeof(float) * world_size);
}
MPI_Gather(&sub_avg, 1, MPI_FLOAT, sub_avgs, 1, MPI_FLOAT, 0,
           MPI_COMM_WORLD);

// Compute the total average of all numbers.
if (world_rank == 0) {
  float avg = compute_avg(sub_avgs, world_size);
}

At the beginning of the code, the root process creates an array of random numbers. When MPI_Scatter is called, each process receives elements_per_proc elements of the original data. Each process computes the average of its subset of the data, and then the root process gathers each individual average. The total average is computed on this much smaller array of numbers.
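For reference, avg.c relies on two small helper functions, create_rand_nums and compute_avg. If you are writing the program from scratch, a minimal sketch along these lines would work; this is my approximation and may not match the lesson's exact versions:

#include <stdlib.h>

// Creates an array of random floats in [0, 1).
float* create_rand_nums(int num_elements) {
  float* rand_nums = (float*)malloc(sizeof(float) * num_elements);
  int i;
  for (i = 0; i < num_elements; i++) {
    rand_nums[i] = (float)rand() / (float)RAND_MAX;
  }
  return rand_nums;
}

// Computes the average of an array of floats.
float compute_avg(float* array, int num_elements) {
  float sum = 0.0f;
  int i;
  for (i = 0; i < num_elements; i++) {
    sum += array[i];
  }
  return sum / num_elements;
}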


Using the run script included in the code for this lesson, you can run the program and check the printed averages. Note that the numbers are randomly generated, so your final result will likely differ from mine.

    MPI_Allgather and Modification of Average Program

    So far, we have covered two MPI routines that perform many-to-one or one-to-many

    communication patterns, which simply means that many processes send/receive to one

    process. Oftentimes it is useful to be able to send many elements to many processes (i.e.

    a many-to-many communication pattern). MPI_Allgather has this characteristic.

    Given a set of elements distributed across all processes, MPI_Allgather will gather all

    of the elements to all the processes. In the most basic sense, MPI_Allgather is an

    MPI_Gather followed by an MPI_Bcast. The illustration below shows how data is

    distributed after a call to MPI_Allgather.

    Just like MPI_Gather, the elements from each process are gathered in order of their

    rank, except this time the elements are gathered to all processes. Pretty easy, right? The

    function declaration for MPI_Allgather is almost identical to MPI_Gather with the

    difference that there is no root process in MPI_Allgather.

MPI_Allgather(void* send_data, int send_count, MPI_Datatype send_datatype,
              void* recv_data, int recv_count, MPI_Datatype recv_datatype,
              MPI_Comm communicator)

    I have modified the average computation code to use MPI_Allgather. You can view the

    source in all_avg.c from the lesson code. The main difference in the code is shown

    below.

// Gather all partial averages down to all the processes
float* sub_avgs = (float*)malloc(sizeof(float) * world_size);
MPI_Allgather(&sub_avg, 1, MPI_FLOAT, sub_avgs, 1, MPI_FLOAT,
              MPI_COMM_WORLD);


// Compute the total average of all numbers.
float avg = compute_avg(sub_avgs, world_size);

The partial averages are now gathered to everyone using MPI_Allgather, and each process prints the final average. As you may have noticed, the only difference between all_avg.c and avg.c is that all_avg.c prints the average from every process after gathering the partial results with MPI_Allgather.

    Up Next

    In the next lesson, I will cover some of the more complex collective communication

    algorithms. Stay tuned! Feel free to leave any comments or questions about the lesson.

For all beginner lessons, go to the beginner MPI tutorial.