Class 4 - MPI Tutorial
TRANSCRIPT
8/22/2019
Beginner MPI Tutorial
Welcome to the MPI tutorial for beginners! In this tutorial, you will learn all of the
basic concepts of MPI by going through various examples. The different parts of the
tutorial are meant to build on top of one another. If you feel lost during any lesson, feel
free to leave a comment on the post explaining your dilemma. Either I or another
MPI expert will likely be able to get back to you soon.
This beginning tutorial assumes that the reader has a general knowledge of how parallel
programming works, has experience with working on a Linux system, and can also
understand the C programming language.
Introduction
MPI Introduction
Installing MPICH2
Running an MPI Hello World Application
Blocking Point-to-Point Communication
MPI Send and Receive
Dynamic Receiving with MPI_Probe (and MPI_Status)
Point-to-Point Communication Application Example: Random Walk
Collective Communication
MPI Broadcast and Collective Communication
MPI Scatter, Gather, and Allgather
MPI Introduction
The Message Passing Interface (MPI) first appeared as a standard in 1994 for
performing distributed-memory parallel computing. Since then, it has become the
dominant model for high-performance computing, and it is used widely in research,
academia, and industry.
The functionality of MPI is extremely rich, offering the programmer the ability to
perform point-to-point communication, collective communication, one-sided
communication, parallel I/O, and even dynamic process management. These terms
probably sound quite strange to a beginner, but by the end of all of the tutorials, the
terminology will be commonplace.
Before starting the tutorials, familiarize yourself with the basic concepts below. These
are all related to MPI, and many of these concepts are referred to throughout the
tutorials.
The Message Passing Model
The message passing model is a model of parallel programming in which processes can
only share data by messages. MPI adheres to this model. If one process wishes to
transfer data to another, it must initiate a message and explicitly send data to that
process. The other process will also have to explicitly receive the other message (except
in the case of one-sided communication, but we will get to that later).
Forcing communication to happen in this way offers several advantages for parallel
programs. For example, the message passing model is portable across a wide range of
architectures. An MPI program can run across computers that are spread across the
globe and connected by the internet, or it can execute on tightly-coupled clusters. An
MPI program can even run on the cores of a shared-memory processor and pass
messages through the shared memory. All of these details are abstracted by the
interface. The debugging of these programs is often easier too, since one does not need
to worry about processes overwriting the address space of another.
MPI's Design for the Message Passing Model
MPI has a couple classic concepts that encourage clear parallel program design using
the message passing model. The first is the notion of a communicator. A communicator
defines a group of processes that have the ability to communicate with one another. In
this group of processes, each is assigned a unique rank, and they explicitly
communicate with one another by their ranks.
The foundation of communication is built upon the simple send and receive operations.
A process may send a message to another process by providing the rank of the process
and a unique tag to identify the message. The receiver can then post a receive for a
message with a given tag (or it may not even care about the tag), and then handle the
data accordingly. Communications such as this, which involve one sender and one
receiver, are known as point-to-point communications.
There are many cases where processes may need to communicate with everyone else.
For example, a master process may need to broadcast information to all of its worker
processes. In this case, it would be cumbersome to write code that does all of the sends
and receives. In fact, it would often not use the network in an optimal manner. MPI can
handle a wide variety of these types of collective communications that involve all
processes.
Mixtures of point-to-point and collective communications can be used to create highly
complex parallel programs. In fact, this functionality is so powerful that it is not even
necessary to start describing the advanced mechanisms of MPI. We will save that until a
later lesson. For now, you should work on installing MPI on your machine. If you
already have MPI installed, great! You can head over to the MPI Hello World lesson.
Installing MPICH2
MPI is simply a standard which others follow in their implementation. Because of this,
there are a wide variety of MPI implementations out there. One of the most popular
implementations, MPICH2, will be used for all of the examples provided through this
site. Users are free to use any implementation they wish, but only instructions for
installing MPICH2 will be provided. Furthermore, the scripts and code provided for the
lessons are only guaranteed to execute and run with the latest version of MPICH2.
MPICH2 is a widely-used implementation of MPI that is developed primarily by
Argonne National Laboratory in the United States. The main reason for choosing
MPICH2 over other implementations is simply because of my familiarity with the
interface and because of my close relationship with Argonne National Laboratory. I also
encourage others to check out OpenMPI, which is also a widely-used implementation.
Installing MPICH2
The latest version of MPICH2 is available here. The version that I will be using for all
of the examples on the site is 1.4, which was released June 16, 2011. Go ahead and
download the source code, uncompress the folder, and change into the MPICH2
directory.
Once doing this, you should be able to configure your installation by performing
./configure. I added a couple of parameters to my configuration to avoid building the
MPI Fortran library. If you need to install MPICH2 to a local directory (for example, if
you don't have root access to your machine), type ./configure
--prefix=/installation/directory/path. For more information about possible
configuration parameters, type ./configure --help.
When configuration is done, it should say "Configuration completed." Once this is
through, it is time to build and install MPICH2 with make; sudo make install.
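For reference, the whole download-configure-install flow might look like the following sketch. The version number matches the one used in these lessons, but the install prefix is a placeholder, and the Fortran-disabling flags shown are the ones I believe configure accepted at the time; check ./configure --help for the exact names in your version.

```
# Unpack the downloaded source and enter the directory
tar xzf mpich2-1.4.tar.gz
cd mpich2-1.4

# Configure without the Fortran bindings; --prefix is only needed
# for a local (non-root) install
./configure --disable-f77 --disable-fc --prefix=$HOME/mpich2-install

# Build and install
make
make install

# For a local install, add the bin directory to your PATH so that
# mpicc and mpirun can be found
export PATH=$HOME/mpich2-install/bin:$PATH
```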
If your build was successful, you should be able to type mpich2version and see
version information about your MPICH2 installation.
Hopefully your build finished successfully. If not, you may have issues with missing
dependencies. For any issue, I highly recommend copying and pasting the error
message directly into Google.
Running an MPI Program
Now that you have installed MPICH, whether it's on your local machine or a cluster, it is
time to run a simple application. The MPI Hello World lesson goes over the basics of an
MPI program, along with a guide on how to run MPICH2 for the first time.
MPI Hello World
In this lesson, I will show you a basic MPI Hello World application and also discuss
how to run an MPI program. The lesson will cover the basics of initializing MPI and
running an MPI job across several processes. This lesson is intended to work with
installations of MPICH2 (specifically 1.4). If you have not installed MPICH2, please
refer back to the installing MPICH2 lesson.
MPI Hello World
First of all, the source code for this lesson can be downloaded here. Download it, extract
it, and change to the example directory. The directory should contain three files:
makefile, mpi_hello_world.c, and run.perl.
Open the mpi_hello_world.c source code. Below are some excerpts from the code.
#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
    // Initialize the MPI environment
    MPI_Init(NULL, NULL);

    // Get the number of processes
    int world_size;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    // Get the rank of the process
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    // Get the name of the processor
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    int name_len;
    MPI_Get_processor_name(processor_name, &name_len);

    // Print off a hello world message
    printf("Hello world from processor %s, rank %d"
           " out of %d processors\n",
           processor_name, world_rank, world_size);

    // Finalize the MPI environment.
    MPI_Finalize();
}
You will notice that the first step to building an MPI program is including the MPI
header file with #include <mpi.h>. After this, the MPI environment must be
initialized with MPI_Init(NULL, NULL). During MPI_Init, all of MPI's global and
internal variables are constructed. For example, a communicator is formed around all of
the processes that were spawned, and unique ranks are assigned to each process.
Currently, MPI_Init takes two arguments that are not necessary, and the extra
parameters are simply left as extra space in case future implementations might need
them.
After MPI_Init, there are two main functions that are called. These two functions are
used in almost every single MPI program that you will write.
MPI_Comm_size(MPI_Comm communicator, int* size) returns the size of
a communicator. In our example, MPI_COMM_WORLD (which is constructed
for us by MPI) encloses all of the processes in the job, so this call should return
the number of processes that were requested for the job.

MPI_Comm_rank(MPI_Comm communicator, int* rank) returns the rank of
a process in a communicator. Each process inside of a communicator is assigned
an incremental rank starting from zero. The ranks of the processes are primarily
used for identification purposes when sending and receiving messages.
A miscellaneous and less-used function in this program is
MPI_Get_processor_name(char* name, int* name_length), which can obtain the
actual name of the processor on which the process is executing. The final call in this
program, MPI_Finalize(), is used to clean up the MPI environment. No more MPI
calls can be made after this one.
Running MPI Hello World
Now compile the example by typing make. My makefile looks for the MPICC
environment variable. If you installed MPICH2 to a local directory, set your MPICC
environment variable to point to your mpicc binary. The mpicc program in your
installation is really just a wrapper around gcc, and it makes compiling and linking all
of the necessary MPI routines much easier.
After your program is compiled, it is ready to be executed. Now comes the part where
you might have to do some additional configuration. If you are running MPI programs
on a cluster of nodes, you will have to set up a host file. If you are simply running MPI
on a laptop or a single machine, disregard the next piece of information.
The host file contains names of all of the computers on which your MPI job will
execute. For ease of execution, you should be sure that all of these computers have SSH
access, and you should also set up an authorized keys file to avoid a password prompt
for SSH. My host file looks like this.
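A host file is simply a plain-text list of machine names, one per line. The hostnames below are placeholders, not the actual machines from my cluster:

```
host1
host2
host3
host4
```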
For the run script that I have provided in the download, you should set an environment
variable called MPI_HOSTS and have it point to your hosts file. My script will
automatically include it in the command line when the MPI job is launched. If you do
not need a hosts file, simply do not set the environment variable. Also, if you have a
local installation of MPI, you should set the MPIRUN environment variable to point to
the mpirun binary from the installation. After this, call ./run.perl mpi_hello_world
to run the example application.
As expected, the MPI program was launched across all of the hosts in my host file. Each
process was assigned a unique rank, which was printed off along with the process name.
As one can see from my example output, the output of the processes is in an arbitrary
order since there is no synchronization involved before printing.
Notice how the script called mpirun. This is the program that the MPI implementation uses
to launch the job. Processes are spawned across all the hosts in the host file and the MPI
program executes across each process. My script automatically supplies the -n flag to
set the number of MPI processes to four. Try changing the run script and launching
more processes! Don't accidentally crash your system though.
Now you might be asking, "My hosts are actually dual-core machines. How can I get
MPI to spawn processes across the individual cores first before individual machines?"
The solution is pretty simple. Just modify your hosts file and place a colon and the
number of cores per processor after the host name. For example, I specified that each of
my hosts has two cores.
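With the per-core counts added, the same hypothetical host file would look like this, where the :2 tells MPICH2 that each host has two cores:

```
host1:2
host2:2
host3:2
host4:2
```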
When I execute the run script again, voila!, the MPI job spawns two processes on only
two of my hosts.
Up Next
Now that you have a basic understanding of how an MPI program is executed, it is now
time to learn fundamental point-to-point communication routines. In the next lesson, I
cover basic sending and receiving routines in MPI. Feel free to also examine the
beginner MPI tutorial for a complete reference of all of the beginning MPI lessons.
MPI Send and Receive
Sending and receiving are the two foundational concepts of MPI. Almost every single
function in MPI can be implemented with basic send and receive calls. In this lesson, I
will discuss how to use MPI's blocking sending and receiving functions, and I will also
overview other basic concepts associated with transmitting data using MPI. The code
for this tutorial is available here.
Overview of Sending and Receiving with MPI
MPI's send and receive calls operate in the following manner. First, process A decides a
message needs to be sent to process B. Process A then packs up all of its necessary data
into a buffer for process B. These buffers are often referred to as envelopes since the
data is being packed into a single message before transmission (similar to how letters
are packed into envelopes before transmission to the post office). After the data is
packed into a buffer, the communication device (which is often a network) is
responsible for routing the message to the proper location. The location of the message
is defined by the process's rank.
Even though the message is routed to B, process B still has to acknowledge that it wants
to receive A's data. Once it does this, the data has been transmitted. Process A then
receives acknowledgment that the data has been transmitted and may go back to work.
Sometimes there are cases when A might have to send many different types of messages
to B. Instead of B having to go through extra measures to differentiate all these
messages, MPI allows senders and receivers to also specify message IDs with the
message (known as tags). When process B only requests a message with a certain tag
number, messages with different tags will be buffered by the network until B is ready
for them.
With these concepts in mind, let's look at the prototypes for the MPI sending and
receiving functions.
MPI_Send(void* data, int count, MPI_Datatype datatype, int destination, int tag, MPI_Comm communicator)
MPI_Recv(void* data, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm communicator, MPI_Status* status)
Although this might seem like a mouthful when reading all of the arguments, they
become easier to remember since almost every MPI call uses similar syntax. The first
argument is the data buffer. The second and third arguments describe the count and type
of elements that reside in the buffer. MPI_Send sends the exact count of elements, and
MPI_Recv will receive at most the count of elements (more on this in the next lesson).
The fourth and fifth arguments specify the rank of the sending/receiving process and the
tag of the message. The sixth argument specifies the communicator and the last
argument (for MPI_Recv only) provides information about the received message.
Elementary MPI Datatypes
The MPI_Send and MPI_Recv functions utilize MPI Datatypes as a means to specify
the structure of a message at a higher level. For example, if the process wishes to send
one integer to another, it would use a count of one and a datatype of MPI_INT. The
other elementary MPI datatypes are listed below with their equivalent C datatypes.
MPI_CHAR char
MPI_SHORT short int
MPI_INT int
MPI_LONG long int
MPI_LONG_LONG long long int
MPI_UNSIGNED_CHAR unsigned char
MPI_UNSIGNED_SHORT unsigned short int
MPI_UNSIGNED unsigned int
MPI_UNSIGNED_LONG unsigned long int
MPI_UNSIGNED_LONG_LONG unsigned long long int
MPI_FLOAT float
MPI_DOUBLE double
MPI_LONG_DOUBLE long double
MPI_BYTE
For now, we will only make use of these datatypes in the beginner MPI tutorial. Once
we have covered enough basics, you will learn how to create your own MPI datatypes
for characterizing more complex types of messages.
MPI Send / Recv Program
The code for this tutorial is available here. Go ahead and download and extract the code.
I refer the reader back to the MPI Hello World Lesson for instructions on how to use my
code packages.
The first example is in send_recv.c. Some of the major parts of the program are shown
below.
// Find out rank, size
int world_rank;
MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
int world_size;
MPI_Comm_size(MPI_COMM_WORLD, &world_size);

int number;
if (world_rank == 0) {
    number = -1;
    MPI_Send(&number, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
} else if (world_rank == 1) {
    MPI_Recv(&number, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);
    printf("Process 1 received number %d from process 0\n",
           number);
}
MPI_Comm_rank and MPI_Comm_size are first used to determine the world size along
with the rank of the process. Then process zero initializes a number to the value of
negative one and sends this value to process one. As you can see in the else if statement,
process one is calling MPI_Recv to receive the number. It also prints off the received
value.
Since we are sending and receiving exactly one integer, each process requests that one
MPI_INT be sent/received. Each process also uses a tag number of zero to identify the
message. The processes could have also used the predefined constant MPI_ANY_TAG
for the tag number since only one type of message was being transmitted.
Running the example program looks like this.
As expected, process one receives negative one from process zero.
MPI Ping Pong Program
The next example is a ping pong program. In this example, processes use MPI_Send
and MPI_Recv to continually bounce messages off of each other until they decide to
stop. Take a look at ping_pong.c in the example code download. The major portions of
the code look like this.
int ping_pong_count = 0;
int partner_rank = (world_rank + 1) % 2;
while (ping_pong_count < PING_PONG_LIMIT) {
    if (world_rank == ping_pong_count % 2) {
        // Increment the ping pong count before you send it
        ping_pong_count++;
        MPI_Send(&ping_pong_count, 1, MPI_INT, partner_rank, 0,
                 MPI_COMM_WORLD);
        printf("%d sent and incremented ping_pong_count "
               "%d to %d\n", world_rank, ping_pong_count,
               partner_rank);
    } else {
        MPI_Recv(&ping_pong_count, 1, MPI_INT, partner_rank, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("%d received ping_pong_count %d from %d\n",
               world_rank, ping_pong_count, partner_rank);
    }
}
This example is meant to be executed with only two processes. The processes first
determine their partner with some simple arithmetic. A ping_pong_count is initialized to
zero and it is incremented at each ping pong step by the sending process. As the
ping_pong_count is incremented, the processes take turns being the sender and receiver.
Finally, after the limit is reached (ten in my code), the processes stop sending and
receiving. The output of the example code will look something like this.
The output of the program on other machines will likely be different. However, as you
can see, processes zero and one are both taking turns sending and receiving the ping
pong counter to each other.
Ring Program
I have included one more example of MPI_Send and MPI_Recv using more than two
processes. In this example, a value is passed around by all processes in a ring-like
fashion. Take a look at ring.c in the example code download. The major portion of the
code looks like this.
int token;
if (world_rank != 0) {
    MPI_Recv(&token, 1, MPI_INT, world_rank - 1, 0,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    printf("Process %d received token %d from process %d\n",
           world_rank, token, world_rank - 1);
} else {
    // Set the token's value if you are process 0
    token = -1;
}
MPI_Send(&token, 1, MPI_INT, (world_rank + 1) % world_size,
         0, MPI_COMM_WORLD);

// Now process 0 can receive from the last process.
if (world_rank == 0) {
    MPI_Recv(&token, 1, MPI_INT, world_size - 1, 0,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    printf("Process %d received token %d from process %d\n",
           world_rank, token, world_size - 1);
}
The ring program initializes a value on process zero, and the value is passed around
every single process. The program terminates when process zero receives the value
from the last process. As you can see from the program, extra care is taken to assure that
it doesn't deadlock. In other words, process zero makes sure that it has completed its
first send before it tries to receive the value from the last process. All of the other
processes simply call MPI_Recv (receiving from their neighboring lower process) and
then MPI_Send (sending the value to their neighboring higher process) to pass the value
along the ring.
MPI_Send and MPI_Recv will block until the message has been transmitted. Because of
this, the printfs should occur in the order in which the value is passed. Using five
processes, the output should look like this.
As we can see, process zero first sends a value of negative one to process one. This
value is passed around the ring until it gets back to process zero.
Up Next
Now that you have a basic understanding of MPI_Send and MPI_Recv, it is now time to
go a little bit deeper into these functions. In the next lesson, I cover how to probe and
dynamically receive messages. Feel free to also examine the beginner MPI tutorial for a
complete reference of all of the beginning MPI lessons.
Dynamic Receiving with MPI_Probe (and MPI_Status)
In the previous lesson, I discussed how to use MPI_Send and MPI_Recv to perform
standard point-to-point communication. I only covered how to send messages in which
the length of the message was known beforehand. Although it is possible to send the
length of the message as a separate send / recv operation, MPI natively supports
dynamic messages with just a few additional function calls. I will be going over how to
use these functions in this lesson. The code for this tutorial is located here.
The MPI_Status Structure
As covered in the previous lesson, the MPI_Recv operation takes the address of an
MPI_Status structure as an argument (which can be ignored with
MPI_STATUS_IGNORE). If we pass an MPI_Status structure to the MPI_Recv
function, it will be populated with additional information about the receive operation
after it completes. The three primary pieces of information include:
1. The rank of the sender. The rank of the sender is stored in the MPI_SOURCE
element of the structure. That is, if we declare an MPI_Status stat variable,
the rank can be accessed with stat.MPI_SOURCE.
2. The tag of the message. The tag of the message can be accessed by the
MPI_TAG element of the structure (similar to MPI_SOURCE).
3. The length of the message. The length of the message does not have a
predefined element in the status structure. Instead, we have to find out the length
of the message with MPI_Get_count(MPI_Status* status, MPI_Datatype
datatype, int* count), where count is the total number of datatype elements
that were received.
Why would any of this information be necessary? It turns out that MPI_Recv can take
MPI_ANY_SOURCE for the rank of the sender and MPI_ANY_TAG for the tag of the
message. For this case, the MPI_Status structure is the only way to find out the actual
sender and tag of the message. Furthermore, MPI_Recv is not guaranteed to receive the
entire count of elements passed as the argument to the function call. Instead, it
receives the number of elements that were sent to it (and returns an error if more
elements were sent than the desired receive amount). The MPI_Get_count function is
used to determine the actual receive amount.
An Example of Querying the MPI_Status Structure
The program that queries the MPI_Status structure, check_status.c, is provided in the
example code. The program sends a random amount of numbers to a receiver, and the
receiver then finds out how many numbers were sent. The main part of the code looks
like this.
const int MAX_NUMBERS = 100;
int numbers[MAX_NUMBERS];
int number_amount;
if (world_rank == 0) {
    // Pick a random amount of integers to send to process one
    srand(time(NULL));
    number_amount = (rand() / (float)RAND_MAX) * MAX_NUMBERS;
    // Send the amount of integers to process one
    MPI_Send(numbers, number_amount, MPI_INT, 1, 0, MPI_COMM_WORLD);
    printf("0 sent %d numbers to 1\n", number_amount);
} else if (world_rank == 1) {
    MPI_Status status;
    // Receive at most MAX_NUMBERS from process zero
    MPI_Recv(numbers, MAX_NUMBERS, MPI_INT, 0, 0, MPI_COMM_WORLD,
             &status);
    // After receiving the message, check the status to determine
    // how many numbers were actually received
    MPI_Get_count(&status, MPI_INT, &number_amount);
    // Print off the amount of numbers, and also print additional
    // information in the status object
    printf("1 received %d numbers from 0. Message source = %d, "
           "tag = %d\n",
           number_amount, status.MPI_SOURCE, status.MPI_TAG);
}
As we can see, process zero randomly sends up to MAX_NUMBERS integers to
process one. Process one then calls MPI_Recv for a total of MAX_NUMBERS integers.
Although process one is passing MAX_NUMBERS as the argument to MPI_Recv,
process one will receive at most this amount of numbers. In the code, process one calls
MPI_Get_count with MPI_INT as the datatype to find out how many integers were
actually received. Along with printing off the size of the received message, process one
also prints off the source and tag of the message by accessing the MPI_SOURCE and
MPI_TAG elements of the status structure.
As a clarification, the return value from MPI_Get_count is relative to the datatype
which is passed. If the user were to use MPI_CHAR as the datatype, the returned
amount would be four times as large (assuming an integer is four bytes and a char is one
byte).
If you run the check_status program, the output should look similar to this.
As expected, process zero sends a random amount of integers to process one, which
prints off information about the received message.
Using MPI_Probe to Find Out the Message Size
Now that you understand how the MPI_Status object works, we can use it to our
advantage a little bit more. Instead of posting a receive and simply providing a really
large buffer to handle all possible sizes of messages (as we did in the last example), you
can use MPI_Probe to query the message size before actually receiving it. The function
prototype looks like this.
MPI_Probe(int source, int tag, MPI_Comm comm, MPI_Status* status)

MPI_Probe looks quite similar to MPI_Recv. In fact, you can think of MPI_Probe as an
MPI_Recv that does everything but receive the message. Similar to MPI_Recv,
MPI_Probe will block for a message with a matching tag and sender. When the message
is available, it will fill the status structure with information. The user can then use
MPI_Recv to receive the actual message.
The provided code has an example of this in probe.c. Here's what the main source code
looks like.
int number_amount;
if (world_rank == 0) {
    const int MAX_NUMBERS = 100;
    int numbers[MAX_NUMBERS];
    // Pick a random amount of integers to send to process one
    srand(time(NULL));
    number_amount = (rand() / (float)RAND_MAX) * MAX_NUMBERS;
    // Send the random amount of integers to process one
    MPI_Send(numbers, number_amount, MPI_INT, 1, 0, MPI_COMM_WORLD);
    printf("0 sent %d numbers to 1\n", number_amount);
} else if (world_rank == 1) {
    MPI_Status status;
    // Probe for an incoming message from process zero
    MPI_Probe(0, 0, MPI_COMM_WORLD, &status);
    // When probe returns, the status object has the size and other
    // attributes of the incoming message. Get the size of the message
    MPI_Get_count(&status, MPI_INT, &number_amount);
    // Allocate a buffer just big enough to hold the incoming numbers
    int* number_buf = (int*)malloc(sizeof(int) * number_amount);
    // Now receive the message with the allocated buffer
    MPI_Recv(number_buf, number_amount, MPI_INT, 0, 0, MPI_COMM_WORLD,
             MPI_STATUS_IGNORE);
    printf("1 dynamically received %d numbers from 0.\n",
           number_amount);
    free(number_buf);
}
Similar to the last example, process zero picks a random amount of numbers to send to
process one. What is different in this example is that process one now calls MPI_Probe
to find out how many elements process zero is trying to send (using MPI_Get_count).
Process one then allocates a buffer of the proper size and receives the numbers. Running
the code will look similar to this.
Although this example is trivial, MPI_Probe forms the basis of many dynamic MPI
applications. For example, master / slave programs will often make heavy use of
MPI_Probe when exchanging variable-sized worker messages. As an exercise, make a
wrapper around MPI_Recv that uses MPI_Probe for any dynamic applications you
might write. It makes the code look much nicer.
Next
Do you feel comfortable using the standard blocking point-to-point communication
routines? If so, then you already have the ability to write endless amounts of parallel
applications! Let's look at a more advanced example of using the routines you have
learned. Check out the application example using MPI_Send, MPI_Recv, and
MPI_Probe.
Point-to-Point Communication Application: Random Walk

It's time to go through an application example using some of the concepts introduced in
the sending and receiving tutorial and the MPI_Probe and MPI_Status lesson. The code
for the application can be downloaded here. The application simulates a process which I
refer to as random walking. The basic problem definition of a random walk is as
follows. Given a Min, Max, and random walker W, make walker W take S random walks
of arbitrary length to the right. If the process goes out of bounds, it wraps back around.
W can only move one unit to the right or left at a time.
Although the application in itself is very basic, the parallelization of random walking
can simulate the behavior of a wide variety of parallel applications. More on that later.
For now, let's overview how to parallelize the random walk problem.
Parallelization of the Random Walking Problem
Our first task, which is pertinent to many parallel programs, is splitting the domain
across processes. The random walk problem has a one-dimensional domain of size
Max - Min + 1 (since Max and Min are inclusive to the walker). Assuming that walkers
can only take integer-sized steps, we can easily partition the domain into near-equal-sized
chunks across processes. For example, if Min is 0 and Max is 20 and we have four
processes, the domain would be split like this.
The first three processes own five units of the domain while the last process takes the
last five units plus the one remaining unit.
Once the domain has been partitioned, the application will initialize walkers. As
explained earlier, a walker will take S walks with a random total walk size. For
example, if the walker takes a walk of size six on process zero (using the previous
domain decomposition), the execution of the walker will go like this:
1. The walker starts taking incremental steps. When it hits value four, however, it
has reached the end of the bounds of process zero. Process zero now has to
communicate the walker to process one.
2. Process one receives the walker and continues walking until it has reached its
total walk size of six. The walker can then proceed on a new random walk.
In this example, W only had to be communicated one time from process zero to process
one. If W had to take a longer walk, however, it may have needed to be passed through
more processes along its path through the domain.
Coding the Application using MPI_Send and MPI_Recv
This application can be coded using MPI_Send and MPI_Recv. Before we begin
looking at code, let's establish some preliminary characteristics and functions of the
program:
1. Each process determines their part of the domain.
2. Each process initializes exactly N walkers, all of which start at the first value of
their local domain.
3. Each walker has two associated integer values: the current position of the walker
and the number of steps left to take.
4. Walkers start traversing through the domain and are passed to other processes
until they have completed their walk.
5. The processes terminate when all walkers have finished.
Let's begin by writing code for the domain decomposition. The function will take in the
total domain size and find the appropriate subdomain for the MPI process. It will also
give any remainder of the domain to the final process. For simplicity, I just call
MPI_Abort for any errors that are found. The function, called decompose_domain,
looks like this:
void decompose_domain(int domain_size, int world_rank,
                      int world_size, int* subdomain_start,
                      int* subdomain_size) {
  if (world_size > domain_size) {
    // Don't worry about this special case. Assume the domain size
    // is greater than the world size.
    MPI_Abort(MPI_COMM_WORLD, 1);
  }
  *subdomain_start = domain_size / world_size * world_rank;
  *subdomain_size = domain_size / world_size;
  if (world_rank == world_size - 1) {
    // Give remainder to last process
    *subdomain_size += domain_size % world_size;
  }
}
As you can see, the function splits the domain into even chunks, taking care of the case
when a remainder is present. The function returns a subdomain start and a subdomain
size.
Next, we need to create a function that initializes walkers. We first define a walker
structure that looks like this:
typedef struct {
  int location;
  int num_steps_left_in_walk;
} Walker;
Our initialization function, called initialize_walkers, takes the subdomain bounds and
adds walkers to an incoming_walkers vector (by the way, this application is in C++).
void initialize_walkers(int num_walkers_per_proc, int max_walk_size,
                        int subdomain_start, int subdomain_size,
                        vector<Walker>* incoming_walkers) {
  Walker walker;
  for (int i = 0; i < num_walkers_per_proc; i++) {
    // Initialize walkers at the start of the subdomain
    walker.location = subdomain_start;
    walker.num_steps_left_in_walk =
        (rand() / (float)RAND_MAX) * max_walk_size;
    incoming_walkers->push_back(walker);
  }
}
After initialization, it is time to progress the walkers. Let's start off by making a
walking function. This function is responsible for progressing the walker until it has
finished its walk. If it goes out of local bounds, it is added to the outgoing_walkers
vector.
void walk(Walker* walker, int subdomain_start, int subdomain_size,
          int domain_size, vector<Walker>* outgoing_walkers) {
  while (walker->num_steps_left_in_walk > 0) {
    if (walker->location == subdomain_start + subdomain_size) {
      // Take care of the case when the walker is at the end
      // of the domain by wrapping it around to the beginning
      if (walker->location == domain_size) {
        walker->location = 0;
      }
      outgoing_walkers->push_back(*walker);
      break;
    } else {
      walker->num_steps_left_in_walk--;
      walker->location++;
    }
  }
}
Now that we have established an initialization function (that populates an incoming
walker list) and a walking function (that populates an outgoing walker list), we only
need two more functions: a function that sends outgoing walkers and a function that
receives incoming walkers. The sending function looks like this:
void send_outgoing_walkers(vector<Walker>* outgoing_walkers,
                           int world_rank, int world_size) {
  // Send the data as an array of MPI_BYTEs to the next process.
  // The last process sends to process zero.
  MPI_Send((void*)outgoing_walkers->data(),
           outgoing_walkers->size() * sizeof(Walker), MPI_BYTE,
           (world_rank + 1) % world_size, 0, MPI_COMM_WORLD);
  // Clear the outgoing walkers
  outgoing_walkers->clear();
}
The function that receives incoming walkers should use MPI_Probe since it does not
know beforehand how many walkers it will receive. This is what it looks like:
void receive_incoming_walkers(vector<Walker>* incoming_walkers,
                              int world_rank, int world_size) {
  // Probe for new incoming walkers
  MPI_Status status;
  // Receive from the process before you. If you are process zero,
  // receive from the last process
  int incoming_rank =
      (world_rank == 0) ? world_size - 1 : world_rank - 1;
  MPI_Probe(incoming_rank, 0, MPI_COMM_WORLD, &status);
  // Resize your incoming walker buffer based on how much data is
  // being received
  int incoming_walkers_size;
  MPI_Get_count(&status, MPI_BYTE, &incoming_walkers_size);
  incoming_walkers->resize(incoming_walkers_size / sizeof(Walker));
  MPI_Recv((void*)incoming_walkers->data(), incoming_walkers_size,
           MPI_BYTE, incoming_rank, 0, MPI_COMM_WORLD,
           MPI_STATUS_IGNORE);
}
Now we have established the main functions of the program. We have to tie all these
functions together as follows:
1. Initialize the walkers.
2. Progress the walkers with the walk function.
3. Send out any walkers in the outgoing_walkers vector.
4. Receive new walkers and put them in the incoming_walkers vector.
5. Repeat steps two through four until all walkers have finished.
The first attempt at writing this program is below. For now, we will not worry about
how to determine when all walkers have finished. Before you look at the code, I must
warn you that this code is incorrect! With this in mind, let's look at my code and
hopefully you can see what might be wrong with it.
// Find your part of the domain
decompose_domain(domain_size, world_rank, world_size,
                 &subdomain_start, &subdomain_size);
// Initialize walkers in your subdomain
initialize_walkers(num_walkers_per_proc, max_walk_size,
                   subdomain_start, subdomain_size,
                   &incoming_walkers);
while (!all_walkers_finished) { // Determine walker completion later
  // Process all incoming walkers
  for (int i = 0; i < incoming_walkers.size(); i++) {
    walk(&incoming_walkers[i], subdomain_start, subdomain_size,
         domain_size, &outgoing_walkers);
  }
  // Send all outgoing walkers to the next process.
  send_outgoing_walkers(&outgoing_walkers, world_rank,
                        world_size);
  // Receive all the new incoming walkers
  receive_incoming_walkers(&incoming_walkers, world_rank,
                           world_size);
}
Everything looks normal, but the order of function calls has introduced a very likely
scenario: deadlock.
Deadlock and Prevention
According to Wikipedia, deadlock "refers to a specific condition when two or more
processes are each waiting for the other to release a resource, or more than two
processes are waiting for resources in a circular chain." In our case, the above code
will result in a circular chain of MPI_Send calls.
It is worth noting that the above code will actually not deadlock most of the time.
Although MPI_Send is a blocking call, the MPI specification says that MPI_Send
blocks until the send buffer can be reclaimed. This means that MPI_Send will return
when the network can buffer the message. If the sends eventually can't be buffered by
the network, they will block until a matching receive is posted. In our case, there are
enough small sends and frequent matching receives to not worry about deadlock;
however, a big enough network buffer should never be assumed.
Since we are only focusing on MPI_Send and MPI_Recv in this lesson, the best way to
avoid the possible sending and receiving deadlock is to order the messaging such that
sends will have matching receives and vice versa. One easy way to do this is to change
our loop around such that even-numbered processes send outgoing walkers before
receiving walkers and odd-numbered processes do the opposite. Given two stages of
execution, the sending and receiving will now look like this:
Note: Executing this with one process can still deadlock. To avoid this, simply
don't perform sends and receives when using one process. You may be asking, does
this still work with an odd number of processes? We can go through a similar diagram
again with three processes:
As you can see, at all three stages, there is at least one posted MPI_Send that matches a
posted MPI_Recv, so we don't have to worry about the occurrence of deadlock.
Determining Completion of All Walkers
Now comes the final step of the program: determining when every single walker has
finished. Since walkers can walk for a random length, they can finish their journey on
any process. Because of this, it is difficult for all processes to know when all walkers
have finished without some sort of additional communication. One possible solution is
to have process zero keep track of all of the walkers that have finished and then tell all
the other processes when to terminate. This solution, however, is quite cumbersome
since each process would have to report any completed walkers to process zero and then
also handle different types of incoming messages.
For this lesson, we will keep things simple. Since we know the maximum distance that
any walker can travel and the smallest total size it can travel for each pair of sends and
receives (the subdomain size), we can figure out the number of sends and receives each
process should do before termination. Using this characteristic of the program along
with our strategy to avoid deadlock, the final main part of the program looks like this:
// Find your part of the domain
decompose_domain(domain_size, world_rank, world_size,
                 &subdomain_start, &subdomain_size);
// Initialize walkers in your subdomain
initialize_walkers(num_walkers_per_proc, max_walk_size,
                   subdomain_start, subdomain_size,
                   &incoming_walkers);
// Determine the maximum amount of sends and receives needed to
// complete all walkers
int maximum_sends_recvs =
    max_walk_size / (domain_size / world_size) + 1;
for (int m = 0; m < maximum_sends_recvs; m++) {
  // Process all incoming walkers
  for (int i = 0; i < incoming_walkers.size(); i++) {
    walk(&incoming_walkers[i], subdomain_start, subdomain_size,
         domain_size, &outgoing_walkers);
  }
  // Send and receive if you are even and vice versa for odd
  if (world_rank % 2 == 0) {
    send_outgoing_walkers(&outgoing_walkers, world_rank,
                          world_size);
    receive_incoming_walkers(&incoming_walkers, world_rank,
                             world_size);
  } else {
    receive_incoming_walkers(&incoming_walkers, world_rank,
                             world_size);
    send_outgoing_walkers(&outgoing_walkers, world_rank,
                          world_size);
  }
}
Running the Application
The code for the application can be downloaded here. In contrast to the other lessons,
this code uses C++. When installing MPICH2, you also installed the C++ MPI compiler
(unless you explicitly configured it otherwise). If you installed MPICH2 in a local
directory, make sure that you have set your MPICXX environment variable to point to
the correct mpicxx compiler in order to use my makefile.
In my code, I have set up the run script to provide default values for the program: 100
for the domain size, 500 for the maximum walk size, and 20 for the number of walkers
per process. The run script should spawn five MPI processes, and the output should
look similar to this:
The output continues until processes finish all sending and receiving of all walkers.
So Whats Next?
If you have made it through this entire application and feel comfortable, then good! This
application is quite advanced for a first real application. If you still don't feel
comfortable with MPI_Send, MPI_Recv, and MPI_Probe, I'd recommend going
through some of the examples in my recommended books for more practice.
Next, we will start learning about collective communication in MPI, so stay tuned!
Also, at the beginning, I told you that the concepts of this program are applicable to
many parallel programs. I don't want to leave you hanging, so I have included some
additional reading material below for anyone that wishes to learn more. Enjoy!
ADDITIONAL READING
Random Walking and Its Similarity to Parallel Particle Tracing
The random walk problem that we just coded, although
seemingly trivial, can actually form the basis of simulating
many types of parallel applications. Some parallel applications
in the scientific domain require many types of randomized sends
and receives. One example application is parallel particle
tracing.
Parallel particle tracing is one of the primary methods that are
used to visualize flow fields. Particles are inserted into the flow
field and then traced along the flow using numerical integration
techniques (such as Runge-Kutta). The traced paths can then be
rendered for visualization purposes. One example rendering is of the tornado image at
the top left.
Performing efficient parallel particle tracing can be very difficult. The main reason for
this is because the direction in which particles travel can only be determined after each
incremental step of the integration. Therefore, it is hard for processes to coordinate and
balance all communication and computation. To understand this better, lets look at a
typical parallelization of particle tracing.
In this illustration, we see that the domain is split among six processes. Particles
(sometimes referred to as seeds) are then placed in the subdomains (similar to how we
placed walkers in subdomains), and then they begin tracing. When particles go out of
bounds, they have to be exchanged with processes which have the proper subdomain.
This process is repeated until the particles have either left the entire domain or have
reached a maximum trace length.
The parallel particle tracing problem can be solved with MPI_Send, MPI_Recv, and
MPI_Probe in a similar manner to our application that we just coded. There are,
however, much more sophisticated MPI routines that can get the job done more
efficiently. We will talk about these in the coming lessons.
I hope you can now see at least one example of how the random walk problem is similar
to other parallel applications. Stay tuned for more lessons and applications!
MPI Broadcast and Collective Communication
So far in the beginner MPI tutorial, we have examined point-to-point communication,
which is communication between two processes. This lesson is the start of the collective
communication section. Collective communication is a method of communication
which involves participation of all processes in a communicator. In this lesson, we will
discuss the implications of collective communication and go over a standard collective
routine: broadcasting. The code for the lesson can be downloaded here.
Collective Communication and Synchronization Points
One of the things to remember about collective communication is that it implies a
synchronization point among processes. This means that all processes must reach a
point in their code before they can all begin executing again.
Before going into detail about collective communication routines, let's examine
synchronization in more detail. As it turns out, MPI has a special function that is
dedicated to synchronizing processes:
MPI_Barrier(MPI_Comm communicator)
The name of the function is quite descriptive: the function forms a barrier, and no
processes in the communicator can pass the barrier until all of them call the function.
Here's an illustration. Imagine the horizontal axis represents execution of the program
and the circles represent different processes:
Process zero first calls MPI_Barrier at the first time snapshot (T1). While process zero
is hung up at the barrier, processes one and three eventually make it (T2). When process
two finally makes it to the barrier (T3), all of the processes then begin execution again
(T4).
MPI_Barrier can be useful for many things. One of the primary uses of MPI_Barrier is
to synchronize a program so that portions of the parallel code can be timed accurately.
Want to know how MPI_Barrier is implemented? Sure you do! Do you remember the
ring program from the MPI_Send and MPI_Recv tutorial? To refresh your memory, we
wrote a program that passed a token around all processes in a ring-like fashion. This
type of program is one of the simplest methods to implement a barrier since a token
can't be passed around completely until all processes work together.
One final note about synchronization: always remember that every collective call you
make is synchronized. In other words, if you can't successfully complete an
MPI_Barrier, then you also can't successfully complete any collective call. If you try to
call MPI_Barrier or other collective routines without ensuring all processes in the
communicator will also call it, your program will idle. This can be very confusing for
beginners, so be careful!
Broadcasting with MPI_Bcast
A broadcast is one of the standard collective communication techniques. During a
broadcast, one process sends the same data to all processes in a communicator. One of
the main uses of broadcasting is to send out user input to a parallel program, or to send
out configuration parameters to all processes.
The communication pattern of a broadcast looks like this:
In this example, process zero is the root process, and it has the initial copy of data. All
of the other processes receive the copy of data.
In MPI, broadcasting can be accomplished by using MPI_Bcast. The function prototype
looks like this:
MPI_Bcast(void* data, int count, MPI_Datatype datatype, int root,
          MPI_Comm communicator)
Although the root process and receiver processes do different jobs, they all call the same
MPI_Bcast function. When the root process (in our example, it was process zero) calls
MPI_Bcast, the data variable will be sent to all other processes. When all of the receiver
processes call MPI_Bcast, the data variable will be filled in with the data from the root
process.
Broadcasting with MPI_Send and MPI_Recv
At first, it might seem that MPI_Bcast is just a simple wrapper around MPI_Send and
MPI_Recv. In fact, we can make this wrapper function right now. Our function, called
my_bcastcan be downloaded in the example code for this lesson (my_bcast.c). It takesthe same arguments as MPI_Bcast and looks like this:
void my_bcast(void* data, int count, MPI_Datatype datatype, int root,
              MPI_Comm communicator) {
  int world_rank;
  MPI_Comm_rank(communicator, &world_rank);
  int world_size;
  MPI_Comm_size(communicator, &world_size);
  if (world_rank == root) {
    // If we are the root process, send our data to everyone
    int i;
    for (i = 0; i < world_size; i++) {
      if (i != world_rank) {
        MPI_Send(data, count, datatype, i, 0, communicator);
      }
    }
  } else {
    // If we are a receiver process, receive the data from the root
    MPI_Recv(data, count, datatype, root, 0, communicator,
             MPI_STATUS_IGNORE);
  }
}
The root process sends the data to everyone else while the others receive from the root
process. Easy, right? If you download the code and run the program, the program will
print output like this:
Believe it or not, our function is actually very inefficient! Imagine that each process has
only one outgoing/incoming network link. Our function is only using one network link
from process zero to send all the data. A smarter implementation is a tree-based
communication algorithm that can use more of the available network links at once. For
example:
In this illustration, process zero starts off with the data and sends it to process one.
Similar to our previous example, process zero also sends the data to process two in the
second stage. The difference with this example is that process one is now helping out
the root process by forwarding the data to process three. During the second stage, two
network connections are being utilized at a time. The network utilization doubles at
every subsequent stage of the tree communication until all processes have received the
data.
Do you think you can code this? Writing this code is a bit outside of the purpose of the
lesson. If you are feeling brave, Parallel Programming with MPI is an excellent book
with a complete example of the problem with code.
Comparison of MPI_Bcast with MPI_Send and MPI_Recv
The MPI_Bcast implementation utilizes a similar tree broadcast algorithm for good
network utilization. How does our broadcast function compare to MPI_Bcast? We can
run compare_bcast, an example program included in the lesson code. Before looking at
the code, let's first go over one of MPI's timing functions: MPI_Wtime. MPI_Wtime
takes no arguments, and it simply returns a floating-point number of seconds since a set
time in the past. Similar to C's time function, you can call MPI_Wtime multiple times
throughout your program and subtract the results to obtain the timing of code segments.
Let's take a look at the code that compares my_bcast to MPI_Bcast:
for (i = 0; i < num_trials; i++) {
  // Time my_bcast
  // Synchronize before starting timing
  MPI_Barrier(MPI_COMM_WORLD);
  total_my_bcast_time -= MPI_Wtime();
  my_bcast(data, num_elements, MPI_INT, 0, MPI_COMM_WORLD);
  // Synchronize again before obtaining final time
  MPI_Barrier(MPI_COMM_WORLD);
  total_my_bcast_time += MPI_Wtime();
  // Time MPI_Bcast
  MPI_Barrier(MPI_COMM_WORLD);
  total_mpi_bcast_time -= MPI_Wtime();
  MPI_Bcast(data, num_elements, MPI_INT, 0, MPI_COMM_WORLD);
  MPI_Barrier(MPI_COMM_WORLD);
  total_mpi_bcast_time += MPI_Wtime();
}
In this code, num_trials is a variable stating how many timing experiments should be
executed. We keep track of the accumulated time of both functions in two different
variables. The average times are printed at the end of the program. To see the entire
code, just download the lesson code and look at compare_bcast.c.
When you use the run script to execute the code, the output will look similar to this.
The run script executes the code using 16 processors, 100,000 integers per broadcast,
and 10 trial runs for timing results. As you can see, my experiment using 16 processors
connected via Ethernet shows significant timing differences between our naive
implementation and MPI's implementation. Here is what the timing results look like at
all scales.
Processors my_bcast MPI_Bcast
2 0.0344 0.0344
4 0.1025 0.0817
8 0.2385 0.1084
16 0.5109 0.1296
As you can see, there is no difference between the two implementations at two
processors. This is because MPI_Bcast's tree implementation does not provide any
additional network utilization when using two processors. However, the differences can
clearly be observed when going up to as few as 16 processors.
Try running the code yourself and experiment at larger scales!
Conclusions / Up Next
Feel a little better about collective routines? In the next MPI tutorial, I go over other
essential collective communication routines: gathering and scattering.
For all beginner lessons, go to the beginner MPI tutorial.
MPI Scatter, Gather, and Allgather
In the previous lesson, we went over the essentials of collective communication. We
covered the most basic collective communication routine: MPI_Bcast. In this lesson,
we are going to expand on collective communication routines by going over two very
important routines: MPI_Scatter and MPI_Gather. We will also cover a variant of
MPI_Gather, known as MPI_Allgather. The code for this tutorial is available here.
An Introduction to MPI_Scatter
MPI_Scatter is a collective routine that is very similar to MPI_Bcast (If you are
unfamiliar with these terms, please read the previous lesson). MPI_Scatter involves a
designated root process sending data to all processes in a communicator. The primary
difference between MPI_Bcast and MPI_Scatter is small but important. MPI_Bcast
sends the same piece of data to all processes while MPI_Scatter sends chunks of an
array to different processes. Check out the illustration below for further clarification.
In the illustration, MPI_Bcast takes a single data element at the root process (the red
box) and copies it to all other processes. MPI_Scatter takes an array of elements and
distributes the elements in the order of process rank. The first element (in red) goes to
process zero, the second element (in green) goes to process one, and so on. Although
the root process (process zero) contains the entire array of data, MPI_Scatter will copy
the appropriate element into the receiving buffer of the process. Here is what the
function prototype of MPI_Scatter looks like.
MPI_Scatter(void* send_data, int send_count, MPI_Datatype send_datatype,
            void* recv_data, int recv_count, MPI_Datatype recv_datatype,
            int root, MPI_Comm communicator)
Yes, the function looks big and scary, but let's examine it in more detail. The first
parameter, send_data, is an array of data that resides on the root process. The second
and third parameters, send_count and send_datatype, dictate how many elements of a
specific MPI datatype will be sent to each process. If send_count is one and
send_datatype is MPI_INT, then process zero gets the first integer of the array, process
one gets the second integer, and so on. If send_count is two, then process zero gets the
first and second integers, process one gets the third and fourth, and so on. In practice,
send_count is often equal to the number of elements in the array divided by the number
of processes. What's that you say? The number of elements isn't divisible by the
number of processes? Don't worry, we will cover that in a later lesson.
The receiving parameters of the function prototype are nearly identical to the
sending parameters. The recv_data parameter is a buffer of data that can hold
recv_count elements that have a datatype of recv_datatype. The last parameters, root
and communicator, indicate the root process that is scattering the array of data and the
communicator in which the processes reside.
An Introduction to MPI_Gather
MPI_Gather is the inverse of MPI_Scatter. Instead of spreading elements from one
process to many processes, MPI_Gather takes elements from many processes and
gathers them to one single process. This routine is highly useful to many parallel
algorithms, such as parallel sorting and searching. Below is a simple illustration of this
algorithm.
Similar to MPI_Scatter, MPI_Gather takes elements from each process and gathers
them to the root process. The elements are ordered by the rank of the process from
which they were received. The function prototype for MPI_Gather is identical to that of
MPI_Scatter.
MPI_Gather(void* send_data, int send_count, MPI_Datatype send_datatype,
           void* recv_data, int recv_count, MPI_Datatype recv_datatype,
           int root, MPI_Comm communicator)
In MPI_Gather, only the root process needs to have a valid receive buffer. All other
calling processes can pass NULL for recv_data. Also, don't forget that the recv_count
parameter is the count of elements received per process, not the total summation of
counts from all processes. This can often confuse beginning MPI programmers.
Computing Average of Numbers with MPI_Scatter and MPI_Gather
In the code for this lesson, I have provided an example program that computes the
average across all numbers in an array. The program is in avg.c. Although the program
is quite simple, it demonstrates how one can use MPI to divide work across processes,
perform computation on subsets of data, and then aggregate the smaller pieces into the
final answer. The program takes the following steps:
1. Generate a random array of numbers on the root process (process 0).
2. Scatter the numbers to all processes, giving each process an equal amount of
numbers.
3. Each process computes the average of their subset of the numbers.
4. Gather all averages to the root process. The root process then computes the
average of these numbers to get the final average.
The main part of the code with the MPI calls looks like this:
if (world_rank == 0) {
  rand_nums = create_rand_nums(elements_per_proc * world_size);
}
// Create a buffer that will hold a subset of the random numbers
float* sub_rand_nums = malloc(sizeof(float) * elements_per_proc);
// Scatter the random numbers to all processes
MPI_Scatter(rand_nums, elements_per_proc, MPI_FLOAT, sub_rand_nums,
            elements_per_proc, MPI_FLOAT, 0, MPI_COMM_WORLD);
// Compute the average of your subset
float sub_avg = compute_avg(sub_rand_nums, elements_per_proc);
// Gather all partial averages down to the root process
float* sub_avgs = NULL;
if (world_rank == 0) {
  sub_avgs = malloc(sizeof(float) * world_size);
}
MPI_Gather(&sub_avg, 1, MPI_FLOAT, sub_avgs, 1, MPI_FLOAT, 0,
           MPI_COMM_WORLD);
// Compute the total average of all numbers.
if (world_rank == 0) {
  float avg = compute_avg(sub_avgs, world_size);
}
At the beginning of the code, the root process creates an array of random numbers.
When MPI_Scatter is called, each process now contains elements_per_proc elements of
the original data. Each process computes the average of their subset of data and then the
root process gathers each individual average. The total average is computed on this
much smaller array of numbers.
Using the run script included in the code for this lesson, the output of your program
should be similar to the following. Note that the numbers are randomly generated, so
your final result might be different from mine.
MPI_Allgather and Modification of Average Program
So far, we have covered two MPI routines that perform many-to-one or one-to-many
communication patterns, which simply means that many processes send/receive to one
process. Oftentimes it is useful to be able to send many elements to many processes (i.e.
a many-to-many communication pattern). MPI_Allgather has this characteristic.
Given a set of elements distributed across all processes, MPI_Allgather will gather all
of the elements to all the processes. In the most basic sense, MPI_Allgather is an
MPI_Gather followed by an MPI_Bcast. The illustration below shows how data is
distributed after a call to MPI_Allgather.
Just like MPI_Gather, the elements from each process are gathered in order of their
rank, except this time the elements are gathered to all processes. Pretty easy, right? The
function declaration for MPI_Allgather is almost identical to MPI_Gather, with the
difference that there is no root process in MPI_Allgather.
MPI_Allgather(void* send_data, int send_count, MPI_Datatype send_datatype,
              void* recv_data, int recv_count, MPI_Datatype recv_datatype,
              MPI_Comm communicator)
I have modified the average computation code to use MPI_Allgather. You can view the
source in all_avg.c from the lesson code. The main difference in the code is shown
below.
// Gather all partial averages down to all the processes
float* sub_avgs = (float*)malloc(sizeof(float) * world_size);
MPI_Allgather(&sub_avg, 1, MPI_FLOAT, sub_avgs, 1, MPI_FLOAT,
              MPI_COMM_WORLD);
// Compute the total average of all numbers.
float avg = compute_avg(sub_avgs, world_size);
The partial averages are now gathered to everyone using MPI_Allgather. The averages
are now printed off from all of the processes. Example output of the program should
look like the following:
As you may have noticed, the only difference between all_avg.c and avg.c is that
all_avg.c prints the average across all processes with MPI_Allgather.
Up Next
In the next lesson, I will cover some of the more complex collective communication
algorithms. Stay tuned! Feel free to leave any comments or questions about the lesson.
For all beginner lessons, go to the beginner MPI tutorial.