MPI (Message Passing Interface)

This is a short introduction to the Message Passing Interface (MPI), a system designed to enable parallel programming through communication on distributed-memory machines. MPI has become the standard for multi-processor programming of code that runs on a variety of machines, from small Beowulf installations to shared-memory high-performance machines and hybrid clusters such as the ones at the Centre for Advanced Computing. Because of the complexity of the system, a detailed guide or tutorial is beyond the scope of this FAQ; this is not an introduction to MPI programming. References and links for further details are given.

What is the Message Passing Interface (MPI)?

The Message Passing Interface is a communication system that was designed by a group of researchers to supply programmers with a standard for distributed-memory parallel programming that is portable and usable on a variety of platforms. The standardization process was initiated at a workshop in 1992, and a draft of the standard was available a year later. Version 1.0 was released in the summer of 1994. The current major version of the standard is MPI-3.

At present, the system is implemented in the form of "bindings" for Fortran, C, and C++. MPI-1 (i.e. the original version 1.0 of the standard) consisted of about 100 functions/subroutines which form the MPI core and were available for Fortran and C. MPI-2 approximately doubles the number of routines, adding new features (which we will not discuss in this FAQ), and introducing C++ bindings.

The number of essential MPI routines is quite limited. In our experience, well-functioning MPI programs can be written with a dozen or so. The routines are supplied in the form of libraries, header files and usually scripts for easy compilation. Almost any hardware/OS platform supplies a native MPI version these days, and public-domain versions are available as well.

Here is an overview of the main components of MPI-1:

  • Bindings for C and Fortran
  • Point-to-point communication (see the short sketch after these lists)
  • Collective operations
  • Process groups and communication domains
  • User-defined data types and packing
  • Process topologies
  • Management, Inquiry, and Profiling

MPI-2 adds to this:

  • C++ bindings
  • Dynamic processes
  • Parallel I/O
  • One-sided operations
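
The example program further down this page uses only collective operations for its communication; as a complement, here is a minimal C sketch of point-to-point communication (the variable names and the transferred value are our own choices; MPI_Send and MPI_Recv are standard MPI routines):

 #include <stdio.h>
 #include <mpi.h>

 /* Sketch: process 0 sends one integer to process 1.
    Assumes the program is started with at least two processes. */
 int main(int argc, char **argv)
 {
     int rank, value = 0;
     MPI_Init(&argc, &argv);
     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
     if (rank == 0) {
         value = 42;
         MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);      /* send to rank 1, tag 0 */
     } else if (rank == 1) {
         MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                  MPI_STATUS_IGNORE);                              /* receive from rank 0 */
         printf("rank 1 received %d\n", value);
     }
     MPI_Finalize();
     return 0;
 }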

What kind of system uses MPI?

MPI was designed to handle distributed-memory systems, i.e. clusters. This does not mean that its usage is restricted to such systems. In fact, there are good arguments for MPI running even better on shared-memory machines. It will of course also be usable on hybrids such as our clusters.

Since MPI provides a means of communication between different CPUs, it does not depend on a shared-memory architecture, as multi-threading systems such as OpenMP do. On the other hand, it can make use of shared memory for fast and improved communication.

Note that MPI's character as a distributed-memory system implies that multiple processes are started at the beginning of a run and, usually on different CPUs, run to completion. These processes share nothing: each has its own memory space, and any information exchange requires explicit communication of data, which is what MPI was designed for.

Using MPI

MPI is a set of subroutines that are used explicitly to communicate between processes. As such, MPI programs are truly "multi-processing". Parallelisation cannot be done automatically or semi-automatically as in "multi-threading" programs; instead, function and subroutine calls have to be inserted into the code and form an integral part of the program. Often it is beneficial to alter the algorithm of the code with respect to the serial version.

The need to include the parallelism explicitly in the program is both a curse and a blessing: while it means more work and requires more planning than multi-threading, it also often leads to more reliable and scalable code, since the code's behaviour is entirely in the hands of the programmer. Well-written MPI code can be made to scale to thousands of CPUs.

To create an MPI program, you need to do the following (a minimal C sketch follows this list):

  • Include appropriate header files for the definitions of variables and data structures. These are called mpif.h, mpi.h, and mpi++.h for Fortran, C, and C++, respectively.
  • Program the communication between processes in the form of calls to the MPI communication routines. These are commonly of the form MPI_* for Fortran and C, and MPI::* for C++.
  • Bind in the proper libraries at the linking stage of program compilation. This is usually done with the -lmpi option of the compiler/linker.
  • MPI programs also usually need a special runtime environment to be executed properly. This is commonly supplied by the machine vendor and is machine specific.
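
These steps can be seen together in a minimal C sketch (the file name hello_mpi.c and the printed message are our own choices; the MPI calls themselves are part of the standard):

 /* hello_mpi.c - minimal sketch of the steps listed above */
 #include <stdio.h>
 #include <mpi.h>                 /* header file with the MPI definitions for C */

 int main(int argc, char **argv)
 {
     int rank, size;
     MPI_Init(&argc, &argv);                  /* start up the MPI environment */
     MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* unique ID of this process */
     MPI_Comm_size(MPI_COMM_WORLD, &size);    /* total number of processes */
     printf("process %d of %d\n", rank, size);
     MPI_Finalize();                          /* clean up before the program ends */
     return 0;
 }

Linking against the MPI library and providing the runtime environment are taken care of by the compiler wrappers and the mpirun launcher described in the sections below.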

An Example

The working principle of MPI is perhaps best illustrated with a programming example. The following program, written in Fortran 90, computes the sum of the square roots of all integers from 0 up to a specified limit m:

 module mpi                      ! wraps the MPI header file for use by all routines
  include 'mpif.h'
 end module mpi

 module cpuids                   ! process rank, total number of processes, error flag
  integer :: myid, totps, ierr
 end module cpuids

 program example02
  use mpi
  use cpuids
  call mpiinit                   ! start MPI and determine rank and size
  call demo02                    ! do the actual (distributed) work
  call mpi_finalize(ierr)        ! shut down MPI before the program ends
  stop
 end

 subroutine mpiinit
  use mpi
  use cpuids
  call mpi_init( ierr )                          ! must precede all other MPI calls
  call mpi_comm_rank(mpi_comm_world,myid,ierr)   ! unique ID of the calling process
  call mpi_comm_size(mpi_comm_world,totps,ierr)  ! total number of processes
  return
 end

 subroutine demo02
  use mpi
  use cpuids
  integer :: m, i
  real*8  :: s, mys
  if(myid.eq.0) then             ! only the root process reads the input
   write(*,*)'how many terms?'
   read(*,*) m
  end if
  call mpi_bcast(m,1,mpi_integer,0,mpi_comm_world,ierr)   ! broadcast m to all processes
  mys=0.0d0
  do i=myid,m,totps              ! each process handles every totps-th term
   mys=mys+dsqrt(dfloat(i))
  end do
  write(*,*)'rank:', myid,'mys=',mys, ' m:',m
  s=0.0d0
  call mpi_reduce(mys,s,1,mpi_real8,mpi_sum,0,mpi_comm_world,ierr)   ! sum the partial results on root
  if(myid.eq.0) then
   write(*,*)'total sum: ', s
  end if
  return
 end

Some of the common tasks that need to be performed in every MPI program are done in the subroutine mpiinit in this program. Namely, we need to call the routine mpi_init to prepare the usage of MPI. This has to be done before any other MPI routine is called. The two calls to mpi_comm_size and mpi_comm_rank determine how many processes are running and what the unique ID number of the present (i.e., calling) process is. Both pieces of information are essential. The results are stored in the variables totps and myid, respectively. Note that these variables appear in a module cpuids so that they may be accessed from all routines that "use" that module.

The main work in the example is done in the subroutine demo02. Note that this routine does use the module cpuids. The first operation is to determine the maximum integer m in the sum by requesting input from the user. The if-clause if(myid.eq.0) then serves to restrict this I/O operation to only one process, the so-called "root process", usually chosen to be the one with rank (i.e. unique ID number) zero.

After this initial operation, communication becomes necessary, since only one process has the right value of m. This is done by a call to the MPI collective operation routine mpi_bcast, which has the effect of "broadcasting" the integer m. The call needs to be made by all processes, and after they have done so, all of them know m.

The sum over the square roots is then executed on each process in a slightly different manner. Each term is added to a local variable mys. A stride of totps (the number of processes) in the do-loop ensures that each process adds different terms to its local sum, by skipping all others. For instance, if there are 10 processes, process 0 will add the square-roots of 0,10,20,30,..., while process 7 will add the square-roots of 7,17,27,37,...

After the sums have been completed, further communication is necessary, since each process only has computed a partial, local sum. We need to collect these local sums into one total, and we do so by calling mpi_reduce. The effect of this call is to "reduce" a value local to each process to a variable that is local to only one process, usually the root process. We can do this in various ways, but in our case we choose to sum the values up by specifying mpi_sum in the function call. Afterwards, the total sum resides in the variable s, which is printed out by the root process.

The last operation done in our example is finalizing MPI usage by a call to mpi_finalize, which is necessary for proper program completion.

In this simple example, we have distributed the tasks of computing many square roots among processes, each of which only did a part of the work. We used communication to exchange information about the tasks that needed to be performed, and to collect results. This mode of programming is called "task parallel". Often it is necessary to distribute large amounts of data among processes as well, leading to "data parallel" programs. Of course, the distinction is not always clear.

Implementations

While MPI itself is a portable, platform independent standard, much like a programming language, the actual implementation is necessarily platform dependent since it has to take into account the architecture of the machine or cluster in question.

The most commonly used implementation of MPI on the Linux platform is called OpenMPI. The following considerations focus on this implementation.

Our machines are small to mid-sized shared-memory machines that form a cluster. Since the interconnect between the individual nodes is a bottleneck for efficient program execution, most of the MPI programs running on our machines are executed within a node. This allows processes to communicate rapidly through a so-called "shared-memory layer". Our cluster is therefore configured to preferably schedule the processes of a job within a single node.

Currently, the OpenMPI parallel environment used on our systems is supplied via the Compute Canada "CVMFS stack" and is available without any specific setup after login:

$ which mpirun
/cvmfs/soft.computecanada.ca/easybuild/software/2017/avx2/Compiler/intel2016.4/openmpi/2.1.1/bin/mpirun

$ mpirun --version
mpirun (Open MPI) 2.1.1

Other (older) versions are available through a standard module setup:

$ module avail openmpi
-------------------------------------- Compiler-dependent avx2 modules ---------------------------------------
   openmpi/1.6.5 (m)    openmpi/1.10.7 (m)    openmpi/2.1.1 (L,m,D)   openmpi/1.8.8 (m)    openmpi/2.0.2  (m)

$ module load openmpi/1.8.8
The following have been reloaded with a version change:
  1) openmpi/2.1.1 => openmpi/1.8.8

$ mpirun --version
mpirun (Open MPI) 1.8.8

Compiling MPI code

The compilation of MPI programs requires a few compiler options that direct the compiler to the location of header files and libraries. Since these switches are always the same, they have been collected in wrapper macros to avoid unnecessary typing. Each macro carries an mpi prefix in place of the normal compiler name. The commands are mpif90 for the Fortran compiler and mpicc for the C compiler. For instance, if a serial C program is compiled by

icc -O3 -c test.c

the corresponding parallel (MPI) program is compiled with the MPI wrapper by

mpicc -O3 -c test_mpi.c

In the linking stage, the mpi* wrapper macros also supply the proper MPI libraries. For example, the above MPI program would be linked with something like this:

mpicc -o test_mpi.exe test_mpi.o

Compiling and linking may also be combined by omitting the -c option and including the naming option (-o) in the compilation line.
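
For instance, the program above could be compiled and linked in a single step:

 mpicc -O3 -o test_mpi.exe test_mpi.c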

Here are the corresponding MPI macros for the commonly used compilers on our systems:

Language   serial compiler   MPI wrapper   default version
Fortran    ifort             mpif90        16.0.4
C          icc               mpicc         16.0.4
C++        icpc, icc         mpicxx        16.0.4
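
As a sketch, the Fortran example from earlier on this page, assuming it is stored in a file called example02.f90 (a file name of our choosing), would be compiled and linked with the Fortran wrapper in the same manner:

 mpif90 -O3 -o example02.exe example02.f90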

Running MPI programs

To run MPI programs, a special Runtime Environment is required. This includes commands for the control of multi-process jobs.

mpirun is used to start a multi-process run of a program; it is required to run MPI programs. The most commonly used command-line option is -np, which specifies the number of processes to be started. For instance, the following line will start the program test_mpi.exe with 9 processes:

mpirun -np 9 test_mpi.exe

The mpirun command offers additional options that are sometimes useful or required. Most tend to interfere with the scheduling of jobs in a multi-user environment such as ours and should be used with caution. Please consult the man pages for details.

Note that the use of a scheduler is mandatory for production jobs on our systems, so MPI programs are normally started from within a scheduler job script. For details about the scheduler and job submission on our machines and clusters, see the HowTo:Scheduler page of this wiki.

More Information

This is a very short overview to get you started. If you need more detailed documentation, try the Compute Canada docs at https://docs.computecanada.ca/wiki/MPI. Since they describe the same software stack that we use at the Centre for Advanced Computing, they are fully applicable here.

As already pointed out, this FAQ is not an introduction to MPI programming. The standard reference text on MPI is:

Marc Snir, Steve Otto, Steven Huss-Lederman, David Walker, and Jack Dongarra:
MPI - The Complete Reference (2nd edition), The MIT Press, Cambridge, Massachusetts, 2000;
2 volumes, ISBN 0-262-69215-5 and 0-262-69213-3

This text specifies all MPI routines and concepts, and includes a large number of examples. Most people will find it sufficient for all their needs.

A quite good online tutorial for MPI programming can be found at the Argonne National Lab site (http://www.mcs.anl.gov/research/projects/mpi/tutorial/gropp/talk.html).

There is also an official MPI web page (http://www.mpi-forum.org/), which contains the standards documents for MPI and gives access to the MPI Forum.

We conduct workshops on a regular basis, some of them devoted to MPI programming. They are announced on our web site. We might see you there sometime soon.

Some Tools

Standard debugging and profiling tools are designed for serial or multi-threaded programs. They do not handle multi-process runs very well.

Quite often, the best way to check the performance of an MPI program is to time it by inserting suitable routines. MPI supplies a "wall-clock" routine called MPI_WTIME(), which lets you determine how much actual time was spent in a specific segment of your code. Another method is calling the subroutines ETIME and DTIME, which can give you information about the actual CPU time used. However, it is advisable to carefully read the documentation before using them with MPI programs; in this case, refer to the Sun Studio 12: Fortran Library Reference (http://docs.oracle.com/cd/E19205-01/819-5259/).
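
As an illustration, here is a small, self-contained C sketch that brackets a code segment with MPI_Wtime() calls (the loop is only a placeholder for the code to be measured):

 #include <stdio.h>
 #include <mpi.h>

 int main(int argc, char **argv)
 {
     int rank, i;
     double t0, t1, s = 0.0;
     MPI_Init(&argc, &argv);
     MPI_Comm_rank(MPI_COMM_WORLD, &rank);

     t0 = MPI_Wtime();                    /* wall-clock time before the segment */
     for (i = 0; i < 1000000; i++)        /* placeholder for the real work */
         s += (double)i;
     t1 = MPI_Wtime();                    /* wall-clock time after the segment */

     printf("rank %d: segment took %f seconds (s=%g)\n", rank, t1 - t0, s);
     MPI_Finalize();
     return 0;
 }

Each process reports its own elapsed time, which also makes load imbalance between processes visible.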

Help

Send email to cac.help@queensu.ca. We have scientific programmers on staff who will probably be able to help you out. Of course, we can't do the coding for you but we do our best to get your code ready for parallel machines and clusters.