Parallel Programming in Python with Message Passing Interface (mpi4py)
Did you know you can write parallel Python code that will run on your laptop and on a supercomputer? You can, and it’s not as difficult as you might expect. If you already write code for asynchronous parallelization, you won’t even have to do much restructuring.
High Performance Computing (HPC) clusters distribute pieces of a job across thousands of CPUs (in contrast to the 4–8 in your laptop) to achieve dramatic performance increases. The CPUs communicate and pass data using the Message Passing Interface (MPI). The same principle is used on your laptop when you write code that distributes pieces of a job to multiple cores to run simultaneously. This article will demonstrate how to use MPI with Python to write code that can run in parallel on your laptop or a supercomputer.
Install MPI
You will need to install an MPI implementation for your operating system. For Windows users I recommend installing MPI directly from Microsoft. For Mac and Linux users I suggest installing OpenMPI. Windows users must also add the MPI installation directory to the Path variable.
To test your install, type mpiexec (Windows) or mpirun (Mac/Linux, but check the install docs) in a terminal window and press ‘Enter’. This generates a message with usage information if you have installed everything properly. While you’re at the terminal, also type python and press ‘Enter’. This should start an interactive Python session. If it does not, you need to install or configure Python.
Install mpi4py
mpi4py is a Python module that allows you to interact with your MPI application (mpiexec or mpirun). Install it the same as any other Python module (pip install mpi4py, etc.).
Once you have MPI and mpi4py installed, you’re ready to get started!
A Basic Example
Running a Python script with MPI is a little different than you’re likely used to. With mpiexec and mpirun, each line of code will be run by each processor unless specified otherwise. Let’s make a ‘hello world’ example to demonstrate the MPI basics.
Create a new Python script (.py file). Import mpi4py and use MPI.COMM_WORLD to get information about all the processors available to run your script (this number is passed to the MPI application when calling the script). COMM_WORLD gives access to the number of processes (ranks/processors) available to distribute work across, and to information about each processor. size gives the total number of ranks, or processors, allocated to run our script, and rank gives the identifier of the processor currently executing the code. The print statement below will print once for each processor used in the job.
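A minimal mpi_hello_world.py consistent with the description above might look like this sketch:

# mpi_hello_world.py -- minimal sketch of the script described above
from mpi4py import MPI

comm = MPI.COMM_WORLD   # communicator containing every processor running the job
size = comm.Get_size()  # total number of ranks allocated to the job
rank = comm.Get_rank()  # identifier of the rank executing this line

print('Hello world from rank', str(rank), 'of', str(size))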
Execute this script by opening a terminal, navigating to the directory containing the script, and executing the command:
mpiexec -n 4 python mpi_hello_world.py
The -n 4 flag specifies the number of processors to use. In this instance I’m using 4 processors, which means the print statement will execute 4 times. Notice that the ranks don’t print out in numerical order, so you’ll need to make sure your code can run asynchronously. In other words, it’s not possible to know which processor will start or finish first, so your code needs to be structured so that results don’t depend on values that may be calculated on a different processor.
Hello world from rank 2 of 4
Hello world from rank 3 of 4
Hello world from rank 0 of 4
Hello world from rank 1 of 4
Now, update the script so that it prints out different messages for different ranks. This is done using logical statements (if, elif, else).
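One way to structure the update, as a sketch consistent with the output below:

# mpi_hello_world.py (updated) -- sketch of the per-rank messages
from mpi4py import MPI

comm = MPI.COMM_WORLD
size = comm.Get_size()
rank = comm.Get_rank()

if rank == 0:
    print('First rank')
elif rank == 1:
    print('Second rank')
else:
    print('Not first or second rank')

print('Hello world from rank', str(rank), 'of', str(size))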
We now get different messages for ranks 0 and 1.
First rank
Hello world from rank 0 of 4
Not first or second rank
Hello world from rank 2 of 4
Not first or second rank
Hello world from rank 3 of 4
Second rank
Hello world from rank 1 of 4
Send and Receive Arrays
The send and recv functions send data from one processor to another and receive data from a processor, respectively. Many data types can be passed with these functions. This example is going to focus specifically on sending and receiving numpy arrays. The Send and Recv functions (notice the capital ‘S’ and ‘R’) are specific to numpy arrays. For examples of the basic send and recv, see the mpi4py documentation.
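As a quick illustration (a sketch, not part of the original example), rank 0 could send a numpy array to rank 1 like this:

# send_recv_demo.py -- hypothetical file name; run with: mpiexec -n 2 python send_recv_demo.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    data = np.arange(5, dtype='d')     # array to send
    comm.Send(data, dest=1, tag=11)    # send to rank 1 with tag 11
elif rank == 1:
    data = np.empty(5, dtype='d')      # buffer shaped to match the incoming array
    comm.Recv(data, source=0, tag=11)  # receive from rank 0, matching the tag
    print('Rank 1 received:', data)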
In a previous article, Asynchronous Parallel Programming in Python with Multiprocessing (towardsdatascience.com), I demonstrated parallel processing with the multiprocessing module. We’ll use the same function in this example.
Create two new Python scripts in the same directory. Name one my_function.py and the other mpi_my_function.py. In my_function.py, implement the function from the article linked above. This is a simple function, with a pause to simulate a long run time. Your script should look something like the sketch below.
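A sketch of my_function.py, assuming a simple calculation chosen to be consistent with the sample results shown at the end of the article:

# my_function.py
# A simple function with a pause to simulate a long run time.
import time

def my_function(param1, param2, param3):
    # Assumed calculation, consistent with the sample results printed later
    result = param1 ** 2 * param2 + param3
    time.sleep(2)  # simulate an expensive computation
    return result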
These paragraphs explain the parallelization procedure for my_function; the complete code (with comments) is given below. In mpi_my_function.py, import my_function, mpi4py, and numpy. Then get the size and rank from MPI.COMM_WORLD. Use numpy to create random parameter values for my_function. The params variable will be available on all processors. Now divide up the list of parameters, assigning a chunk of the array to each process (or rank). I’ve specifically made the number of rows in params (15) not evenly divisible by the number of processors (4), so we have to do a little extra math to break up params. Now each processor has start and stop variables indexing its chunk of the params array.
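For example, the split of 15 rows across 4 ranks can be checked with a quick stand-alone sketch of that index math (the complete script below uses the same logic):

# Stand-alone sketch of the index math: split 15 rows across 4 ranks.
n_rows, size = 15, 4
for rank in range(size):
    count = n_rows // size         # base number of rows per rank
    remainder = n_rows % size      # leftover rows go to the first ranks
    if rank < remainder:           # these ranks take one extra row
        start = rank * (count + 1)
        stop = start + count + 1
    else:
        start = rank * count + remainder
        stop = start + count
    print(rank, start, stop)       # rank 0: 0-4, rank 1: 4-8, rank 2: 8-12, rank 3: 12-15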
We want the final result to be an array with the parameter values and the function result for each parameter set. Create an empty array, local_results, with the same number of rows as the processor’s chunk of the parameter array and one extra column to store the results. Then run my_function for each parameter set and save the result in the result array (local_results). Now each processor has results for its chunk of the params array.
The results must be gathered to create a final array with results for each of the original parameter combinations. Send the local_results array from each rank to rank 0, where they are combined into a single array. When using Send, specify the rank to send to, dest, and specify a tag (a unique integer) so the receiving rank knows which value to retrieve (this is important if you end up executing more than one Send).
For the receiving rank (0), loop through all the other ranks, create an empty array the size of the array to be received, and retrieve the sent values from each rank with Recv, specifying the rank to receive from and the tag. Once the array is retrieved, add it to the existing values. Print out the final array to make sure it looks correct. And we’re done!
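Putting the steps together, here is a sketch of mpi_my_function.py (the random seed, tag value, and variable names are assumptions; the original gist may differ in such details):

# mpi_my_function.py
# Sketch of the parallelization script described above.
import numpy as np
from mpi4py import MPI
from my_function import my_function

comm = MPI.COMM_WORLD
size = comm.Get_size()  # number of ranks running the script
rank = comm.Get_rank()  # this processor's identifier

# 15 random parameter sets (3 parameters each), built on every rank.
# The seed is an assumption so all ranks hold identical params arrays.
np.random.seed(42)
params = np.random.random((15, 3)) * 100.0

# Divide the rows of params among the ranks (15 is not evenly divisible by 4)
count = params.shape[0] // size
remainder = params.shape[0] % size
if rank < remainder:
    start = rank * (count + 1)
    stop = start + count + 1
else:
    start = rank * count + remainder
    stop = start + count
local_params = params[start:stop, :]

# One row per local parameter set: the 3 parameters plus a column for the result
local_results = np.empty((local_params.shape[0], local_params.shape[1] + 1))
local_results[:, :local_params.shape[1]] = local_params
for i, row in enumerate(local_params):
    local_results[i, -1] = my_function(row[0], row[1], row[2])

if rank > 0:
    # Worker ranks send their chunk of results to rank 0; the tag labels the message
    comm.Send(local_results, dest=0, tag=14)
else:
    # Rank 0 starts with its own results, then receives every other rank's chunk
    results = np.copy(local_results)
    for source in range(1, size):
        # Recompute how many rows this source rank was assigned
        rows = count + 1 if source < remainder else count
        buffer = np.empty((rows, local_results.shape[1]))
        comm.Recv(buffer, source=source, tag=14)
        results = np.vstack((results, buffer))
    print('results')
    print(results)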
Run the script above with:
mpiexec -n 4 python mpi_my_function.py
The result should resemble:
results
[[7.58886620e+00 5.62618310e+01 9.09064771e+01 3.33107541e+03]
[2.76707037e+01 4.03218572e+01 2.20310537e+01 3.08951805e+04]
[7.82729169e+01 9.40939134e+01 7.24046134e+01 5.76552834e+05]
[9.88496826e+01 6.91320832e+00 1.59490375e+01 6.75667032e+04]
[8.94286742e+01 8.88605014e+01 5.31814181e+01 7.10713954e+05]
[3.83757552e+01 4.64666288e+01 3.72791712e+01 6.84686177e+04]
[9.33796247e+01 1.71058163e+01 2.94036272e+00 1.49161456e+05]
[1.49763382e+01 6.77803268e+01 7.62249839e+01 1.52787224e+04]
[7.42368720e+01 8.45623531e+01 6.27481273e+01 4.66095445e+05]
[6.76429554e+01 5.95075836e+01 9.82287031e+00 2.72290902e+05]
[4.94157194e+00 7.38840592e+01 3.70077813e+00 1.80788546e+03]
[2.71179540e+01 2.94973140e+00 2.86632603e+01 2.19784685e+03]
[2.92793532e+01 9.90621647e+01 9.45343344e+01 8.50185987e+04]
[1.20975353e+01 8.89643839e+01 7.13313160e+01 1.30913009e+04]
[8.45193908e+01 4.89884544e+01 5.67737042e+01 3.50007141e+05]]
Conclusion
You’ve probably noticed it took roughly 40 lines of code to run a 4-line function in parallel. That seems like a bit of overkill. We parallelized a simple function, but this is actually a more complicated example because of the manipulation required to get the desired output format. This example serves as a parallelization template for more elaborate functions. Generally, I’ll write a Python module that runs my models/analyses, then set up and call one function (that does everything) from the module in the parallelization script (like the one above). You can run really complex analyses without adding much code to the script we’ve created here. Once you get this code working on your personal machine, it can be run on most supercomputers without much additional work.
Happy parallelization!