
Parallel Programming in Python with Message Passing Interface (mpi4py)

Did you know you can write parallel Python code that will run on your laptop and on a supercomputer? You can, and it’s not as difficult as you might expect. If you already write code for asynchronous parallelization, you won’t even have to do much restructuring.

High Performance Computing (HPC) distributes pieces of a job across thousands of CPUs (in contrast to the 4–8 in your laptop) to achieve dramatic performance increases. The CPUs communicate and pass data using the Message Passing Interface (MPI). The same principle applies on your laptop when you write code that distributes pieces of a job to multiple cores to run simultaneously. This article demonstrates how to use MPI with Python to write code that can run in parallel on your laptop or on a supercomputer.

Install MPI

You will need to install an MPI implementation for your operating system. For Windows users I recommend installing MPI directly from Microsoft. For Mac and Linux users I suggest installing OpenMPI. Windows users must also add the MPI installation directory to the Path variable.

To test your install, type mpiexec (Windows) or mpirun (Mac/Linux, but check the install docs) in a terminal window and press ‘Enter’. If the installation succeeded, this prints a message with usage information. While you’re at the terminal, also type python and press ‘Enter’. This should start an interactive Python session. If it does not, you need to install or configure Python.

Install mpi4py

mpi4py is a Python module that lets you interact with your MPI application (mpiexec or mpirun). Install it the same way as any other Python module (pip install mpi4py, etc.).

Once you have MPI and mpi4py installed you’re ready to get started!

A Basic Example

Running a Python script with MPI is a little different from what you’re likely used to. With mpiexec and mpirun, every processor runs every line of the script unless specified otherwise. Let’s make a ‘hello world’ example to demonstrate the MPI basics.

Create a new Python script (.py file). Import mpi4py and use MPI.COMM_WORLD to get information about all the processors available to run your script (this number is passed to the MPI application when the script is called). COMM_WORLD gives access to the number of processes available to distribute work across, and information about each of them: size gives the total number of ranks (processors) allocated to run the script, and rank gives the identifier of the processor currently executing the code. The print statement in the script below runs once on each processor used in the job.
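The original gist isn’t reproduced here, but a minimal sketch of the script (call it mpi_hello_world.py, matching the run command below) might look like this:

from mpi4py import MPI

comm = MPI.COMM_WORLD
size = comm.Get_size()  # total number of ranks running the script
rank = comm.Get_rank()  # identifier of the current process

print('Hello world from rank', rank, 'of', size)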

Execute this script by opening a terminal, navigating to the directory containing the script, and executing the command:

mpiexec -n 4 python mpi_hello_world.py

The -n 4 flag specifies the number of processors to use. In this instance I’m using 4 processors, which means the print statement will execute 4 times. Notice that the ranks don’t print out in numerical order, so you’ll need to make sure your code can run asynchronously. In other words, it’s not possible to know which processor will start or complete first, so your code must be structured so that results don’t depend on values that may be calculated on a different processor.

Hello world from rank 2 of 4
Hello world from rank 3 of 4
Hello world from rank 0 of 4
Hello world from rank 1 of 4

Now, update the script so that it prints out different messages for different ranks. This is done using logical statements (if/elif/else).
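A sketch of the updated script, consistent with the output shown below:

from mpi4py import MPI

comm = MPI.COMM_WORLD
size = comm.Get_size()
rank = comm.Get_rank()

# print a different message depending on which rank is executing
if rank == 0:
    print('First rank')
elif rank == 1:
    print('Second rank')
else:
    print('Not first or second rank')

print('Hello world from rank', rank, 'of', size)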

We now get different messages for ranks 0 and 1.

First rank
Hello world from rank 0 of 4
Not first or second rank
Hello world from rank 2 of 4
Not first or second rank
Hello world from rank 3 of 4
Second rank
Hello world from rank 1 of 4

Send and Receive Arrays

The send and recv functions transmit data from one processor to another and receive data from a processor, respectively. Many data types can be passed with these functions; this example focuses specifically on sending and receiving numpy arrays. The Send and Recv functions (notice the capital ‘S’ and ‘R’) are specific to numpy arrays. For examples of the basic send and recv, see the mpi4py documentation.

In a previous article, Asynchronous Parallel Programming in Python with Multiprocessing (towardsdatascience.com), I demonstrated parallel processing with the multiprocessing module. We’ll use the same function in this example.

Create two new Python scripts in the same directory. Name one my_function.py and the other mpi_my_function.py. In my_function.py, implement the function from the article linked above; it should look like the sketch below. This is a simple function, with a pause to simulate a long run time.
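A sketch of my_function.py, reconstructed to be consistent with the results printed at the end of this article (the sleep duration is an assumption):

# my_function.py
import time

def my_function(param1, param2, param3):
    # pause to simulate a long run time (duration is arbitrary)
    time.sleep(2)
    result = param1**2 * param2 + param3
    return result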

The next few paragraphs explain the parallelization procedure for my_function; the code is given in the commented snippets below. In mpi_my_function.py, import my_function, mpi4py, and numpy. Then get the size and rank from MPI.COMM_WORLD. Use numpy to create random parameter values for my_function. The params variable will be available on all processors. Now divide up the list of parameters, assigning a chunk of the array to each process (or rank). I’ve deliberately made the number of rows in params (15) not evenly divisible by the number of processors (4), so we have to do a little extra math to break up params. After this step, each processor has variables indexing the start and stop locations of its chunk of the params array.
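A sketch of the first part of mpi_my_function.py: the imports, the parameter array, and the chunking math. Seeding the random generator is my assumption; it guarantees every rank builds an identical params array.

# mpi_my_function.py (part 1): setup and chunking
from mpi4py import MPI
import numpy as np
from my_function import my_function

comm = MPI.COMM_WORLD
size = comm.Get_size()
rank = comm.Get_rank()

# identical random parameters on every rank (the seed is an assumption)
np.random.seed(42)
params = np.random.random((15, 3)) * 100.0

# split 15 rows across 'size' ranks; the first 'remainder'
# ranks each take one extra row
count = params.shape[0] // size
remainder = params.shape[0] % size
if rank < remainder:
    start = rank * (count + 1)
    stop = start + count + 1
else:
    start = rank * count + remainder
    stop = start + count

local_params = params[start:stop, :]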

We want the final result to be an array containing the parameter values and function result for each parameter set. Create an empty array, local_results, with the same number of rows as the rank’s chunk of the parameter array and one extra column to store the results. Then run my_function for each parameter set and save the result in the result array (local_results). Now each processor has results for its chunk of the params array.
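Continuing the sketch, each rank evaluates its own chunk and stores the parameters alongside the result:

# mpi_my_function.py (part 2): evaluate this rank's chunk
local_results = np.empty((local_params.shape[0], params.shape[1] + 1))
local_results[:, :params.shape[1]] = local_params
for i in range(local_params.shape[0]):
    p = local_params[i]
    local_results[i, -1] = my_function(p[0], p[1], p[2])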

The results must be gathered to create a final array with results for each of the original parameter combinations. Send the local_results array from each rank to rank 0, where they are combined into a single array. When using Send, specify the rank to send to, dest, and specify a tag (a unique integer) so the receiving rank knows which value to retrieve (this is important if you end up executing more than one Send).
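In the sketch, every rank other than 0 sends its chunk (the tag value 14 is arbitrary; it just has to match on both ends):

# mpi_my_function.py (part 3): workers send their results to rank 0
if rank > 0:
    comm.Send(local_results, dest=0, tag=14)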

For the receiving rank (0), loop through all the other ranks, create an empty array the size of the array to be received, and retrieve the sent values from each rank with Recv, specifying the rank to receive from and the tag. Once a chunk is retrieved, stack it onto the existing values. Print out the final array to make sure it looks correct. And we’re done!
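The final part of the sketch continues the if statement above: rank 0 recomputes each sender’s chunk size with the same split logic, receives into a matching buffer, and stacks everything together.

# mpi_my_function.py (part 4): rank 0 gathers and prints everything
else:
    results = np.copy(local_results)
    for source in range(1, size):
        # recompute the sender's chunk size with the same split logic
        rows = count + 1 if source < remainder else count
        buffer = np.empty((rows, params.shape[1] + 1))
        comm.Recv(buffer, source=source, tag=14)
        results = np.vstack((results, buffer))
    print('results')
    print(results)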

Run the script above with:

mpiexec -n 4 python mpi_my_function.py

The result should resemble:

results
[[7.58886620e+00 5.62618310e+01 9.09064771e+01 3.33107541e+03]
[2.76707037e+01 4.03218572e+01 2.20310537e+01 3.08951805e+04]
[7.82729169e+01 9.40939134e+01 7.24046134e+01 5.76552834e+05]
[9.88496826e+01 6.91320832e+00 1.59490375e+01 6.75667032e+04]
[8.94286742e+01 8.88605014e+01 5.31814181e+01 7.10713954e+05]
[3.83757552e+01 4.64666288e+01 3.72791712e+01 6.84686177e+04]
[9.33796247e+01 1.71058163e+01 2.94036272e+00 1.49161456e+05]
[1.49763382e+01 6.77803268e+01 7.62249839e+01 1.52787224e+04]
[7.42368720e+01 8.45623531e+01 6.27481273e+01 4.66095445e+05]
[6.76429554e+01 5.95075836e+01 9.82287031e+00 2.72290902e+05]
[4.94157194e+00 7.38840592e+01 3.70077813e+00 1.80788546e+03]
[2.71179540e+01 2.94973140e+00 2.86632603e+01 2.19784685e+03]
[2.92793532e+01 9.90621647e+01 9.45343344e+01 8.50185987e+04]
[1.20975353e+01 8.89643839e+01 7.13313160e+01 1.30913009e+04]
[8.45193908e+01 4.89884544e+01 5.67737042e+01 3.50007141e+05]]

Conclusion

You’ve probably noticed it took roughly 40 lines of code to run a 4-line function in parallel. Seems like a bit of overkill. We parallelized a simple function, but this is actually a more complicated example because of the manipulation required to get the desired output format. This example serves as a parallelization template for more elaborate functions. Generally, I’ll write a Python module that runs my models/analyses, then set up and call one function (that does everything) from the module in the parallelization script above. You can run really complex analyses without adding much code to the script we’ve created here. Once you get this code working on your personal machine, it can run on most supercomputers without much additional work.

Happy parallelization!
