Setting up mpi4py
Today I’ve been trying to get emcee up and running and to use it with MPI. There were a number of obstacles that I needed to overcome to get MPI working on multiple machines.
I have an anaconda installation of python, and mpi4py is one of the packages they support (more below), so I installed mpi4py using the following command:
But when I tried running it I got the following error:
Fortunately for me, someone else encountered a similar error. Until anaconda fixes their bug, an inelegant hack is to make a symbolic link pointing from /opt/ana… to the correct path. I have anaconda python installed for all users, so instead of pointing the symbolic link to my home directory, I pointed it to the directory containing the anaconda python distribution, the Applications directory:
I did this on all 5 machines that I plan on testing MPI on. The next step is to create a hostfile containing the names or IP addresses of all hosts that will be used for computing. This is just a simple text file containing nothing more than the name of the host and the number of slots, or threads, to run on it at any given time:
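As a rough illustration of the format described above, here is a minimal Python sketch that builds the text of such a hostfile. It assumes Open MPI's `hostname slots=N` syntax; the function name `render_hostfile`, the host names and the slot counts are placeholders I made up for the example, not my actual setup.

```python
def render_hostfile(slots_by_host):
    """Return hostfile text with one 'hostname slots=N' line per host.

    Assumes the Open MPI hostfile syntax; other MPI implementations
    may expect a different layout.
    """
    lines = []
    for host, slots in slots_by_host.items():
        if slots < 1:
            raise ValueError(f"{host}: slots must be at least 1")
        lines.append(f"{host} slots={slots}")
    return "\n".join(lines) + "\n"


# Placeholder hosts and slot counts, chosen only for illustration.
hosts = {"host1.example.edu": 4, "host2.example.edu": 4}
hostfile_text = render_hostfile(hosts)
print(hostfile_text, end="")
```

Saving the returned text to a file and passing that file to `mpirun` with `--hostfile` is the usual way to use it with Open MPI.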
The example tests in the mpi4py documentation failed for me, but @jbornschein put together a nice github repository with some example code. I ran the 01-helloworld example, specifying the hosts I wanted to distribute the jobs to:
After a rather long delay, an error was returned saying it couldn’t connect on the default port. That makes sense, since we use non-default ports for our machines. After some searching, it doesn’t seem like there’s any way to specify the port in the hostfile, so instead I used an ssh config file. I’m using Mac OS X 10.10, and setting a config file up was quite quick and easy to do:
and for contents add
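To give an idea of what per-host port entries look like, here is a small Python sketch that renders `Host` and `Port` stanzas in the format OpenSSH reads from `~/.ssh/config`. The function name `render_ssh_config`, the host names and the port numbers are all hypothetical and stand in for the real values.

```python
def render_ssh_config(port_by_host):
    """Return ssh config text with one Host/Port stanza per host."""
    stanzas = []
    for host, port in port_by_host.items():
        if not 1 <= port <= 65535:
            raise ValueError(f"{host}: invalid port {port}")
        stanzas.append(f"Host {host}\n    Port {port}\n")
    # A blank line between stanzas keeps the file readable.
    return "\n".join(stanzas)


# Placeholder hosts and ports, chosen only for illustration.
ports = {"host1.example.edu": 2222, "host2.example.edu": 2200}
ssh_config_text = render_ssh_config(ports)
print(ssh_config_text, end="")
```

With entries like these in place, a plain `ssh host1.example.edu` picks up the right port without any extra flags, which is presumably also why the MPI launcher no longer tries the default port.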
After adding the hosts and ports to the config file, I didn’t need to restart, or open a new terminal window or anything else that might be expected. Reissuing the same command in the same terminal window returned an immediate result:
However, as you can see, things didn’t end very well. I had to kill the job, otherwise I ended up with a strange error:
I downloaded the source code for mpi4py, changed directories into the downloaded directory and tried running their tests:
Running runtests.py when the hostfile contains only host1.example.edu or host2.example.edu works fine. However, when I include both hosts I end up with the same error as the helloworld.py example. This error doesn’t crop up with the mpi4py helloworld example, just the mpi4py-examples helloworld. The difference between the failed and successful versions is that the version that fails waits for all jobs to finish up at the end through comm.Barrier().
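To make the difference concrete, here is a minimal sketch of a hello-world that ends with a barrier. It is my own illustration, not the code from either repository; the function name `hello_then_sync` is made up, and it only relies on the mpi4py calls `Get_rank()`, `Get_size()`, `Barrier()` and `MPI.Get_processor_name()`.

```python
def hello_then_sync(comm, node_name):
    """Build a greeting for this rank, then wait at a barrier."""
    message = (
        f"Hello from rank {comm.Get_rank()} of {comm.Get_size()} "
        f"on {node_name}"
    )
    # Barrier() blocks until every process in the communicator has
    # called it, so one unreachable process stalls all the others.
    comm.Barrier()
    return message


if __name__ == "__main__":
    try:
        from mpi4py import MPI
    except ImportError:
        print("mpi4py is not installed; skipping the MPI run")
    else:
        print(hello_then_sync(MPI.COMM_WORLD, MPI.Get_processor_name()))
```

Without the `Barrier()` call each process can print its greeting and exit on its own, which would explain why the simpler helloworld never hit the error.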
This article, which discusses the communication, was helpful: it appears the two computers, host1 and host2, can communicate with the machine I am running the commands from, but they cannot communicate with each other.
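One quick way to probe this kind of problem is to test, from one compute host, whether a TCP connection to the other can be opened at all. The sketch below uses only the Python standard library; the function name `can_connect` is my own, and which host and port to probe is left to the caller.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) can be opened.

    Returns False on refusal, timeout, or a name-resolution failure.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unknown host names.
        return False
```

Running something like `can_connect("host2.example.edu", port)` on host1 (and the reverse on host2) with the relevant SSH port gives a first hint. A successful check on one port does not prove MPI traffic will get through, since the MPI implementation may open other ports of its own, but a failure points at the network or firewall rather than at mpi4py.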
I disabled the firewall through System Preferences -> Security & Privacy. This worked for the helloworld.py example, but not for tests/runtests.py. I should also mention that host1 has 12 cores and host2 has 8 cores. While testing out my hostfile I had reduced the number of “slots” on each to 4 and it worked. When I increased the number of slots on either host above 4, the connection failed error message reappeared when running helloworld.py.
It looks like part of the problem may be that host1 and host2 are using different SSH ports, and therefore cannot communicate with each other. I switched to using two host machines that are using the same ports for SSH, and both helloworld.py and runtests.py finished up successfully.
However, adding a third machine is now causing issues…
The cause turned out to be that the firewall was still on despite my turning it off through the GUI in System Preferences. Simply turning the firewall off through the System Preferences GUI worked on 2 of the 3 machines, so I’m not sure why it got hung up on the third. If you’re having similar problems, try taking a look at either “All Messages” or “secure.log” in the console (or /var/log/secure.log) on the Mac host. If you see lines stating that the Firewall refused connection, this is your problem too. Simply restarting the system did the trick. I’ve now added several more machines and have executed helloworld.py and tests/runtests.py successfully on 96 cores!