In a previous post, I described how to set up a cluster on the Amazon EC2 service using the starcluster package. In this post I will describe getting Python code up and running with MPI on a starcluster cluster on Amazon EC2.
####Starting up the cluster
First, we need to start our cluster instance. This can easily be done using starcluster by typing the following command:
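Assuming a cluster template is already defined in your starcluster config file, it looks something like this (`mycluster` is just a placeholder name):

```bash
starcluster start mycluster
```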
To SSH into the master node of the cluster:
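```bash
starcluster sshmaster mycluster
```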
The SSH port on my machine is different from the default, which caused some problems with starcluster.
One solution to this is to add an entry in ~/.ssh/config to specify the port to the master node:
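Something along these lines, with the placeholder replaced by the master node’s actual hostname:

```
Host ec2-XX-XX-XXX-XXX.compute-1.amazonaws.com
    Port 22
```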
But this means adding a new entry every time a new cluster is started! That’s not an acceptable solution.
I ended up changing the SSH port in /etc/services back to 22, and then modifying the /System/Library/LaunchDaemons/ssh.plist file so that sshd listens on a port other than 22.
Here’s what that looks like in OS X Yosemite:
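The relevant section is the `Sockets` entry; replacing the default `ssh` service name with an explicit port number (2123 here) changes the port sshd listens on:

```xml
<key>Sockets</key>
<dict>
    <key>Listeners</key>
    <dict>
        <key>SockServiceName</key>
        <string>2123</string>
    </dict>
</dict>
```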
Changing the port number to 2123 in ssh.plist means my computer only accepts incoming SSH connections on port 2123, while the default port for outgoing connections to other machines is still 22. Perfect!
####Setting up Python
Next, we want to get the proper version of python installed on our cluster. The instances that come with starcluster by default have a ton of useful tools already built in. Immediately after SSHing into the master node, the available tools are listed in the welcome message.
This is really great, but a lot of those versions are dated, and the APIs have changed considerably since. One option is to upgrade some of these libraries with pip:
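For example, to upgrade numpy and pandas:

```bash
pip install --upgrade numpy pandas
```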
One downside of simply upgrading the dependencies is that it takes a while: it took me 10 minutes and 7 seconds to start up a cluster and upgrade numpy and pandas. If you do not want to wait ten minutes every time you start up a cluster, another option is to use miniconda, a lightweight distribution that contains only conda and python, and then install only the necessary dependencies. I used miniconda when setting up travis-ci, which was covered briefly in this post.
When I switched from the version of python built into starcluster to miniconda, the total time to start up a cluster dropped from 10m7s to 4m57s. Switching to miniconda is not difficult; just type the following three lines once your cluster is started up and you have SSHed into the master node:
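Roughly the following; the installer URL here is the one for the 64-bit Linux Miniconda installer at the time of writing, so check the conda site for the current one:

```bash
wget http://repo.continuum.io/miniconda/Miniconda-latest-Linux-x86_64.sh
bash Miniconda-latest-Linux-x86_64.sh -b -p $HOME/miniconda
export PATH=$HOME/miniconda/bin:$PATH
```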
And then installing numpy and pandas is much faster than upgrading the versions that ship with starcluster:
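```bash
conda install --yes numpy pandas
```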
I created a starcluster startup script that sets up my environment, uploads all my code and input data to the cluster, installs and updates the python dependencies, and then submits the MPI job. See the end of this post for the script.
####Running MPI on the starcluster
Looking at the starcluster “Compile and run a Hello World OpenMPI program” example, it looks like we need to run things as the sgeadmin user in order to use MPI. Indeed, when I tried running things as root, I got an error message saying MPI couldn’t access files in the /root directory.
The /root directory only exists on the master node. We need to run our code from somewhere under /home, since that is the directory that is NFS-mounted on all the nodes. This may also mean that upgrading pandas and numexpr only took effect on the master node.
####Putting it all together
I made a script that starts up the cluster, copies over all of the code and input data I want to use, and then submits an MPI job. Here’s a sketch of what that script looks like (the cluster, directory, and file names are placeholders):
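```bash
#!/bin/bash
# Start the cluster, upload code and input data, set up python
# dependencies, and submit the MPI job.
CLUSTER=mycluster

# Start the cluster from the template defined in the starcluster config
starcluster start $CLUSTER

# Copy code and input data into the NFS-shared home directory
starcluster put $CLUSTER my_code /home/sgeadmin/my_code
starcluster put $CLUSTER input_data /home/sgeadmin/input_data

# Install miniconda and the python dependencies on the master node
starcluster sshmaster -u sgeadmin $CLUSTER \
    'wget http://repo.continuum.io/miniconda/Miniconda-latest-Linux-x86_64.sh &&
     bash Miniconda-latest-Linux-x86_64.sh -b -p $HOME/miniconda &&
     $HOME/miniconda/bin/conda install --yes numpy pandas'

# Submit the MPI job through the SGE queue
starcluster sshmaster -u sgeadmin $CLUSTER \
    'cd /home/sgeadmin/my_code && qsub -cwd -pe orte 4 evol_starcluster_qsub_test.sh'
```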
In the line that I use to run the job, I change into the directory containing my code, and then call qsub to execute my routine. The -cwd option tells qsub to run the job from the current working directory, and -pe orte 4 tells it to use the orte parallel environment with 4 cores. The contents of evol_starcluster_qsub_test.sh are just the program to be executed with MPI.
Here’s a sketch of those contents, with my_mpi_script.py standing in for the actual program:
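```bash
#!/bin/bash
# Launch the MPI program; with SGE's OpenMPI integration, mpirun
# picks up the slots allocated by the orte parallel environment.
mpirun python my_mpi_script.py
```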
When running this script, user interaction is still required at one point. When SSHing into the master node for the first time, SSH will ask to confirm the identity of the machine:
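It looks like the standard first-connection prompt:

```
The authenticity of host 'ec2-XX-XX-XXX-XXX.compute-1.amazonaws.com (XX.XX.XXX.XXX)' can't be established.
RSA key fingerprint is ...
Are you sure you want to continue connecting (yes/no)?
```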
If you want to skip that, you can add the following to your ~/.ssh/config file:
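For example, to disable host key checking for EC2 hosts only (note that this gives up SSH’s protection against man-in-the-middle attacks for those hosts):

```
Host *.compute-1.amazonaws.com
    StrictHostKeyChecking no
    UserKnownHostsFile /dev/null
```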