In a previous post, I described how to set up a cluster using the Amazon EC2 service and the starcluster package. In this post I will describe getting Python code up and running with MPI on a starcluster cluster on Amazon EC2.

#### Starting up the cluster

First, we need to start our cluster instance. This can easily be done using starcluster by typing the following command:

starcluster start mycluster

To SSH into the master node of the cluster:

starcluster sshmaster mycluster

The SSH port on my machine is different from the default port, which caused some problems with starcluster:

starcluster sshmaster mycluster
StarCluster - (http://star.mit.edu/cluster) (v. 0.95.6)
Software Tools for Academics and Researchers (STAR)
Please submit bug reports to starcluster@mit.edu

ssh: connect to host ec2-25-3-89-123.compute-1.amazonaws.com port 2234: Operation timed out

One solution is to add an entry in ~/.ssh/config specifying the port for the master node:

Host ec2-25-3-89-123.compute-1.amazonaws.com
   Hostname ec2-25-3-89-123.compute-1.amazonaws.com
   Port 22

But this means adding a new entry every time a new cluster is started! That’s not an acceptable solution.

I ended up changing the SSH port in /etc/services back to 22, and then modifying /System/Library/LaunchDaemons/ssh.plist so that sshd listens on a port other than 22.

Here’s what that looks like in OS X Yosemite:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
	<key>Disabled</key>
	<true/>
	<key>Label</key>
	<string>com.openssh.sshd</string>
	<key>Program</key>
	<string>/usr/libexec/sshd-keygen-wrapper</string>
	<key>ProgramArguments</key>
	<array>
		<string>/usr/sbin/sshd</string>
		<string>-i</string>
	</array>
	<key>Sockets</key>
	<dict>
		<key>Listeners</key>
		<dict>
			<key>SockServiceName</key>
			<string>2123</string>
			<key>Bonjour</key>
			<array>
				<string>ssh</string>
				<string>2123</string>
			</array>
		</dict>
	</dict>
	<key>inetdCompatibility</key>
	<dict>
		<key>Wait</key>
		<false/>
	</dict>
	<key>StandardErrorPath</key>
	<string>/dev/null</string>
	<key>SHAuthorizationRight</key>
	<string>system.preferences</string>
	<key>POSIXSpawnType</key>
	<string>Interactive</string>
</dict>
</plist>

Changing the port number to 2123 in ssh.plist means my computer only accepts incoming SSH connections on port 2123, while the default port when connecting to other machines is still 22. Perfect!
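
One thing to keep in mind (double-check this on your own system): the edit to ssh.plist does not take effect until launchd reloads sshd. Toggling Remote Login off and on in System Preferences > Sharing does the trick, or, as a rough sketch from the terminal:

# reload the sshd launch daemon so the new port takes effect
sudo launchctl unload /System/Library/LaunchDaemons/ssh.plist
sudo launchctl load -w /System/Library/LaunchDaemons/ssh.plist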

#### Setting up python

Next, we want to get the proper version of python installed on our cluster. The instances that come with starcluster by default have a ton of useful tools already built in. Immediately after SSHing into the master node, the available tools are listed:

starcluster sshmaster mycluster
StarCluster - (http://star.mit.edu/cluster) (v. 0.95.6)
Software Tools for Academics and Researchers (STAR)
Please submit bug reports to starcluster@mit.edu

The authenticity of host 'ec2-25-3-89-123.compute-1.amazonaws.com (25.3.89.123)' can't be established.
RSA key fingerprint is ab:ef:d8:2f:3c:78:b3:a2:a2:2c:4d:8e:3f:7e:2a:8c.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'ec2-25-3-89-123.compute-1.amazonaws.com,25.3.89.123' (RSA) to the list of known hosts.
          _                 _           _
__/\_____| |_ __ _ _ __ ___| |_   _ ___| |_ ___ _ __
\    / __| __/ _` | '__/ __| | | | / __| __/ _ \ '__|
/_  _\__ \ || (_| | | | (__| | |_| \__ \ ||  __/ |
  \/ |___/\__\__,_|_|  \___|_|\__,_|___/\__\___|_|

StarCluster Ubuntu 13.04 AMI
Software Tools for Academics and Researchers (STAR)
Homepage: http://star.mit.edu/cluster
Documentation: http://star.mit.edu/cluster/docs/latest
Code: https://github.com/jtriley/StarCluster
Mailing list: http://star.mit.edu/cluster/mailinglist.html

This AMI Contains:

  * NVIDIA Driver 331.38
  * NVIDIA CUDA Toolkit 5.5.22
  * PyCuda 2013.1.1 and PyOpenCL 2013.2
  * MAGMA 1.4.1
  * Intel Ethernet Driver 2.11.3 (ixgbevf)
  * Open Grid Scheduler (OGS - formerly SGE) queuing system
  * Condor workload management system
  * OpenMPI compiled with Open Grid Scheduler support
  * OpenBLAS - Highly optimized Basic Linear Algebra Routines
  * NumPy/SciPy linked against OpenBlas
  * Pandas - Data Analysis Library
  * IPython 1.1.0 with parallel and notebook support
  * Julia 0.3pre
  * and more! (use 'dpkg -l' to show all installed packages)

Open Grid Scheduler/Condor cheat sheet:

  * qstat/condor_q - show status of batch jobs
  * qhost/condor_status- show status of hosts, queues, and jobs
  * qsub/condor_submit - submit batch jobs (e.g. qsub -cwd ./job.sh)
  * qdel/condor_rm - delete batch jobs (e.g. qdel 7)
  * qconf - configure Open Grid Scheduler system

Current System Stats:

  System load:  0.0               Processes:           94
  Usage of /:   63.3% of 7.74GB   Users logged in:     0
  Memory usage: 11%               IP address for eth0: 123.45.6.78
  Swap usage:   0%

    https://landscape.canonical.com/
root@master:~#

This is really great, but a lot of the versions listed here are dated, and the APIs have changed considerably. One option is to upgrade some of these libraries with pip:

pip install pandas --upgrade
pip install numexpr --upgrade

One downside of simply upgrading the dependencies is that it takes a while: it took me 10 minutes and 7 seconds to start up a cluster and upgrade numpy and pandas. If you do not want to wait ten minutes every time you start up a cluster, another option is to use miniconda, a lightweight distribution that contains only conda and python, and then install just the necessary dependencies. I used miniconda when setting up travis-ci, which was covered briefly in this post.

When I switched from the version of python that comes built in with starcluster to miniconda, the total time to start up a cluster dropped from 10m7s to 4m57s. Switching to miniconda is not difficult; just type the following three lines once your cluster is started up and you have SSHed into the master node:

cd /home/sgeadmin; wget http://repo.continuum.io/miniconda/Miniconda-3.9.1-Linux-x86_64.sh -O miniconda.sh; bash miniconda.sh -b -p /home/sgeadmin/miniconda
echo "export PATH=/home/sgeadmin/miniconda/bin:\$PATH" >> .bashrc; hash -r
export PATH=/home/sgeadmin/miniconda/bin:$PATH; conda config --set always_yes yes --set changeps1 no
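
A quick sanity check (not strictly necessary, but cheap) is to confirm that the miniconda python is now the one on the PATH:

which python       # should point to /home/sgeadmin/miniconda/bin/python
python --version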

And then installing numpy and pandas is much faster than upgrading the versions that ship with starcluster:

conda install numpy
conda install pandas

I created a starcluster startup script that sets up my environment, uploads all my code and input data to the cluster, updates and installs all python dependencies, and then submits the MPI job. See the end of this post for the script.

#### Running MPI on the cluster

Looking at the starcluster "Compile and run a 'Hello World' OpenMPI program" example, it looks like we need to run things as sgeadmin in order to use MPI. Indeed, when I tried running things as root, I got an error message saying it couldn't access things in the root directory:

qrsh_starter: cannot change to directory /root/projects

The /root directory only exists on the master node. We need to run our code from somewhere under /home, since that is the directory that is NFS-mounted to all the nodes. This may also mean that upgrading pandas and numexpr only took effect on the master node.
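
Before wiring everything into a script, a quick way to check that MPI jobs really do run across the nodes is to submit a trivial job as sgeadmin from the shared home directory. This is just a sketch; the file name is arbitrary:

# run as sgeadmin from the NFS-shared home directory so every node can see the job
su - sgeadmin
cd /home/sgeadmin
# a one-line job script that just reports which host each MPI rank lands on
echo 'mpiexec hostname' > mpi_test.sh
qsub -cwd -pe orte 4 ./mpi_test.sh
qstat   # once it finishes, mpi_test.sh.o<jobid> should list one hostname per slot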

#### Putting it all together

I made a script that starts up the cluster, copies over all of the code and input data I want to use, and then starts an MPI job. Here's what that script looks like:

#!/usr/bin/env bash

echo "Starting cluster..."
starcluster start mycluster

echo "Now upgrading pandas and numexpr..."
#upgrade numpy and pandas:
starcluster sshmaster mycluster 'pip install pandas --upgrade'
starcluster sshmaster mycluster 'pip install numexpr --upgrade'

#install dependencies:
starcluster sshmaster mycluster 'pip install emcee'

#append my projects directory to the Python path:
starcluster sshmaster mycluster 'echo " " >> .bashrc'
starcluster sshmaster mycluster 'echo "#Add my projects dir to python path:" >> .bashrc'
starcluster sshmaster mycluster 'echo "export PYTHONPATH=/root/projects:\$PYTHONPATH" >> .bashrc'

#add directories and copy over code:
echo "Creating projects directory..."
starcluster sshmaster mycluster mkdir projects
echo "Creating MOST directory..."
starcluster sshmaster mycluster mkdir projects/MOST
echo "Copying over module files..."
starcluster put mycluster /home/matt/projects/MOST/__init__.py projects/MOST/
starcluster put mycluster /home/matt/projects/MOST/setup.py projects/MOST/
echo "Creating code directory..."
starcluster sshmaster mycluster mkdir projects/MOST/code
echo "Copying over the code..."
starcluster put mycluster /home/matt/projects/MOST/code projects/MOST/


#now add the data directories and input data:
echo "Creating data directories..."
starcluster sshmaster mycluster mkdir projects/MOST/data
starcluster sshmaster mycluster mkdir projects/MOST/data/chiron
starcluster sshmaster mycluster mkdir projects/MOST/data/most
starcluster sshmaster mycluster mkdir projects/MOST/data/MCMC
echo "Copying over CHIRON data..."
starcluster put mycluster /home/matt/projects/MOST/data/chiron/epsEriChironSepDec2014.txt projects/MOST/data/chiron/
echo "Copying over MOST data..."
starcluster put mycluster /home/matt/projects/MOST/data/most/epsEriMostFullRedResamp.txt projects/MOST/data/most/

#now run the job
starcluster sshmaster mycluster 'cd projects/MOST/code; qsub -cwd -pe orte 4 ./evol_starcluster_qsub_test.sh'

starcluster sshmaster mycluster 'qstat'

In the line that runs the job, I change into the directory with my code and then call qsub to execute my routine. The -cwd option tells qsub to run the job from the current working directory, and -pe orte 4 tells it to use the orte parallel environment with 4 slots. The contents of evol_starcluster_qsub_test.sh are just the mpiexec call that runs the program with MPI.

Contents of evol_starcluster_qsub_test.sh:

#!/bin/bash

mpiexec python eeTwoSptParTmpDfRtEvol.py 3 90 100 --thin 2
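
Once the job has been submitted with -cwd, Open Grid Scheduler writes the job's stdout and stderr to files named after the submit script in the working directory (something like evol_starcluster_qsub_test.sh.o<jobid>), so progress can be checked from the local machine with, for example:

# check the queue, then look at the job's output file (path assumes the layout above)
starcluster sshmaster mycluster 'qstat'
starcluster sshmaster mycluster 'cat projects/MOST/code/evol_starcluster_qsub_test.sh.o*'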

When running this script, user interaction is still required at one point. When SSHing into the master node for the first time, SSH will ask to confirm the identity of the machine:

The authenticity of host 'ec2-22-3-444-55.compute-1.amazonaws.com (22.3.44.55)' can't be established.
RSA key fingerprint is 12:34:56:78:90:aa:bb:cc:dd:ee:ff:11:22:33:44.
Are you sure you want to continue connecting (yes/no)? yes

If you want to skip that, you can add the following to your ~/.ssh/config file:

StrictHostKeyChecking no
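
Note that on its own that line turns off host key checking for every host you connect to. A slightly more conservative variant (my own tweak, not something from the starcluster docs) is to scope it to EC2 hostnames:

Host *.compute-1.amazonaws.com
   StrictHostKeyChecking no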