Setting up Python and MPI on EC2
In a previous post, I described how to set up a cluster using the Amazon EC2 service and the StarCluster package. In this post I will describe getting Python code up and running with MPI on a StarCluster cluster instance on Amazon EC2.
####Starting up the cluster
First, we need to start our cluster instance. This can easily be done with StarCluster by typing the following command:
starcluster start mycluster
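This assumes a cluster template named mycluster is already defined in ~/.starcluster/config, as described in the previous post. As a reminder, a minimal template looks something like this (the key name, AMI ID, and instance type here are placeholders):
[cluster mycluster]
KEYNAME = mykey
CLUSTER_SIZE = 2
CLUSTER_USER = sgeadmin
NODE_IMAGE_ID = ami-xxxxxxxx
NODE_INSTANCE_TYPE = m1.small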
To SSH into the master node of the cluster:
starcluster sshmaster mycluster
The SSH port on my machine is different from the default port, which caused some problems with StarCluster:
starcluster sshmaster mycluster
StarCluster - (http://star.mit.edu/cluster) (v. 0.95.6)
Software Tools for Academics and Researchers (STAR)
Please submit bug reports to starcluster@mit.edu
ssh: connect to host ec2-25-3-89-123.compute-1.amazonaws.com port 2234: Operation timed out
One solution is to add an entry in ~/.ssh/config to specify the port for the master node:
Host ec2-25-3-89-123.compute-1.amazonaws.com
    Hostname ec2-25-3-89-123.compute-1.amazonaws.com
    Port 22
But this means adding a new entry every time a new cluster is started! That’s not an acceptable solution.
I ended up changing the SSH port in /etc/services back to 22, and then modifying the /System/Library/LaunchDaemons/ssh.plist file so that sshd listens on a port other than 22.
Here’s what that looks like in OS X Yosemite:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Disabled</key>
    <true/>
    <key>Label</key>
    <string>com.openssh.sshd</string>
    <key>Program</key>
    <string>/usr/libexec/sshd-keygen-wrapper</string>
    <key>ProgramArguments</key>
    <array>
        <string>/usr/sbin/sshd</string>
        <string>-i</string>
    </array>
    <key>Sockets</key>
    <dict>
        <key>Listeners</key>
        <dict>
            <key>SockServiceName</key>
            <string>2123</string>
            <key>Bonjour</key>
            <array>
                <string>ssh</string>
                <string>2123</string>
            </array>
        </dict>
    </dict>
    <key>inetdCompatibility</key>
    <dict>
        <key>Wait</key>
        <false/>
    </dict>
    <key>StandardErrorPath</key>
    <string>/dev/null</string>
    <key>SHAuthorizationRight</key>
    <string>system.preferences</string>
    <key>POSIXSpawnType</key>
    <string>Interactive</string>
</dict>
</plist>
Changing the port number to 2123 in ssh.plist means that my computer only allows incoming SSH connections on port 2123, while the default port when connecting to other machines is still 22. Perfect!
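Note that editing ssh.plist alone is not enough; the sshd launch daemon has to be reloaded before the new port takes effect. Something like the following should do it:
sudo launchctl unload /System/Library/LaunchDaemons/ssh.plist
sudo launchctl load /System/Library/LaunchDaemons/ssh.plist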
####Setting up Python
Next, we want to get the proper version of Python installed on our cluster. The instances that StarCluster launches by default come with a ton of useful tools already built in. Immediately after SSHing into the master node, the available tools are listed:
starcluster sshmaster mycluster
StarCluster - (http://star.mit.edu/cluster) (v. 0.95.6)
Software Tools for Academics and Researchers (STAR)
Please submit bug reports to starcluster@mit.edu
The authenticity of host 'ec2-25-3-89-123.compute-1.amazonaws.com (25.3.89.123)' can't be established.
RSA key fingerprint is ab:ef:d8:2f:3c:78:b3:a2:a2:2c:4d:8e:3f:7e:2a:8c.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'ec2-25-3-89-123.compute-1.amazonaws.com,25.3.89.123' (RSA) to the list of known hosts.
_ _ _
__/\_____| |_ __ _ _ __ ___| |_ _ ___| |_ ___ _ __
\ / __| __/ _` | '__/ __| | | | / __| __/ _ \ '__|
/_ _\__ \ || (_| | | | (__| | |_| \__ \ || __/ |
\/ |___/\__\__,_|_| \___|_|\__,_|___/\__\___|_|
StarCluster Ubuntu 13.04 AMI
Software Tools for Academics and Researchers (STAR)
Homepage: http://star.mit.edu/cluster
Documentation: http://star.mit.edu/cluster/docs/latest
Code: https://github.com/jtriley/StarCluster
Mailing list: http://star.mit.edu/cluster/mailinglist.html
This AMI Contains:
* NVIDIA Driver 331.38
* NVIDIA CUDA Toolkit 5.5.22
* PyCuda 2013.1.1 and PyOpenCL 2013.2
* MAGMA 1.4.1
* Intel Ethernet Driver 2.11.3 (ixgbevf)
* Open Grid Scheduler (OGS - formerly SGE) queuing system
* Condor workload management system
* OpenMPI compiled with Open Grid Scheduler support
* OpenBLAS - Highly optimized Basic Linear Algebra Routines
* NumPy/SciPy linked against OpenBlas
* Pandas - Data Analysis Library
* IPython 1.1.0 with parallel and notebook support
* Julia 0.3pre
* and more! (use 'dpkg -l' to show all installed packages)
Open Grid Scheduler/Condor cheat sheet:
* qstat/condor_q - show status of batch jobs
* qhost/condor_status - show status of hosts, queues, and jobs
* qsub/condor_submit - submit batch jobs (e.g. qsub -cwd ./job.sh)
* qdel/condor_rm - delete batch jobs (e.g. qdel 7)
* qconf - configure Open Grid Scheduler system
Current System Stats:
System load: 0.0 Processes: 94
Usage of /: 63.3% of 7.74GB Users logged in: 0
Memory usage: 11% IP address for eth0: 123.45.6.78
Swap usage: 0%
Graph this data and manage this system at https://landscape.canonical.com/
root@master:~#
This is really great, but a lot of the versions listed here are dated, and the APIs have changed considerably. One option is to upgrade some of these libraries with pip:
pip install pandas --upgrade
pip install numexpr --upgrade
One downside of simply upgrading the dependencies is that it takes a while: it took me 10 minutes and 7 seconds to start up a cluster and upgrade numpy and pandas. If you do not want to wait ten minutes every time you start up a cluster, another option is to use miniconda, a lightweight distribution that contains only conda and Python, and then install just the dependencies you need. I used miniconda when setting up travis-ci, which was covered briefly in this post.
When I switched from the version of Python that comes built into the StarCluster AMI to miniconda, the total time to start up a cluster dropped from 10m7s to 4m57s. Switching to miniconda is not difficult; just run the following three lines once your cluster is started up and you have SSHed into the master node:
cd /home/sgeadmin; wget http://repo.continuum.io/miniconda/Miniconda-3.9.1-Linux-x86_64.sh -O miniconda.sh; bash miniconda.sh -b -p /home/sgeadmin/miniconda
echo "export PATH=/home/sgeadmin/miniconda/bin:\$PATH" >> .bashrc; hash -r
export PATH=/home/sgeadmin/miniconda/bin:$PATH; conda config --set always_yes yes --set changeps1 no
And then installing numpy and pandas is much faster than upgrading the versions that ship with the AMI:
conda install numpy
conda install pandas
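To double-check that the miniconda Python, and not the system one, is the version being picked up, a quick sanity check (the version numbers you see will depend on what conda installed):
which python
python -c "import numpy; print(numpy.__version__)"
python -c "import pandas; print(pandas.__version__)"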
I created a StarCluster startup script that sets up my environment, uploads all my code and input data to the cluster, updates and installs all Python dependencies, and then submits the MPI job. See the end of this post for the script.
####Running MPI on the cluster
Looking at the StarCluster “Compile and run a ‘Hello World’ OpenMPI program” example, it looks like we need to run things as sgeadmin in order to use MPI. Indeed, when I tried running things as root, I got an error message saying it couldn't access things in the root directory:
qrsh_starter: cannot change to directory /root/projects
The /root directory only exists on the master node. We need to run our code from somewhere under /home, since that is the directory that is NFS-mounted on all the nodes. This may also mean that upgrading pandas and numexpr only worked on the master node.
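Before submitting a real job, it's worth verifying that MPI works from the NFS-mounted home directory. Here's a minimal sketch of such a test, run as sgeadmin; it assumes mpi4py is installed (it isn't listed in the AMI contents above, so a pip install mpi4py may be needed first):
su - sgeadmin
cd /home/sgeadmin
# Write a tiny MPI program: each process reports its rank.
cat > mpi_hello.py << 'EOF'
from mpi4py import MPI
comm = MPI.COMM_WORLD
print("Hello from rank %d of %d" % (comm.Get_rank(), comm.Get_size()))
EOF
# Wrap it in a job script and submit it on 4 slots.
echo 'mpiexec python mpi_hello.py' > mpi_hello.sh
qsub -cwd -pe orte 4 ./mpi_hello.sh
qstat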
####Putting it all together
I made a script that starts up the cluster, copies over all of the code and input data I want to use, and then submits an MPI job. Here's what that script looks like:
#!/usr/bin/env bash
echo "Starting cluster..."
starcluster start mycluster
echo "Now upgrading pandas and numexpr..."
#upgrade numpy and pandas:
starcluster sshmaster mycluster 'pip install pandas --upgrade'
starcluster sshmaster mycluster 'pip install numexpr --upgrade'
#install dependencies:
starcluster sshmaster mycluster 'pip install emcee'
#append my projects directory to the Python path:
starcluster sshmaster mycluster 'echo " " >> .bashrc'
starcluster sshmaster mycluster 'echo "#Add my projects dir to python path:" >> .bashrc'
starcluster sshmaster mycluster 'echo "export PYTHONPATH=/root/projects:\$PYTHONPATH" >> .bashrc'
#add directories and copy over code:
echo "Creating projects directory..."
starcluster sshmaster mycluster mkdir projects
echo "Creating MOST directory..."
starcluster sshmaster mycluster mkdir projects/MOST
echo "Copying over module files..."
starcluster put mycluster /home/matt/projects/MOST/__init__.py projects/MOST/
starcluster put mycluster /home/matt/projects/MOST/setup.py projects/MOST/
echo "Creating code directory..."
starcluster sshmaster mycluster mkdir projects/MOST/code
echo "Copying over the code..."
starcluster put mycluster /home/matt/projects/MOST/code projects/MOST/
#now add the data directories and input data:
echo "Creating data directories..."
starcluster sshmaster mycluster mkdir projects/MOST/data
starcluster sshmaster mycluster mkdir projects/MOST/data/chiron
starcluster sshmaster mycluster mkdir projects/MOST/data/most
starcluster sshmaster mycluster mkdir projects/MOST/data/MCMC
echo "Copying over CHIRON data..."
starcluster put mycluster /home/matt/projects/MOST/data/chiron/epsEriChironSepDec2014.txt projects/MOST/data/chiron/
echo "Copying over MOST data..."
starcluster put mycluster /home/matt/projects/MOST/data/most/epsEriMostFullRedResamp.txt projects/MOST/data/most/
#now run the job
starcluster sshmaster mycluster 'cd projects/MOST/code; qsub -cwd -pe orte 4 ./evol_starcluster_qsub_test.sh'
starcluster sshmaster mycluster 'qstat'
In the line that runs the job, I change into the directory with my code and then call qsub to execute my routine. The -cwd option tells qsub to run the job from the current working directory, and -pe orte 4 tells it to use the orte parallel environment with 4 slots. Since the AMI's OpenMPI is compiled with Open Grid Scheduler support, mpiexec picks up the number of allocated slots automatically, so no -n flag is needed. The contents of evol_starcluster_qsub_test.sh are just the program to be executed with MPI.
Contents of evol_starcluster_qsub_test.sh:
#!/bin/bash
mpiexec python eeTwoSptParTmpDfRtEvol.py 3 90 100 --thin 2
When running this script, user interaction is still required at one point. When SSHing into the master node for the first time, SSH will ask to confirm the identity of the machine:
The authenticity of host 'ec2-22-3-444-55.compute-1.amazonaws.com (22.3.44.55)' can't be established.
RSA key fingerprint is 12:34:56:78:90:aa:bb:cc:dd:ee:ff:11:22:33:44.
Are you sure you want to continue connecting (yes/no)? yes
If you want to skip that, you can add the following to your ~/.ssh/config file. Scoping it under a Host entry keeps host-key checking enabled for everything other than EC2 hosts:
Host *.compute-1.amazonaws.com
    StrictHostKeyChecking no