Lately I have been running quite a few jobs on a few different clusters:
the Yale Omega HPC cluster, an on-demand AWS EC2 cluster using the
starcluster package, and our exoplanets research group cluster.
There have been so many different runs that it is becoming difficult
to keep track of which code base corresponds to which results.
The problem is amplified by the delay of up to two days between
submitting a job on Omega and when it actually runs.
To help associate version history with job output, I
thought it would be nice to write the latest git commit id for a
file to the output data directory. Fortunately, I am not the only
one with this desire, and many people have worked on the
dulwich project, which is a robust library for handling
git data within Python.
####Retrieving commit data with Dulwich
I did not find dulwich exactly intuitive, but fortunately there is
a decent example in their github repo, and it only took a little
bit of hacking to make it retrieve my git data. Below is the code I
used to retrieve the most recent commit id for the executed file:
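
A minimal sketch of this kind of lookup, assuming Python 3 and a reasonably recent dulwich release, might look like the following. The helper names `repo_relative_path` and `latest_commit_for_file` are illustrative and not part of dulwich; the dulwich calls used are `Repo.discover`, `Repo.get_walker`, and the `commit.id` attribute of each walk entry.

```python
import os


def repo_relative_path(file_path, repo_root):
    """Return file_path relative to repo_root, as the bytes path dulwich expects."""
    rel = os.path.relpath(os.path.abspath(file_path), os.path.abspath(repo_root))
    # Git tree paths always use forward slashes, and dulwich works with bytes.
    return rel.replace(os.sep, "/").encode("utf-8")


def latest_commit_for_file(file_path):
    """Return the id of the newest commit touching file_path, or None.

    Hypothetical helper: assumes file_path lives inside a non-bare git
    working tree and that dulwich is installed (pip install dulwich).
    """
    # Imported here so the module still loads where dulwich is missing.
    from dulwich.repo import Repo

    # Walk up from the file's directory until a repository is found.
    repo = Repo.discover(os.path.dirname(os.path.abspath(file_path)))
    rel = repo_relative_path(file_path, repo.path)

    # Restrict the history walk to commits that changed this one path and
    # stop after the first (most recent) match.
    for entry in repo.get_walker(paths=[rel], max_entries=1):
        return entry.commit.id.decode("ascii")
    return None
```

Calling `print(latest_commit_for_file(__file__))` near the top of a job script would then print the commit id for the file being executed.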
The above code prints to screen the most recent commit id for the
executed file. I also want to write the most recent commit ids for
both the executed file and the whole repository to the output
directory where all the output data go:
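
One way to sketch this step, assuming the two commit ids have already been obtained as strings (for the whole repository, dulwich's `Repo.head()` returns the id that HEAD points to, as bytes). The function names and the three-line layout of the file are assumptions made for illustration; only the file name `git_info.txt` comes from this post.

```python
import os


def format_git_info(script_name, file_commit, repo_commit):
    """Build the text stored alongside the job output (illustrative layout)."""
    return (
        "script: {}\n"
        "file commit: {}\n"
        "repository commit: {}\n"
    ).format(script_name, file_commit, repo_commit)


def write_git_info(output_dir, script_name, file_commit, repo_commit):
    """Write git_info.txt into output_dir and return the path to it."""
    # Create the output directory if the job has not made it yet.
    os.makedirs(output_dir, exist_ok=True)
    path = os.path.join(output_dir, "git_info.txt")
    with open(path, "w") as handle:
        handle.write(format_git_info(script_name, file_commit, repo_commit))
    return path
```

Keeping the formatting separate from the file write means the same text can also be printed at run time, so the screen output and the saved file stay in agreement.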
####The Result
The code prints the following line at run time
and makes the file git_info.txt, which contains the following: