So, the last couple months I have been working on a project in clustering and high performance computing
. I haven't really dealt with much of it in my day job outside of clustering J2EE containers like Tomcat and Resin but that sort of clustering is something entirely different for the most part. What I wanted to build was something that would be able to perform parallel jobs and solve hard math, like something you might find at a university or government lab. TACC
is a great example or something you might find on Top500
. Obliviously something like TACC with 3,936 nodes and 123TB of memory isn't something I can afford to buy. So I decided to piece it together with cheap hardware from ebay. I first found a lot of three IBM Netvista's each with 512MB of RAM and a 866MHz Pentium 3 CPU. Then I picked up a Nortel Baystack 450-24T switch, I wanted something I could manage if needed and decent throughput, better than a Linksys or Netgear off the shelf at BestBuy. Lastly I needed a machine to become the front/main node that will be leader of the cluster. For this I found a Compaq EVO, 1.7GHz Pentium 4 with 512MB of RAM. After a few weeks I had all my hardware.
Next was to choose the software I wanted to run. I tried a few different setups, the first was PelicanHPC
. This clustering distro is based on Debian and works pretty well. It is completely CD based and there is very little setup involved, everything just boots over the network or off the CD. While quick and easy I shied away from using it for just that reason, I wanted something that is installed and permanent.
Next in line was Rocks
. Rocks is Centos based and uses Kickstart to distribute the OS to all the nodes of the cluster. It also has "rolls" that include all sorts of HPC software like Ganglia, Sun Grid Engine and Condor. Seemed like a nice all-in-one solution. I didn't have much luck with it due to only having 512MB in my main node, it needs at least 1GB to load up the image to install. I tried it on my laptop that has 4GB in it and it worked nicely but changing out hard drives in my laptop to play with the cluster was not something I wanted to do on a daily basis.
My next trial was using Centos 4 and installing OSCAR
. OSCAR has a decent install process and number of packages "built-in" like MPI and Ganglia. OSCAR also hasn't been updated in a couple years so it requires an older distro to work, like Centos 4 and Fedora Core 5. This isn't really a problem (I happen to like Centos 4 quite a bit) but as you might expect during the install I ran into a number of weird package dependency issues but was able to correct them. Once I got it installed on my main node I began to install it over the network to the other nodes, a process that should have been trivial I was unable to get working properly. So with that and the annoying package dependencies I moved on.
My final trial was using Centos 4 with Univa UD
's Unicluster Express 3.2. This includes Grid Engine, Globus, a dead simple install process and Ganglia (including Univa UD's own monitoring client). Once installed on the main node I began on the rest of them their installation script made this very easy. Once done I was able to attempt to submit jobs, I first attempted using one of their example jobs which basically just runs the 'date' command on the node that the job is sent to:
It worked! Grid Engine sent the job to one of the nodes and the node completed it. Next I wanted to try something a bit more intensive, I tried their 'worker.sh' example and it worked without an issue as well. Next I wanted to find out how to submit parallel jobs, after a few trial and error attempts using the wrong software I installed MPICH2
in a shared partition on the cluster, set the path and started it up and tested it out:
mpdboot -n 3 -f hostfile
mpdtrace -l -n 3 hostname
It worked without a hitch. I now needed to setup the queue to be 'batch interactive parallel' so I can submit parallel jobs and have then run immediately.
qconf -mattr queue qtype "BATCH INTERACTIVE" all.q
I also went ahead and wrote a script for the job I wanted to run which is the pi example included with MPICH.
#$ -N mpitest
#$ -pe mpi 3
#$ -now y
#$ -j y
/opt/share/mpich2/bin/mpiexec -l -n 1 -host front.esper /opt/share/mpich2/bin/cpi : -n 100 -host node01.esper /opt/share/mpich2/bin/cpi : -n 100 -host node02.esper /opt/share/mpich2/bin/cpi : -n 100 -host node03.esper /opt/share/mpich2/bin/cpi
When this script is submitted to the cluster using SGE 'qsub' it will run 100 processes on all the nodes and 1 on the main node. Since this is a parallel job they all work together at finding the result. I was pretty excited to find that it produced a result. My next project was to get linpack
running to test the performance of the cluster. After a bit of playing with the Make script and installing atlas and fortran I was able to have it compile. I kicked off a test in the same way I did in the pi script above and scored 1.919Gflops. Nothing to write home about but it works and it's good enough for me.
0: T/V N NB P Q Time Gflops
0: WR00R2R4 35 4 4 1 0.02 1.919e-03
All in all I have been pretty happy with Unicluster and continue to work with it and figure out all the cool stuff you can do with Grid Engine. One issue I have ran into is the monitoring software that is included hooks into Ganglia but currently I am not seeing anything on the graphs. Luckily the guys at Univa UD run Grid.org
and are fairly active in the forums. They have helped me out a couple times now.
It's fairly easy to get a cluster running (just time consuming) but as I am now finding out it is more difficult to keep it rolling and make life easy on myself while running. Here are a few tips I can share from my experience:
- Shared partitions are key, I use shared home directories and a directory '/opt/share' mounted on each machine. I initially wasn't doing this and found it cumbersome to do anything that needed to be on each machine. Have applications and libraries that need to be installed on each machine? Install them on a shared partition and set the PATH to the appropriate location in your bashrc. This allows you to install and set it once for all machines.
- Use SSH keys without pass phrases, life is easier without passwords.
- Read the docs. MPICH2, Unicluster and SGE all have pretty good docs and decent examples to get you rolling.
- Understand how to configure SGE queues and how they work. Here's a starter.
- Script everything! Script your jobs, MPI startup and anything else you do on a regular basis.
Lastly, there are likely things I failed to mention in this article so if you have a question or there is something missing just comment on this post and I will do my best to answer.