Since building the cluster I have been trying to find cool things to do with it. One cool project is Hadoop.
From Wikipedia: "Apache Hadoop is a Free Java software framework that supports data intensive distributed applications running on large clusters of commodity computers. It enables applications to easily scale out to thousands of nodes and petabytes of data. Hadoop was inspired by Google's MapReduce and Google File System (GFS) papers."
The MapReduce methodology has really caught on; Yahoo now uses it as well, and actually uses Hadoop specifically. Basically, it can be used anywhere you have a large dataset that needs to be processed but can't be handled by a single machine.
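To make the MapReduce idea concrete, here is a toy, single-machine sketch of a wordcount in Python; it is not Hadoop code, just the shape of the model: a map step emits (word, 1) pairs, a shuffle groups pairs by key, and a reduce step sums the counts. Hadoop does the same thing, only distributed over the cluster.

```python
# Toy illustration of the MapReduce model behind Hadoop (not Hadoop code).
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in the input
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Sorting puts identical keys next to each other (the "shuffle"),
    # then each group of pairs is reduced to a single (word, total) pair.
    for word, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

counts = dict(reduce_phase(map_phase(["hello world", "hello hadoop"])))
print(counts)  # {'hadoop': 1, 'hello': 2, 'world': 1}
```

The win is that both phases are embarrassingly parallel: map tasks run on different chunks of input on different nodes, and reduce tasks each own a subset of the keys.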
Setting Hadoop up wasn't a problem; the Hadoop Wiki and this site had tons of good info. Basically, the idea is that you have master and slave nodes: the master(s) coordinate the slaves, and the slaves do all the work. You also have to create an HDFS filesystem for Hadoop to use. Once you have Hadoop set up on all the nodes, configured your hadoop-site.xml, and created your HDFS filesystem, you should be ready to start up your cluster. Needless to say, there are many small details that I am not mentioning; the docs above do a pretty good job of outlining those, so I won't repeat them here.
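For reference, a minimal hadoop-site.xml from this era looks something like the sketch below. The hostname and ports are placeholders, not my actual config; the docs above cover the full set of properties.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Where HDFS lives; 'master' is a placeholder hostname -->
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master:54310</value>
  </property>
  <!-- The JobTracker that coordinates MapReduce jobs -->
  <property>
    <name>mapred.job.tracker</name>
    <value>master:54311</value>
  </property>
  <!-- How many copies of each block HDFS keeps -->
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>
```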
There are a few things I did differently than the docs. First, I installed Hadoop on a shared partition that all my nodes can access, which should make upgrades fairly simple. My 'conf' directory is symlinked in, as is the version that I am currently using:
[cluster@front hadoop]$ pwd
[cluster@front hadoop]$ ls -la
drwxrwxr-x 7 cluster cluster 4096 Apr 22 13:39 .
drwxrwxr-x 8 cluster cluster 4096 Apr 21 11:57 ..
drwxr-xr-x 2 cluster cluster 4096 Apr 22 13:40 conf
lrwxrwxrwx 1 cluster cluster 13 Apr 22 11:46 current -> hadoop-0.16.3
drwxrwxr-x 4 cluster cluster 4096 Apr 22 13:45 data
drwxr-xr-x 12 cluster cluster 4096 Apr 21 12:08 hadoop-0.15.3
drwxr-xr-x 12 cluster cluster 4096 Apr 22 13:16 hadoop-0.16.3
drwxrwxr-x 2 cluster cluster 4096 Apr 21 14:06 input
[cluster@front hadoop]$ echo $HADOOP_HOME
[cluster@front hadoop]$ ls -al current/conf
lrwxrwxrwx 1 cluster cluster 8 Apr 22 11:46 current/conf -> ../conf/
I also created '/opt/share/hadoop/data' and '/opt/share/hadoop/input'. 'data' is my HDFS store; 'input' is where I stage input files before they are copied into HDFS.
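If you want to reproduce this layout, the symlink setup boils down to the sketch below. I use /opt/share/hadoop; here $PWD/hadoop-demo stands in so the commands are safe to try anywhere, and the release is assumed to be already unpacked.

```shell
# Sketch of the shared-install layout; every node mounts the shared
# partition, so one copy of Hadoop serves the whole cluster.
root=$PWD/hadoop-demo
mkdir -p "$root"/conf "$root"/data "$root"/input "$root"/hadoop-0.16.3
cd "$root"
# each release symlinks its conf back to the one shared conf directory
ln -s ../conf/ hadoop-0.16.3/conf
# 'current' points at whichever release is live; an upgrade is just
# unpacking the new tarball and repointing this one symlink
ln -sfn hadoop-0.16.3 current
readlink current
```

With HADOOP_HOME pointed at 'current', nothing else has to change when a new version is dropped in.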
I also modified my hadoop-metrics.properties configuration to turn on Ganglia monitoring. I'll admit that I haven't gotten it working properly with Ganglia yet, and the documentation is fairly sparse. If you have any suggestions on how to do this outside of what's on this page, let me know.
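For reference, the relevant hadoop-metrics.properties lines look something like this; the gmond host and port are placeholders, and this is what I'm attempting, not a config I've verified working:

```properties
# Send DFS, MapReduce, and JVM metrics to Ganglia's gmond.
# 'gmond-host:8649' is a placeholder; point it at your gmond UDP channel.
dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext
dfs.period=10
dfs.servers=gmond-host:8649

mapred.class=org.apache.hadoop.metrics.ganglia.GangliaContext
mapred.period=10
mapred.servers=gmond-host:8649

jvm.class=org.apache.hadoop.metrics.ganglia.GangliaContext
jvm.period=10
jvm.servers=gmond-host:8649
```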
Once everything is rolling, you should be able to run 'start-all.sh' to start up the Hadoop daemons on all nodes. From there you can submit jobs; below is the wordcount example:
[cluster@front hadoop]$ /opt/share/hadoop/current/bin/hadoop jar /opt/share/hadoop/current/hadoop-0.16.3-examples.jar wordcount input output-wordcount6
08/04/22 14:18:27 INFO mapred.FileInputFormat: Total input paths to process : 6
08/04/22 14:18:28 INFO mapred.JobClient: Running job: job_200804221413_0001
08/04/22 14:18:29 INFO mapred.JobClient: map 0% reduce 0%
08/04/22 14:18:42 INFO mapred.JobClient: map 21% reduce 0%
08/04/22 14:18:44 INFO mapred.JobClient: map 40% reduce 0%
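Note that the 'input' path in that command is a path inside HDFS, not the shared partition. Getting data in and results out looks roughly like this; the paths are assumed from my layout above, and part-00000 is just the usual name of the first reduce output file:

```shell
# Copy the staged files from the shared partition into HDFS
/opt/share/hadoop/current/bin/hadoop dfs -copyFromLocal /opt/share/hadoop/input input
# After the job finishes, inspect the results
/opt/share/hadoop/current/bin/hadoop dfs -ls output-wordcount6
/opt/share/hadoop/current/bin/hadoop dfs -cat output-wordcount6/part-00000
```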
You can check on your jobs and see how things are running by hitting 'localhost:50070' (the NameNode status page) and 'localhost:50030' (the JobTracker status page) in a browser while Hadoop is running.
Lastly, I also created jobs to run Hadoop under Sun Grid Engine, which I wrote about installing with Unicluster in this post. This worked like a charm; the normal q* commands and so on work like you would expect, and the jobs run properly.
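A submit script for this looks something like the sketch below; the job name, log file, and output path are placeholders, and it assumes the Hadoop daemons are already running on the cluster:

```shell
#!/bin/sh
# Sketch of an SGE submit script for a Hadoop job (submit with 'qsub').
#$ -N hadoop-wordcount
#$ -cwd
#$ -j y
#$ -o wordcount.log
/opt/share/hadoop/current/bin/hadoop jar \
    /opt/share/hadoop/current/hadoop-0.16.3-examples.jar \
    wordcount input output-wordcount-sge
```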
Hadoop is a pretty sweet utility; it's no surprise that large internet search companies use it to their advantage. We will likely see a lot more of Hadoop and the MapReduce framework in the future.
Edit: Check out http://www.joeandmotorboat.com/2008/07/28/more-on-hadoop-metrics-in-ganglia/ for more info on Ganglia and Hadoop metrics.