Pig is a query language for use with Hadoop. It lets users query data stored in Hadoop much as they would query a SQL database. Formally, according to the Pig website:
Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turns enables them to handle very large data sets.
To get rolling you need the following:
- A Java SDK Installed
- Ant Installed
- A working installation of Hadoop
Once those items are in place we can install Pig and test it out.
First, you need to check out Pig from its Subversion repository using the command below. Once that is done you will need to build it with Ant.
svn co http://svn.apache.org/repos/asf/incubator/pig/trunk pig-svn
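The build step itself is not spelled out above, so here is a rough sketch of it. It assumes the checkout landed in pig-svn and that Ant's default target leaves pig.jar in that directory, which is where the java commands in this post expect it. The function name build_pig and the ANT override variable are made up for illustration.

```shell
# Hypothetical helper: build Pig inside an existing checkout directory.
# Assumes Ant's default target leaves pig.jar in the checkout directory.
build_pig() {
    checkout_dir="${1:-pig-svn}"
    if [ ! -f "$checkout_dir/build.xml" ]; then
        echo "no build.xml in $checkout_dir; run the svn checkout first" >&2
        return 1
    fi
    # Run Ant in a subshell so the caller's working directory is unchanged.
    # ANT is an illustrative override; it defaults to the ant on your PATH.
    (cd "$checkout_dir" && "${ANT:-ant}") || return 1
    # Succeed only if the jar the later java commands rely on now exists.
    [ -f "$checkout_dir/pig.jar" ]
}

# Usage:
#   build_pig pig-svn && echo "pig.jar is ready"
```

Checking for pig.jar at the end means a build that finishes without producing the jar is reported as a failure rather than surfacing later as a classpath error.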
From there you can run the following command to drop into the interactive shell.
java -cp pig.jar:HADOOPSITEPATH org.apache.pig.Main
Or you can run a Pig script that you have already created by passing it as an argument to the same main class.
java -cp pig.jar:HADOOPSITEPATH org.apache.pig.Main somescript.pig
HADOOPSITEPATH needs to point to the directory that contains the hadoop-site.xml file.
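A wrong HADOOPSITEPATH is an easy mistake to make, so a small check before launching can save some head scratching. The sketch below only prints the java command once it has confirmed that hadoop-site.xml is in the directory you give it. The function name pig_cmd and the example path are assumptions for illustration.

```shell
# Hypothetical helper: print the java command for Pig after checking
# that the given directory really contains hadoop-site.xml.
pig_cmd() {
    site_dir="$1"
    shift
    if [ ! -f "$site_dir/hadoop-site.xml" ]; then
        echo "hadoop-site.xml not found in $site_dir" >&2
        return 1
    fi
    # Any remaining arguments (for example a script name) are appended.
    echo "java -cp pig.jar:$site_dir org.apache.pig.Main${*:+ $*}"
}

# Usage (the path is an example; use your own Hadoop config directory):
#   pig_cmd /path/to/hadoop/conf                  # interactive shell
#   pig_cmd /path/to/hadoop/conf somescript.pig   # run a script
```

Printing the command instead of running it lets you look it over first and then paste it into the terminal from the directory that holds pig.jar.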
If you run into an issue such as:
Caused by: org.apache.hadoop.ipc.RPC$VersionMismatch: Protocol org.apache.hadoop.dfs.ClientProtocol version mismatch. (client = 29, server = 23)
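The client here is Pig and the server is your Hadoop cluster. Since the client's protocol version (29) is newer than the server's (23), you will need to upgrade Hadoop so the versions match.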
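If the error is buried in a long stack trace, something like the following can pull out the two version numbers. This is only a sketch that assumes the message has the same shape as the line quoted above; the function name mismatch_versions is made up.

```shell
# Hypothetical helper: read log text on stdin and print the client and
# server protocol versions from any VersionMismatch line it finds.
mismatch_versions() {
    sed -n 's/.*VersionMismatch.*(client = \([0-9]*\), server = \([0-9]*\)).*/client=\1 server=\2/p'
}

# Usage (pig.log is an example file holding the saved output):
#   mismatch_versions < pig.log
```

Lines that do not contain a VersionMismatch message print nothing, so empty output means this particular error was not in the log.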
In the end you should get something that looks like this:
[cluster@front pig-svn]$ java -cp pig.jar:HADOOPSITEPATH org.apache.pig.Main
2008-05-23 10:37:42,478 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: front.esper:9000
2008-05-23 10:37:42,585 [main] WARN org.apache.hadoop.fs.FileSystem - "front.esper:9000" is a deprecated filesystem name. Use "hdfs://front.esper:9000/" instead.
2008-05-23 10:37:43,117 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: front.esper:9001
2008-05-23 10:37:43,246 [main] WARN org.apache.hadoop.fs.FileSystem - "front.esper:9000" is a deprecated filesystem name. Use "hdfs://front.esper:9000/" instead.
If you need more info on the above steps, check out the Pig Wiki.
From here you can follow their tutorial or play around in the shell. Regarding the tutorial, I can't seem to find the download of the archive they mention, "Pig tutorial file (*.gz)". If anyone knows where that can be found, let me know and I will post it.