Pig is a query language for use with Hadoop. It lets users query data stored in Hadoop much as they would query a SQL database. Formally, according to the project's website:
Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turns enables them to handle very large data sets.
To get rolling you need the following:

Once you have those items in place, we can install Pig and test it out. First, check out Pig from its Subversion repository. Once that is done, build it with Ant.
svn co pig-svn
cd pig-svn
ant
From there you can run the following command to drop into the interactive shell.
java -cp pig.jar:HADOOPSITEPATH org.apache.pig.Main
Or you can run a pig script that you have already created.
java -cp pig.jar:HADOOPSITEPATH org.apache.pig.Main somescript.pig
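If you do not have a script yet, here is a minimal sketch of one, written out from the shell. The helper name write_sample_script, the input file people.txt, and its two tab-separated fields are all assumptions made up for illustration; adjust them to whatever data you actually have in HDFS.

```shell
#!/bin/sh
# Hypothetical helper (not part of Pig): write a tiny Pig Latin script
# to the path given as the first argument.
write_sample_script() {
    cat > "$1" <<'EOF'
-- Load a tab-separated file (people.txt is an assumed example input).
people = LOAD 'people.txt' USING PigStorage('\t') AS (name, age);
-- Keep only the rows where age is greater than 30.
older = FILTER people BY age > 30;
-- Print the filtered rows to the console.
DUMP older;
EOF
}

# Create the script under the file name used in the command above.
write_sample_script somescript.pig
```

The LOAD / FILTER / DUMP sequence is roughly what a SELECT with a WHERE clause would be in SQL, which is the comparison made at the top of this post.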
HADOOPSITEPATH needs to point to the directory that contains the hadoop-site.xml file. If you run into an issue such as:
Caused by: org.apache.hadoop.ipc.RPC$VersionMismatch: Protocol org.apache.hadoop.dfs.ClientProtocol version mismatch. (client = 29, server = 23)
You will need to upgrade Hadoop so the versions match. In the end you should get something that looks like this:
[cluster@front pig-svn]$ java -cp pig.jar:HADOOPSITEPATH org.apache.pig.Main
2008-05-23 10:37:42,478 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: front.esper:9000
2008-05-23 10:37:42,585 [main] WARN org.apache.hadoop.fs.FileSystem - "front.esper:9000" is a deprecated filesystem name. Use "hdfs://front.esper:9000/" instead.
2008-05-23 10:37:43,117 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: front.esper:9001
2008-05-23 10:37:43,246 [main] WARN org.apache.hadoop.fs.FileSystem - "front.esper:9000" is a deprecated filesystem name. Use "hdfs://front.esper:9000/" instead.
grunt>
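If Pig does not connect to your cluster like this, one thing worth ruling out is a HADOOPSITEPATH that does not actually contain hadoop-site.xml. The following is a small sketch of a pre-flight check; check_hadoop_site is a hypothetical helper name of my own, not something shipped with Pig or Hadoop, and /path/to/hadoop/conf is a placeholder.

```shell
#!/bin/sh
# Hypothetical helper (not part of Pig or Hadoop): verify that the
# directory you plan to use as HADOOPSITEPATH contains hadoop-site.xml.
# Returns 0 if the file is there, 1 if it is missing, 2 on bad usage.
check_hadoop_site() {
    site_dir="$1"
    if [ -z "$site_dir" ]; then
        echo "usage: check_hadoop_site DIR" >&2
        return 2
    fi
    if [ ! -f "$site_dir/hadoop-site.xml" ]; then
        echo "no hadoop-site.xml found in $site_dir" >&2
        return 1
    fi
    return 0
}

# Example use (placeholder path): only start the interactive shell
# when the check passes.
# check_hadoop_site /path/to/hadoop/conf && \
#     java -cp pig.jar:/path/to/hadoop/conf org.apache.pig.Main
```

This only checks that the file exists; it does not validate its contents or catch the version mismatch described above.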
If you need more info on the above steps check out the Pig Wiki. From here you can follow their tutorial or play around in the shell. Regarding the tutorial, I can't seem to find the download of the archive they mention "Pig tutorial file (*.gz)". If anyone knows where that can be found let me know and I will post it.
