Joe Williams home
So I began building a new head cluster node in a KVM, just as a test run and to refine my methodology. I decided to drop Unicluster due to an unresolved issue, so this time around I installed everything myself. ... Java, check ... Hadoop, check ... Pig, check ... Grid Engine, check ... OpenMPI, check ... Ganglia, ugh ...

Ganglia seems to be an interesting beast. I built the SRPMs and then installed the RPMs for the "ganglia monitor core" without a problem; it was easy and quick. I then moved on to the "gexec execution environment", which includes gexec, gexecd, authd and libe. The first issue I ran into in building from the SRPMs was dependencies. I started with authd and ran into dependency issues during the build. Sadly the SPEC file did not list what the package requires. I then tried the prebuilt RPMs (found on Ganglia's SourceForge page). Even those didn't install, because they require some old OpenSSL libraries that are unavailable in CentOS 5.
[root@m ganglia]# rpm -qa | grep openssl
openssl-devel-0.9.8b-8.3.el5_0.2
openssl-0.9.8b-8.3.el5_0.2
openssl-devel-0.9.8b-8.3.el5_0.2
openssl-0.9.8b-8.3.el5_0.2
[root@m ganglia]# rpm -ivh authd-0.2.1-1.i386.rpm
error: Failed dependencies:
        libcrypto.so.2 is needed by authd-0.2.1-1.i386
        libssl.so.2 is needed by authd-0.2.1-1.i386
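A small helper can pull the missing capabilities out of rpm output like the above so they can be looked up one at a time. This is a minimal sketch: 'missing_deps' is a hypothetical name, and it assumes rpm's usual "X is needed by Y" wording with one dependency per line.

```shell
# Hypothetical helper (not part of rpm): list the capabilities that rpm
# reports as missing. Reads rpm's error text from stdin or from files.
# Assumes the usual "<capability> is needed by <package>" line format.
missing_deps() {
    sed -n 's/^[[:space:]]*\([^[:space:]]\{1,\}\) is needed by .*/\1/p' "$@" |
        LC_ALL=C sort -u
}
```

Typical use would be 'rpm -ivh authd-0.2.1-1.i386.rpm 2>&1 | missing_deps'. Each name printed can then be passed to 'yum provides' to see whether any package in the enabled repositories supplies it. 'rpm -qpR authd-0.2.1-1.i386.rpm' lists a package's requirements without attempting the install.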
So I went back to attempting to build the SRPM. I soon found out that the above libraries had nothing to do with the build issues I was seeing. My issue was that the libe library was missing. Once I built and installed libe, authd built and installed without a problem. Next, I attempted to build gexec. This had the same issue as authd: the SPEC file in the SRPM did not list any requirements, making it difficult to determine what needs to be installed as a prerequisite. I then started to investigate the errors I was seeing in the build:
gexec.c:39:33: error: ganglia/gexec_funcs.h: No such file or directory
Googling for this, I found a post on the Ganglia developers mailing list explaining that:
The gexec-0.3.6 available from http://www.theether.org/gexec does not build with 3.0.* versions of Ganglia. It builds correctly only with 2.* versions. If you want to build with Ganglia 3, edit the gexec.c to include /usr/include/ganglia.h and not /usr/include/ganglia/gexec_funcs.h. Of course, you have to have ganglia-devel installed for this to work. Another thing, in addition to the above, you have to add #include to gexec.c in order to successfully build the gexec.
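The include change described in that post could be scripted along these lines. This is only a sketch: 'patch_gexec_include' is a made-up name, and the pattern assumes the old include is written as ganglia/gexec_funcs.h in angle brackets or double quotes, as the compiler error suggests. The second change the post mentions is not covered, because the header name is missing from the quote.

```shell
# Hypothetical helper: swap the Ganglia 2.x header for the Ganglia 3.x one
# named in the mailing list post. Assumes the include line looks like
#   #include <ganglia/gexec_funcs.h>   or   #include "ganglia/gexec_funcs.h"
# so check the actual line in your copy of gexec.c first.
patch_gexec_include() {
    src="$1"
    # -i.orig keeps a backup (gexec.c.orig) so the change can be diffed
    # or reverted.
    sed -i.orig -E \
        's|^([[:space:]]*#[[:space:]]*include[[:space:]]*)[<"]ganglia/gexec_funcs\.h[>"]|\1<ganglia.h>|' \
        "$src"
}
```

Running 'patch_gexec_include gexec.c' inside the unpacked source tree and then 'diff gexec.c.orig gexec.c' should show a single changed line. As the post says, ganglia-devel has to be installed for the new header to be found.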
That advice worked, so I edited the gexec.c in the source tarball to include the above changes. My next attempt to build failed because the 'e/llist.h' include did not exist. 'locate' confirmed that it was not on my machine even though libe is installed. So I went back to that mailing list post and found this link:
http://svn.oscar.openclustergroup.org/svn/oscar-soc/soc-2006/hpcmetrics/ganglia/
Looking through the source I found http://svn.oscar.openclustergroup.org/svn/oscar-soc/soc-2006/hpcmetrics/ganglia/src/lib/llist.h and copied it into '/usr/include/e/'. This worked nicely, but as you might expect the build failed again. This time it was looking for libraries in '/lib' rather than '/lib64', which is to be expected since I am running x86_64. I symlinked the library into place and moved on. Now I am at an error that I haven't been able to figure out. My mailing list post describing the issue has not seen a reply.
gexec.c: In function ‘main’:
gexec.c:324: warning: ‘ips’ may be used uninitialized in this function
gcc -DHAVE_CONFIG_H -I. -I. -I. -I. -O2 -Wall -D_REENTRANT -g -D_GNU_SOURCE -DDEBUG -c gexec_options.c
gcc -O2 -Wall -D_REENTRANT -g -D_GNU_SOURCE -DDEBUG -o gexec -L. gexec.o gexec_options.o -lpthread -lgexec -le -lauth -lssl -lcrypto /usr/lib/libganglia.a -lssl -lpthread -lcrypto
/usr/lib/libganglia.a(ganglia.o): In function `gexec_cluster':
(.text+0x10c): undefined reference to `XML_ParserCreate'
/usr/lib/libganglia.a(ganglia.o): In function `gexec_cluster':
(.text+0x160): undefined reference to `XML_SetElementHandler'
/usr/lib/libganglia.a(ganglia.o): In function `gexec_cluster':
(.text+0x16b): undefined reference to `XML_SetUserData'
/usr/lib/libganglia.a(ganglia.o): In function `gexec_cluster':
(.text+0x178): undefined reference to `XML_GetBuffer'
/usr/lib/libganglia.a(ganglia.o): In function `gexec_cluster':
(.text+0x1c4): undefined reference to `XML_ParserFree'
/usr/lib/libganglia.a(ganglia.o): In function `gexec_cluster':
(.text+0x1f6): undefined reference to `XML_ParseBuffer'
/usr/lib/libganglia.a(ganglia.o): In function `gexec_cluster':
(.text+0x265): undefined reference to `XML_GetErrorCode'
/usr/lib/libganglia.a(ganglia.o): In function `gexec_cluster':
(.text+0x26c): undefined reference to `XML_ErrorString'
/usr/lib/libganglia.a(ganglia.o): In function `gexec_cluster':
(.text+0x277): undefined reference to `XML_GetCurrentLineNumber'
collect2: ld returned 1 exit status
make: *** [gexec] Error 1
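When a link fails like this, it can help to reduce the log to the unique missing symbol names and then check which installed library defines them. A minimal sketch, assuming GNU ld's message wording and GNU nm; both function names are hypothetical:

```shell
# Hypothetical helper: print each symbol from "undefined reference to"
# lines once. The "." before the capture group matches the opening quote
# character that ld puts in front of the symbol name.
undefined_symbols() {
    sed -n 's/.*undefined reference to .\([A-Za-z_][A-Za-z0-9_]*\).*/\1/p' "$@" |
        LC_ALL=C sort -u
}

# Hypothetical helper: print every shared library among the given paths
# that exports the symbol. Relies on GNU nm (-D reads the dynamic symbol
# table, --defined-only skips symbols the library merely references).
who_defines() {
    sym="$1"
    shift
    for lib in "$@"; do
        if nm -D --defined-only "$lib" 2>/dev/null | grep -q " $sym\$"; then
            echo "$lib"
        fi
    done
}
```

Saving the make output to a file and running 'undefined_symbols build.log' should list each missing name once, and something like 'who_defines XML_ParserCreate /usr/lib64/*.so' would show which shared libraries export it (that path is an assumption for an x86_64 system). Because a static archive such as /usr/lib/libganglia.a does not record its own dependencies, whichever library defines the symbols has to be named on the link line after the archive.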
After a bit of Googling, I found that these XML_* symbols come from expat. I installed expat-devel (as well as a number of other XML devel packages) and attempted to rebuild. Same thing, failure. Next, since the errors point at libganglia.a, I figured that perhaps it had not been built with expat support and needed to be rebuilt, so I rebuilt it with expat-devel installed. This failed with the same error as above. After looking at the docs I noticed that the Ganglia SPEC file does not include '--enable-gexec' in the configure step. I built the RPMs with this option and still ran into the error. I have attempted to build gexec from the SRPM as well as from straight source. In every case I get the above error. The error ("collect2: ld returned 1 exit status") suggests to me that there is a library (or libraries) missing. But at this point I'm not really sure at all. If I come up with something (outside of running gexec in standalone mode) I will be sure to post it. If anyone else out there knows what's up, post a comment.

This all leads me to the point of this post, which is ... why is setting this up so difficult? Truth be told I have no clue, but I don't think it should be. The Ganglia mailing list was helpful enough, but the documentation seems a little lacking should one run into any issues. One would think that if "The gexec-0.3.6 available from http://www.theether.org/gexec does not build with 3.0.* versions of Ganglia." this should be documented. I don't think that I am doing anything strange, and I am using CentOS 5, not some obscure distro.

You may be asking what all these problems with gexec have to do with Ganglia (a guy on the mailing list asked me just that: "What does this have to do with ganglia?"). Fair enough. Ganglia is not gexec and gexec is not Ganglia. My response was that the gexec SRPMs are downloadable side by side with all the Ganglia RPMs off of SourceForge.
This leads me to believe that questions to the Ganglia mailing list about gexec are not too far off base. Additionally, for someone who is trying to install these packages for the first time or is new to Ganglia, it seems that the mailing list would be the place to ask, as I imagine there are plenty of folks running gexec hosts in Ganglia. The Ganglia documentation even mentions gexec, saying that "integrating it with ganglia is a bit clumsy", but provides no information beyond how to run it in standalone mode and how to turn it off if you have configured it to be on by default. To boot, the gexec site hasn't been updated since 2004.

Next, you may think: if this is broken and the documentation sucks, why don't you fix it? It's an open source project. That's valid, and I will be happy to write up some documentation on how to build the RPMs for Ganglia and associated applications. For good measure I will even see if I can get it posted to the Ganglia wiki. Of course this hinges on me actually being able to build the RPMs and have everything work properly.

Lastly, here are a few lessons learned:

That's it for my rant. Thanks. :)