Running CDH 5 on GlusterFS 3.3


The following post was written by Jay Vyas (@jayunit100) and originally published in the Community.

I recently spent some time getting Cloudera’s CDH 5 distribution of Apache Hadoop to work on GlusterFS 3.3 using Distributed Replicated 2 volumes. This is possible because Apache Hadoop has a pluggable filesystem architecture, which allows the computational components within the CDH 5 distribution to be configured to use filesystems other than HDFS. In this case, one can configure CDH 5 to use the Hadoop FileSystem plugin for GlusterFS (glusterfs-hadoop), which allows it to run on GlusterFS 3.3. I’ve provided a diagram below that illustrates the CDH 5 core processes and how they interact with GlusterFS.
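To make the configuration step concrete, here is a minimal sketch that generates the kind of core-site.xml fragment a filesystem plugin relies on. The property names and values (fs.defaultFS set to glusterfs:///, and fs.glusterfs.impl pointing at org.apache.hadoop.fs.glusterfs.GlusterFileSystem) are assumptions based on common glusterfs-hadoop setups, not settings taken from this post, and the helper build_core_site is hypothetical. Check the glusterfs-hadoop project wiki for the exact values that match your plugin version.

```python
import xml.etree.ElementTree as ET

# ASSUMPTION: these property names and values are illustrative. They are
# not taken from this post; confirm them against the glusterfs-hadoop wiki.
GLUSTER_PROPERTIES = {
    # Default filesystem URI, so Hadoop components resolve paths on GlusterFS.
    "fs.defaultFS": "glusterfs:///",
    # Maps the glusterfs:// scheme to the plugin's FileSystem class.
    "fs.glusterfs.impl": "org.apache.hadoop.fs.glusterfs.GlusterFileSystem",
}


def build_core_site(properties):
    """Return a core-site.xml style document as a string (hypothetical helper)."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_core_site(GLUSTER_PROPERTIES))
```

The point of the sketch is only that switching filesystems is a configuration change: the Hadoop components keep using the same FileSystem API, and the URI scheme decides which implementation is loaded.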

Running a Single CDH 5 Deployment on One or More GlusterFS Volumes

Because the CDH 5 distribution includes other components besides YARN and MapReduce, I used the Apache Bigtop System Testing Framework to explicitly validate that Apache Sqoop, Apache Flume, Apache Pig, Apache Hive, Apache Oozie, Apache Mahout, Apache ZooKeeper, Apache Solr, and Apache HBase also ran successfully.

Work is Still in Progress to Enable the Use of Impala

If you would like to participate in accelerating the work on Impala, please reach out to us on the Gluster mailing list.

Implementation details for this solution and the specific setup required for all the components are available on the glusterfs-hadoop project wiki. If you have additional questions, feel free to reach out to me on FreeNode (IRC handle jayunit100), @jayunit100 on Twitter, or via the Gluster mailing list.


3 responses on “Running CDH 5 on GlusterFS 3.3”

  1. Justin Miller

    What are your thoughts on how this would affect performance? Wouldn’t a significant amount of overhead come with going FUSE -> Gluster vs. local disk?

    What is the benefit over HDFS? Are the benefits inherent in the capabilities of GlusterFS itself? I am still a newbie to Gluster, but I find it very interesting.

  2. Jay Vyas

    Hi Justin. I’m not sure about “advantages” vs. “disadvantages” of GlusterFS vs. HDFS. GlusterFS is a different FS with a totally different model for replication, security, and so on. Since it’s fully POSIX-based, there are certain conveniences in the area of security and usability. The nice thing is that GlusterFS’s Hadoop plugin is mature enough to handle the entire Hadoop ecosystem and, likewise, that CDH 5 is an HCFS-compatible Hadoop distribution.

  3. Justin Miller

    Thanks for your article and your response, Jay! I look forward to learning more about Gluster. I love having options!