HttpFS for CDH3 – The Apache Hadoop FileSystem over HTTP

HttpFS is an HTTP gateway/proxy for Apache Hadoop FileSystem implementations. HttpFS comes with CDH4 and replaces HdfsProxy (which only provided read access). Its REST API is compatible with WebHDFS (which is included in CDH4 and the upcoming CDH3u5).

HttpFS is a proxy, so unlike WebHDFS it does not require that clients be able to reach every machine in the cluster. This allows clients to access a cluster that is behind a firewall via the WebHDFS REST API. HttpFS also allows clients to access CDH3u4 clusters via the WebHDFS REST API.
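As a rough illustration of what a client request through the gateway looks like, the sketch below only builds a WebHDFS-style URL; it does not contact a cluster. The host name and user are hypothetical placeholders, port 14000 is assumed to be the HttpFS default for your installation, and the `user.name` query parameter assumes a cluster without Kerberos security.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, path, op, user, port=14000, **params):
    """Build a WebHDFS-style REST URL aimed at a single HttpFS gateway.

    host, user and the default port are illustrative assumptions;
    adjust them to match the actual gateway.
    """
    query = {"op": op, "user.name": user}
    query.update(params)  # extra operation parameters, if any
    return "http://{}:{}/webhdfs/v1{}?{}".format(
        host, port, quote(path), urlencode(query)
    )


# The client talks only to the gateway host, never to individual DataNodes.
url = webhdfs_url("httpfs.example.com", "/user/alice", "LISTSTATUS", "alice")
print(url)
```

Because every request goes to the one gateway address, only that host and port need to be reachable through the firewall.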

Given the constant interest in Hoop that we have seen from CDH3 users, we have backported Apache Hadoop HttpFS to work with CDH3.

Some background: Hoop has been contributed to Apache Hadoop and is now named HttpFS. Hoop was a preview technology, and its REST API changed significantly when it was contributed to Apache Hadoop, becoming compatible with the WebHDFS REST API (http://archive.cloudera.com/cdh4/cdh/4/hadoop/hadoop-yarn/hadoop-yarn-site/WebHDFS.html). HttpFS is part of Apache Hadoop 2.x and of CDH4.

CDH3 users can now use HttpFS instead of Hoop for HDFS access over HTTP. Using HttpFS facilitates a later upgrade to CDH4 as the HttpFS API in CDH3 is compatible with HttpFS in CDH4 and does not require application code changes when upgrading to CDH4.

HttpFS is not distributed with CDH3. The source for the HttpFS backport for CDH3 is available at https://github.com/cloudera/httpfs/. There is one branch for CDH3u4 and another branch for CDH3u5.

Limitations: The HttpFS CDH3 backport does not implement the delegation token operations. Delegation tokens are used by tools like DistCp when reading files from a secure cluster.
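A client that must work against both the CDH3 backport and CDH4 could guard against this limitation before sending a request. The sketch below is a hypothetical client-side check, not part of HttpFS; the operation names are the delegation token operations from the WebHDFS REST API, and the function name is made up for illustration.

```python
# Delegation token operations as named in the WebHDFS REST API.
DELEGATION_TOKEN_OPS = frozenset(
    {"GETDELEGATIONTOKEN", "RENEWDELEGATIONTOKEN", "CANCELDELEGATIONTOKEN"}
)


def check_supported_on_cdh3_backport(op):
    """Return the normalized operation name, or raise if the CDH3
    backport is not expected to implement it (hypothetical helper)."""
    name = op.upper()
    if name in DELEGATION_TOKEN_OPS:
        raise NotImplementedError(
            "{} is a delegation token operation, which the HttpFS "
            "CDH3 backport does not implement".format(name)
        )
    return name


print(check_supported_on_cdh3_backport("liststatus"))
```

Failing early on the client side gives a clearer error than whatever response the gateway returns for an unimplemented operation.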
