Hoop – Hadoop HDFS over HTTP

What is Hoop?

Hoop provides access to all Hadoop Distributed File System (HDFS) operations (read and write) over HTTP and HTTPS.

Hoop can be used to:

  • Access HDFS using HTTP REST.
  • Transfer data between clusters running different versions of Hadoop (thereby overcoming RPC versioning issues).
  • Access data in an HDFS cluster behind a firewall. The Hoop server acts as a gateway and is the only system allowed to cross the firewall.

Hoop has two components, a server and a client:

  • The Hoop server component is a REST HTTP gateway to HDFS that supports all file system operations. It can be accessed using standard HTTP tools (e.g. curl and wget), HTTP libraries from different programming languages (e.g. Perl, JavaScript), or the Hoop client. The Hoop server component is a standard Java web application implemented using Jersey (JAX-RS).
  • The Hoop client component is an implementation of the Hadoop FileSystem client that lets you use the familiar Hadoop filesystem API to access HDFS data through a Hoop server.

Hoop and Hadoop HDFS Proxy

Hoop server is a full rewrite of Hadoop HDFS Proxy. Although it is similar to Hadoop HDFS Proxy (it runs in a servlet container and provides a REST API with pluggable authentication and authorization), Hoop server addresses many of Hadoop HDFS Proxy's shortcomings by providing:

  • Support for all HDFS operations (read, write, status).
  • Cleaner HTTP REST API.
  • JSON format for status data (files status, operations status, error messages).
  • Kerberos HTTP SPNEGO client/server authentication and pseudo authentication out of the box (using Alfredo).
  • Hadoop proxy-user support.
  • Tools such as DistCp can run on either cluster.

Accessing HDFS files via Hoop using the Unix ‘curl’ command

Assuming Hoop is running on http://hoopbar:14000, the following examples show how the Unix ‘curl’ command can be used to access data in HDFS via Hoop using pseudo authentication.

Getting the home directory:

Reading a file:

Writing a file:

Listing the contents of a directory:

More details are available in the Hoop HTTP REST API documentation.
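
As a rough illustration of the four requests named above, the following Python sketch builds the URLs such requests might use. It is a sketch under stated assumptions, not the documented API: the `op` values (`homedir`, `create`, `list`), the user name `babu`, and the file paths are hypothetical, and the use of a `user.name` query parameter for pseudo authentication is assumed. The `offset` and `len` parameters are the query-string range parameters mentioned in the comments below.

```python
from urllib.parse import urlencode, quote

HOOP_BASE = "http://hoopbar:14000"  # server address used in the text above


def hoop_url(path="", user="babu", **params):
    """Build a Hoop request URL using pseudo authentication.

    ASSUMPTIONS: the identity is passed in a `user.name` query parameter
    and operations are selected with an `op` parameter. Neither is
    confirmed here against the Hoop REST API documentation.
    """
    query = dict(params)
    query["user.name"] = user
    # quote() leaves "/" intact, so HDFS paths keep their structure.
    return f"{HOOP_BASE}{quote(path)}?{urlencode(query)}"


# Hypothetical operation names and paths, for illustration only.
home_dir_url = hoop_url(op="homedir")
read_url = hoop_url("/user/babu/hello.txt")
read_range_url = hoop_url("/user/babu/hello.txt", offset=10, len=100)
create_url = hoop_url("/user/babu/data.txt", op="create")
list_url = hoop_url("/user/babu", op="list")

for url in (home_dir_url, read_url, read_range_url, create_url, list_url):
    print(url)
```

Each printed URL could then be passed to curl, for example with `curl -i "<url>"`. Writing a file would also need an HTTP method and a request body, which this sketch leaves out because their exact form is not given here.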

Getting Hoop

Hoop is distributed under the Apache License 2.0.

The source code is available at http://github.com/cloudera/hoop.

Instructions on how to build, install, and configure the Hoop server, along with the rest of the documentation, are available at http://cloudera.github.com/hoop.

Contributing Hoop to Apache Hadoop

The goal is to contribute Hoop to Apache Hadoop as the next generation of Hadoop HDFS Proxy. We are just waiting on the Mavenization of Hadoop Common and Hadoop HDFS, which will make integration easier.

19 Responses
  • Adrian / July 20, 2011 / 5:37 PM

    looks very exciting! WRT rest api, it would be nice to re-use http Range headers as opposed to a new offset syntax. Regardless, I’m very interested in adding Hoop support to jclouds.

  • Alejandro Abdelnur / July 21, 2011 / 1:07 PM

    Thanks, glad to hear you find Hoop useful.

    Regarding your question about using HTTP Range headers: we could easily add support for them, but I favor keeping the current query-string offset/len parameters, as they are easier to use and keep the request self-contained in the URL.

  • Keith / August 10, 2011 / 3:25 AM

    Server Dependencies not accessible for hoop-server, hoop-testng, hoop-webapp . 404 response on links:

    https://github.com/cloudera/hoop/hoop-server
    https://github.com/cloudera/hoop/hoop-testng
    https://github.com/cloudera/hoop/hoop-webapp

  • Alejandro Abdelnur / August 10, 2011 / 1:34 PM

    Keith,

    Because of the way the Hoop documentation is being built, those links should not be there. They are generated by the Maven dependencies report, which assumes that the documentation is aggregated from different projects. That is not the case in the current build logic.

    Thanks for pointing this out, we’ll see how to take care of this to avoid confusion.

  • Jon / February 13, 2012 / 10:25 AM

    hoop-client doesn’t appear to be in your maven repository as stated in the documentation:

    https://repository.cloudera.com/content/repositories/releases

  • Alejandro Abdelnur / February 16, 2012 / 9:36 AM

    Hi Jon,

    Hoop has been contributed to Apache Hadoop and has been part of it since Apache Hadoop version 0.23.0.

    As we never did a release on GitHub, artifacts were never published to the Maven repos. The documentation was not accurate, apologies for that.

    I’d suggest you check out HttpFS from Apache Hadoop; the HTTP REST API has been refined and it is compatible with WebHdfs.

    Thanks.

  • Rads / September 12, 2012 / 12:20 AM

    Hello,
    Does Hoop work with HTTPS? Can I use Hoop to read/write data from/into HDFS over HTTPS?

  • Alejandro Abdelnur / September 17, 2012 / 9:50 AM

    Hoop is no longer maintained. Instead, you should use Hadoop HttpFS, available in Hadoop 2 and CDH4. There is also a backport for CDH3, details at:

    http://blog.cloudera.com/blog/2012/08/httpfs-for-cdh3-the-hadoop-filesystem-over-http/

    Regarding your question about using Hoop (or HttpFS) over HTTPS: it is possible, just configure the web server to do HTTPS. Keep in mind that the Java API for webhdfs:// still does not support SSL, so you’ll be able to use tools like curl but not the Hadoop FileSystem API against webhdfs://
