How-to: Write Apache Hadoop Applications on OpenShift with Kite SDK


The combination of OpenShift and Kite SDK turns out to be an effective one for developing and testing Apache Hadoop applications.

At Cloudera, our engineers develop a variety of applications on top of Hadoop to solve our own data needs (here and here). More recently, we’ve started to streamline our development process by using a PaaS (Platform-as-a-Service) for some of these applications. Single-click deployment and updates to consistent development environments let us onboard new developers more quickly, and help ensure that code is written and tested against patterns that promote high quality.

The PaaS we’ve chosen is Red Hat OpenShift. OpenShift is an open hybrid solution, which means it’s open source and can be either self-hosted or accessed as a hosted solution. To date, we’ve been able to prototype our examples in the hosted solution, but are moving toward self-hosting on our internal OpenStack infrastructure.

Kite: An API for Data

The Kite SDK is an open source toolkit for building Hadoop applications. The data module provides APIs and tools for working with datasets without the developer having to know about low-level details such as file formats, layout on disk, and so on. Kite provides two main implementations of the dataset API: one for Apache Hive (where data is stored in HDFS) and one for Apache HBase.
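
To give a flavor of the dataset API, here is a minimal sketch that creates a Hive-backed dataset and writes one record. It assumes a Hive metastore is reachable (for example, via the minicluster described below); the dataset URI, the event.avsc schema file, and its message field are illustrative placeholders, not part of the examples discussed in this post.

    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.generic.GenericRecordBuilder;
    import org.kitesdk.data.Dataset;
    import org.kitesdk.data.DatasetDescriptor;
    import org.kitesdk.data.DatasetWriter;
    import org.kitesdk.data.Datasets;

    public class CreateAndWrite {
      public static void main(String[] args) throws Exception {
        // Describe the dataset with an Avro schema loaded from the classpath.
        DatasetDescriptor descriptor = new DatasetDescriptor.Builder()
            .schemaUri("resource:event.avsc")  // placeholder schema with a string "message" field
            .build();

        // Create a Hive-backed dataset; Kite handles the file format and layout on disk.
        Dataset<GenericRecord> events =
            Datasets.create("dataset:hive:events", descriptor, GenericRecord.class);

        DatasetWriter<GenericRecord> writer = events.newWriter();
        try {
          GenericRecord event = new GenericRecordBuilder(descriptor.getSchema())
              .set("message", "hello from Kite")
              .build();
          writer.write(event);
        } finally {
          writer.close();
        }
      }
    }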

We have written two Kite applications that run on OpenShift. The first (the Kite OpenShift Logging Example) uses a Hive dataset, and the second (the Kite OpenShift Spring HBase Example) uses an HBase dataset.

In the remainder of this post, we’ll describe the development lifecycle for these applications. To learn more about using the Kite APIs and tools to write Hadoop applications, consult the Kite SDK documentation, the Kite Examples, and the Kite OpenShift Example code (referenced in the previous paragraph).

Development Miniclusters

One of the challenges of developing Hadoop applications is the slow development cycle caused by the overhead of configuring a Hadoop cluster on which to do testing. For example, setting up a data pipeline involves a lot of manual steps such as configuring Apache Flume agents and setting up HDFS permissions. Running a Hadoop cluster in a local VM (like Cloudera’s QuickStart VM) is a lot more convenient, but there are still multiple configuration steps to run through.

The approach that projects in the Hadoop ecosystem take for internal tests is to use miniclusters—lightweight clusters that are fully functional yet run within the same JVM as the test code. They start up quickly and can be configured programmatically. However, being an internal Hadoop test artifact, miniclusters aren’t very approachable for developers who want to write applications that run on Hadoop. In addition, they are not easy to use together, so hooking up a Hive minicluster to an HDFS minicluster, for example, requires some knowledge of the internals of both.

We decided that it would be very useful to have a cohesive minicluster that can run multiple services simultaneously, so we set about wiring the various component miniclusters together. At the time of this writing, our minicluster can run services for HDFS, Apache ZooKeeper, HBase, Hive (both Metastore and HiveServer2), and Flume.

Starting the minicluster using the Java API is straightforward and uses a builder pattern:
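
The sketch below shows the general shape of that builder usage. The class and method names, in particular the Flume-related flumeConfiguration and flumeAgentName calls in the org.kitesdk.minicluster module, are from memory, so verify them against the Kite minicluster Javadoc.

    import org.kitesdk.minicluster.FlumeService;
    import org.kitesdk.minicluster.HdfsService;
    import org.kitesdk.minicluster.HiveService;
    import org.kitesdk.minicluster.MiniCluster;

    public class StartMiniCluster {
      public static void main(String[] args) throws Exception {
        MiniCluster miniCluster = new MiniCluster.Builder()
            .workDir("/tmp/kite-minicluster")        // default working directory
            .clean(true)                             // clear the working directory before starting
            .addService(HdfsService.class)
            .addService(HiveService.class)
            .addService(FlumeService.class)
            .flumeConfiguration("flume.properties")  // Flume config read from the classpath
            .flumeAgentName("tier1")
            .build();
        miniCluster.start();
      }
    }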

This snippet starts a minicluster running HDFS, Hive, and Flume. The Flume agent called tier1 is configured using the flume.properties file from the classpath.

The HDFS and Hive services do not need any configuration by default, but it is easy to configure them by placing the standard hdfs-site.xml and hive-site.xml files on the classpath.

A minicluster stores its working state in a local directory (/tmp/kite-minicluster by default, overridden via the workDir() method). The clean() method on the builder instructs the minicluster to start afresh by clearing the working directory before starting.

You can see an example of using a minicluster embedded in a web app in the Kite OpenShift Logging Example. The web app uses a ServletContextListener to start and stop the minicluster as a part of the webapp’s lifecycle. The main part of the web app uses Kite’s dataset API to write data into Hadoop using Flume and then displays the data on a web page using the Hive JDBC driver.
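
A minimal sketch of such a listener is shown below. This is not the example’s actual code; the minicluster calls mirror the snippet above and carry the same caveats about exact class and method names.

    import javax.servlet.ServletContextEvent;
    import javax.servlet.ServletContextListener;
    import org.kitesdk.minicluster.HdfsService;
    import org.kitesdk.minicluster.HiveService;
    import org.kitesdk.minicluster.MiniCluster;

    public class MiniClusterLifecycleListener implements ServletContextListener {

      private MiniCluster miniCluster;

      @Override
      public void contextInitialized(ServletContextEvent event) {
        try {
          // Build and start the minicluster when the webapp starts.
          miniCluster = new MiniCluster.Builder()
              .workDir("/tmp/kite-minicluster")
              .clean(true)
              .addService(HdfsService.class)
              .addService(HiveService.class)
              .build();
          miniCluster.start();
        } catch (Exception e) {
          throw new RuntimeException("Could not start the Kite minicluster", e);
        }
      }

      @Override
      public void contextDestroyed(ServletContextEvent event) {
        try {
          if (miniCluster != null) {
            miniCluster.stop();  // stop the Hadoop services when the webapp shuts down
          }
        } catch (Exception e) {
          // best effort on shutdown
        }
      }
    }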

The Minicluster CLI

Kite has a CLI for running the minicluster, which can be useful for quickly spinning up a cluster locally during development. The CLI is also used by the OpenShift Hadoop cartridge described later in this post.

The following CLI invocation is equivalent to the Java snippet above:
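
A rough sketch of that invocation is shown below. The service names follow the ./kite-minicluster run pattern used later in this post, but the options for pointing the Flume service at flume.properties and the tier1 agent are not shown here, so check the CLI help for the exact flags.

    # Sketch only: run HDFS, Hive, and Flume services from the CLI.
    # Consult the CLI help for the Flume agent name and properties file options.
    ./kite-minicluster run hdfs hive flume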

The Minicluster on OpenShift

A PaaS provides the infrastructure required for applications to run. This could include web frameworks, databases, monitoring services, and more. In the OpenShift world, the container through which these components are deployed is called a cartridge. To allow a database to be deployed to OpenShift, a cartridge must exist that is capable of installing, configuring, and running that database. The Kite GitHub repository now includes an OpenShift Minicluster cartridge.

Using the Kite minicluster cartridge is a great alternative to embedding the minicluster in an application. One big benefit is faster application deployment, which is crucial for maintaining an agile development workflow. Another is that multiple applications can use the minicluster without depending on whichever application happens to embed it.

You can see an example Spring MVC web application that uses the Kite minicluster cartridge deployed to OpenShift. The application has a Maven build profile called openshift that sets the connection settings to the environment variable values that the Kite minicluster cartridge exports on the OpenShift system. The OpenShift application deployment workflow builds the application with the openshift profile by default.

The biggest challenge in getting the minicluster to run on OpenShift was making it compatible with the IP and port binding restrictions that OpenShift puts in place. Applications running in OpenShift may only bind to a private IP address that is exposed through an environment variable on the system; binding to the wildcard address (0.0.0.0) or localhost (127.0.0.1) produces a “permission denied” error. The component miniclusters that the Kite minicluster wraps sometimes hardcode configurations that can’t be overridden, and the IP addresses to which they bind are one such configuration that proved problematic on OpenShift.

We were able to work around these restrictions with a variety of techniques. One is a Hadoop configuration implementation that forces the IP bind settings to the values we supply, even when the internal minicluster code tries to override them. In other cases we had to re-implement some of the component minicluster code.
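
As a rough illustration of the first technique (a sketch under our own naming, not Kite’s actual implementation), a Hadoop Configuration subclass can pin selected keys so that later set() calls from minicluster internals are ignored:

    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.conf.Configuration;

    /** Sketch only: a Configuration that keeps forced keys at the values we supply. */
    public class ForcedConfiguration extends Configuration {

      private final Map<String, String> forced = new HashMap<String, String>();

      public ForcedConfiguration(Configuration base, Map<String, String> forcedValues) {
        super(base);
        for (Map.Entry<String, String> entry : forcedValues.entrySet()) {
          forced.put(entry.getKey(), entry.getValue());
          super.set(entry.getKey(), entry.getValue());
        }
      }

      @Override
      public void set(String name, String value) {
        // Keep our value for forced keys (for example, bind-address properties),
        // even when internal minicluster code tries to override them.
        super.set(name, forced.containsKey(name) ? forced.get(name) : value);
      }
    }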

After running the minicluster with the command ./kite-minicluster run hdfs zk hbase -b 192.168.0.135, all sockets opened for listening by the process are bound to the IP 192.168.0.135.

Trying the Examples on OpenShift

To try out the web example in OpenShift, you first need to sign up for an OpenShift account and install the client tools. Due to the resource requirements of the minicluster, you need to have an account that allows you to run larger instances. (This is not possible with the free account.) Next, create a web application called “snapshot” with the following commands:
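
The commands look roughly like the following. The web cartridge, gear size, storage amount, and the minicluster cartridge URL are placeholders, so take the exact values from the kite-spring-hbase-example README.

    # Create the app with a web cartridge plus the Kite minicluster cartridge (placeholder values).
    rhc app create snapshot jbossews-2.0 <kite-minicluster-cartridge-url> --gear-size large
    # Raise the storage quota on the web cartridge so the Maven build has room for dependencies.
    rhc cartridge storage jbossews-2.0 --app snapshot --set 2GB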

The app create command creates a barebones web application with the Kite minicluster running as a cartridge for the web application to access. It also creates a git repository on your local host in the directory from which you ran the command; when you add your web application code to this git repository and push it, the code is deployed to the environment. The cartridge storage command increases the amount of storage space available to the web application, which needs to be raised so there is enough space to pull in the dependencies during the build.

Next, add your code to the repository and push it to the server. Do that with the following commands:
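
Sketched out, the steps look like this (the repository URL is a placeholder for the kite-spring-hbase-example repository on GitHub):

    cd snapshot                                            # the local repo created by rhc app create
    git remote add upstream <kite-spring-hbase-example-repo-url>
    git pull upstream master                               # merge the example code into the repo
    git push origin master                                 # push to OpenShift, which builds and deploys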

First move into the git repository, and then add the kite-spring-hbase-example repository as an upstream remote repo. This step will allow you to merge that code into the empty repo with the git pull command. Finally, push that code to the OpenShift server, which will cause OpenShift to build and deploy the code you pushed.

Those of you with OpenShift experience may be wondering why you can’t just add the URL of the example repository to the app create command in the first step, which would tell OpenShift to use that repository as the base. The answer is that Kite and Hadoop have many dependencies that Maven must download during the build, and the app create command has a timeout of ~230 seconds. That isn’t enough time to download all the dependencies when the application is built for the first time, so the command fails. This limitation necessitates the workaround of adding the code to the empty repository as explained above.

Conclusion

The two examples that are compatible with OpenShift (logging and web) have directions in their project READMEs that detail how to deploy them. These examples are a great way to get started with both Kite and OpenShift, and provide a solid foundation on which you can build your own applications.

Until now, the routine path to developing on top of the Hadoop stack required either access to an existing (and properly configured) Hadoop cluster or installing a VM on the developer’s local workstation, an approach that costs both resources and time to bootstrap properly. The combination of Kite’s minicluster and OpenShift provides a much lower barrier to entry for developing your own applications on top of this powerful stack.

Adam Warrington is an Engineering Manager on the customer operations team at Cloudera.

Tom White is a Software Engineer at Cloudera, a committer/PMC member for Apache Hadoop and multiple other Hadoop ecosystem projects, and the author of the popular O’Reilly Media book, Hadoop: The Definitive Guide.
