How-to: Deploy Hadoop Clusters Automatically with Dell Crowbar and Cloudera Manager
The following guest post, from Mike Pittaro of Dell’s Cloud Software Solutions team, describes his team’s use of the Dell Crowbar tool in conjunction with the Cloudera Manager API to automate cluster provisioning. Thanks, Mike!
Deploying, managing, and operating Apache Hadoop clusters can be complex at all levels of the stack, from the hardware on up. To hide this complexity and reduce deployment time, since 2011, Dell has been using Dell Crowbar in conjunction with Cloudera Manager to deploy the Dell | Cloudera Solution for Apache Hadoop for joint customers.
Cloudera Manager does a great job of deploying and managing the Hadoop layers of a cluster, but it depends on an operating system to be in place first. Meanwhile, to complement those capabilities, Dell Crowbar is a complete automated operations platform, designed to deploy layers of infrastructure on bare-metal servers and all the way up the stack.
In the Dell | Cloudera Solution, we use Crowbar to provision the hardware, configure it, and install Red Hat Enterprise Linux and Cloudera Manager — then, Cloudera Manager takes over to guide the user to a functioning cluster. Furthermore, we use the Cloudera Manager API to entirely automate the cluster setup process.
In this post, I’ll provide more details about how we have successfully integrated the Cloudera Manager API with Dell Crowbar.
Dell Crowbar, inspired by DevOps principles, is based on the concept of defining hardware and software configuration in code, just like applications. Crowbar uses a modular approach, where each component of the stack is deployed as an independent unit. The definition of each component is a Crowbar module called a “Barclamp”.
Figure 1 shows the Barclamps included in a typical Hadoop installation — including hardware configurations and functions like DNS and NTP — all the way up to the Cloudera components.
Crowbar Barclamps for Hadoop
During the deployment process, the Crowbar interface is used to create a “proposal” based on a Barclamp. Within the proposal editor, the hardware nodes are assigned their intended roles, and then the proposal is saved and applied. A single node can have several roles assigned depending on its function, and Barclamps are aware of dependencies between roles. For example, in order to deploy CDH, the single Cloudera Manager proposal is applied and subsequently Crowbar takes care of all the other requirements.
Cloudera Manager Barclamp
Figure 2 shows the Cloudera Manager proposal within Crowbar. The Barclamp defines seven roles available for nodes within the cluster:
- Clouderamanager-cb-adminnode – the node running the Crowbar
- Clouderamanager-server – the node running the Cloudera manager server
- Clouderamanager-namenode – nodes running the name server, whether active/passive or quorum HA is being used.
- Clouderamanager-datanode – the cluster data nodes
- Clouderamanager-edgenode – an edge or gateway node for client tools
- Clouderamanager-ha-journaling node – a quorum-based journaling node for quorum HA
- Clouderamanager-ha-filernode – an NFS filer node, for active/passive HA using a shared NFS mount
In the Crowbar interface, available hardware nodes are dragged to the appropriate roles in the proposal, and then the proposal is applied. At that point, Crowbar analyzes dependencies and makes any required changes to the nodes. If the hardware nodes are new, these changes might involve hardware configuration and a complete OS install. If the nodes already have an OS installed, the changes simply might involve installing some additional packages.
In the proposal editor, the roles defined within Crowbar closely correspond to the typical roles of nodes in a Hadoop cluster and can be mapped almost directly to their corresponding Hadoop services within Cloudera Manager. So, we decided to use this information to integrate with the Cloudera Manager API.
The integration is enabled by setting the deployment mode in the proposal to “auto mode”, as shown in Figure 3. In auto mode, applying the proposal will configure the hardware and OS as necessary, install Cloudera Manager on the edge node, install the agents on the remaining nodes, and then use the Cloudera Manager API to create the cluster based on the Crowbar roles. The install also sets up a local Yum repository for all the CDH packages, so the install can proceed without full Internet access.
Auto-deployment mode in the Cloudera Manager Barclamp
Behind the Scenes with the API
Crowbar is implemented primarily in Ruby, and the Barclamps are wrappers around Opscode Chef, which also uses Ruby for its recipes and cookbooks. This means that any automatic deployment implementation in Crowbar is also written in Ruby.
The Cloudera Manager API provides client libraries for Java and Python, but not Ruby, so the first step in implementing the integration was to create a Ruby client library for the API. The API is standards based, using REST and JSON as a data format, so this was a relatively straightforward process. The Ruby library parallels the Python implementation. (The library is currently Apache licensed as part of Crowbar, but could be split out in the future if there’s community interest.)
The actual cluster deployment logic is in the file cm-server.rb. It turns out to be relatively straightforward, and follows the flow described in “How-to: Automate Your Hadoop Cluster from Java”. The information about the nodes, their roles, and their configuration are already available in Crowbar, so the code primarily iterates through the Crowbar data structures, and makes the calls to create the cluster, the HDFS service, and the MapReduce Service. Since Cloudera Manager also uses a role-based approach, this mapping turns out to be very clean.
There’s even some logic there to handle licensing. The Cloudera Manager API corresponds directly to Cloudera’s packaging, which includes Standard (free) and Enterprise (paid support, indemnification, and enterprise features, with a 60-day trial option) versions. If a Cloudera license is entered in the Crowbar proposal it is used; otherwise, the Enterprise trial license is activated.
Our current integration uses the free APIs for HDFS and MapReduce configuration. So far, the automatic deployment capability has significantly reduced the time to deploy a cluster, especially if there are a large number of nodes. More important, it eliminates the error-prone manual process of re-entering the node and role information into the Cloudera Manager wizard. (This is a new feature, and we’re still collecting feedback on enhancements.)
We currently set up the core HDFS and MapReduce services for the cluster. In the future, we will likely configure more services as part of the automatic deployment. Cloudera Manager role groupsprovide another interesting opportunity for further integration, since we have information about the actual hardware configuration in Crowbar, and can start grouping nodes together based on hardware-specific features.
In the meantime, Dell and Cloudera continue to collaborate on other new features and integration in order to help customers more easily set up and manage Hadoop clusters.
Mike Pittaro (@pmikeyp) is Principal Architect on Dell’s Cloud Software Solutions team. He has a background in high performance computing, data warehousing, and distributed systems, specializing in designing and developing big data solutions.