With Cloudera Director, cloud deployments of Apache Hadoop are now as enterprise-ready as on-premise ones. Here’s the technology behind it.
As part of the recent Cloudera Enterprise 5.2 release, we unveiled Cloudera Director, a new product that delivers enterprise-class, self-service interaction with Hadoop clusters in cloud environments. (Cloudera Director is free to download and use, but commercial support requires a Cloudera Enterprise subscription.) It provides a centralized administrative view for cloud deployments and lets end users provision and scale clusters themselves using automated, repeatable, managed processes. To summarize, the same enterprise-grade capabilities that are available with on-premise deployments are now also available for cloud deployments. (For an overview of and motivation for Cloudera Director, please check out this blog post.)
In this post, you’ll learn about some of the technology behind Cloudera Director and why one would use it.
From the outset, Cloudera Director was designed to be cloud-neutral, which translates to support for different cloud providers (as well as both private and public clouds). Its data model is therefore abstracted away from the specific architecture of any single provider. Here are some of the key concepts in that model.
- Environment – An environment maps to a cloud provider. Each environment has a unique name that you provide, and can contain configuration data and SSH credentials specific to your provider account. Today, Cloudera Director supports Amazon Web Services (AWS), and support for more providers is planned.
- Instance – An instance represents a computing resource that you provision from the cloud provider. An instance is generated from an instance template, which gives specifications such as memory size, storage capacity and type, and operating system. Under AWS, for example, details such as the EC2 instance type and AMI describe the instance.
- Deployment – A deployment maps to an instance of Cloudera Manager, Cloudera’s management application for Hadoop and enterprise data hubs. A deployment is hosted by an environment and resides on an instance that you specify.
- Cluster – A cluster defines the instances that run the components of your enterprise data hub, such as HDFS, YARN, Apache HBase, Apache Hive, and Impala. Each cluster that you define is created and managed by a deployment. So, after Cloudera Director has readied a cluster, Cloudera Manager capabilities such as monitoring, security configuration, and auditing are available right away. You may host multiple clusters under a single deployment.
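The relationships among these four concepts can be sketched in a few lines of Python. This is only an illustration of the containment hierarchy described above, not Cloudera Director's actual data model: every class name, field name, and value below (including the instance type and image ID) is a hypothetical placeholder.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class InstanceTemplate:
    """Specification from which instances are generated (hypothetical fields)."""
    name: str
    instance_type: str  # under AWS, an EC2 instance type
    image: str          # under AWS, an AMI ID


@dataclass
class Cluster:
    """The instances that run the enterprise data hub components."""
    name: str
    services: List[str]
    instance_count: int
    template: InstanceTemplate


@dataclass
class Deployment:
    """One Cloudera Manager instance, which manages zero or more clusters."""
    name: str
    manager_template: InstanceTemplate
    clusters: Dict[str, Cluster] = field(default_factory=dict)

    def add_cluster(self, cluster: Cluster) -> None:
        self.clusters[cluster.name] = cluster


@dataclass
class Environment:
    """A cloud provider account, which hosts deployments."""
    name: str
    provider: str
    deployments: Dict[str, Deployment] = field(default_factory=dict)

    def add_deployment(self, deployment: Deployment) -> None:
        self.deployments[deployment.name] = deployment


# Placeholder values for illustration only.
template = InstanceTemplate(name="worker", instance_type="m3.xlarge", image="ami-00000000")
env = Environment(name="dev", provider="aws")
deployment = Deployment(name="cm-1", manager_template=template)
env.add_deployment(deployment)
deployment.add_cluster(
    Cluster(name="analytics", services=["HDFS", "YARN"], instance_count=3, template=template)
)
```

Here an environment holds deployments and a deployment holds clusters, mirroring the statement that you may host multiple clusters under a single deployment.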
Cloudera Director acts as the interface to your cloud provider (environment), calling the provider-specific API to create, replicate, and terminate deployments and clusters. This approach lets you interact with cloud-hosted clusters just as you would with on-premise clusters, while benefiting from the advantages of cloud computing.
Cloudera Director includes a server component that you can use as a central location for your administrators and users to manage cloud deployments. The server is designed around an API that provides access to the complete set of capabilities Cloudera Director has to offer.
The Cloudera Director API follows RESTful principles and uses JSON as its data interchange format. Service requests and responses travel over HTTP, optionally secured with TLS. Documentation on the API, generated using Swagger, is hosted on the server itself. The Swagger API console includes live forms that developers can use to explore, design, and troubleshoot their work.
Clients can interact with the API to manage environments, deployments, clusters, instance templates, and users of Cloudera Director. Calls are made to create, read, update, and delete each of these items, and Cloudera Director handles the details.
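As a rough sketch of what a create, read, update, or delete call might look like from a client, the Python below builds a JSON request without sending it. The base URL, the "/environments" path, and the payload are made up for illustration, and the mapping of actions to HTTP methods is the common REST convention rather than something confirmed here; the Swagger console on your server is the place to find the real resource paths and verbs.

```python
import json
import urllib.request
from typing import Any, Optional

# Hypothetical placeholder; replace with the address of your own server.
BASE_URL = "http://director.example.com"

# Conventional REST mapping of CRUD actions to HTTP methods (an assumption;
# confirm the verbs each resource accepts in the server's Swagger console).
METHODS = {"create": "POST", "read": "GET", "update": "PUT", "delete": "DELETE"}


def build_request(
    action: str, path: str, payload: Optional[Any] = None
) -> urllib.request.Request:
    """Build (but do not send) a JSON request for one CRUD action."""
    headers = {"Accept": "application/json"}
    data = None
    if payload is not None:
        # JSON is the interchange format, so the body is serialized as JSON.
        data = json.dumps(payload).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(
        BASE_URL + path, data=data, headers=headers, method=METHODS[action]
    )


# "/environments" is a made-up path used only for illustration.
request = build_request("create", "/environments", {"name": "dev"})
print(request.get_method(), request.full_url)
```

Passing the resulting object to `urllib.request.urlopen` would send it; that step is left out so the sketch runs without a server.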
By default, the server enforces user authentication and authorization using a simple internal user database. Clients can use HTTP basic authentication or access a specific “login” service to authenticate and go on to make more calls. The API itself gives administrators access to tailor user accounts and add new ones to fit their needs.
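For clients that choose HTTP basic authentication, the header is easy to build by hand. The sketch below follows the standard scheme (the user name and password joined by a colon, then Base64-encoded); the credentials shown are placeholders, and nothing here is specific to Cloudera Director.

```python
import base64
from typing import Dict


def basic_auth_header(username: str, password: str) -> Dict[str, str]:
    """Return the Authorization header for HTTP basic authentication."""
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return {"Authorization": "Basic " + token}


# Placeholder credentials for illustration only.
header = basic_auth_header("alice", "example-password")
```

Because Base64 is an encoding and not encryption, basic authentication is best paired with the TLS option so that credentials do not cross the network in readable form.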
The Cloudera Director server hosts a UI available through your browser. The UI dashboard shows you at a glance the set of environments that are available and a list of deployments (Cloudera Manager instances) and clusters managed by each deployment. From the dashboard, you can perform many of the same actions that are available through the API.
In fact, a key design feature of the UI is that it relies completely on the API to work. That means that developers can be sure they have access to the full range of capabilities in Cloudera Director when they code to the API. It also ensures that users working through the UI and the API are acting on the same cluster information.
The UI does offer some features over and above the API that make working with Cloudera Director easier:
- Wizards are integrated into the UI to make the process of defining new environments, deployments, and clusters straightforward.
- Special wizards are also available for the higher-level tasks of adding compute nodes to a cluster (“growing” a cluster) and cloning a cluster from an existing one.
- The UI performs some client-side validations to help guide users.
In addition to the server component, Cloudera Director provides a client tool. The client provides many of the same capabilities as the server but in a standalone form; you can use the client to stand up, check status on, update, and terminate clusters. The client is a good choice for integrating with scripts, build servers, and other automated tools.
The client gets its cluster definitions from a configuration file written using the HOCON data format, which is based on JSON. The configuration file is a blueprint that completely describes the environment, deployment, and instances for a cluster. Because the configuration file is plain text, it can be stored in a version control system, where changes to it are tracked and reviewed.
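Since HOCON is a superset of JSON, any valid JSON document is also valid HOCON, so one way to produce such a file from a script is to serialize a dictionary as JSON. The sketch below does that; all of the keys and values are hypothetical placeholders chosen to echo the concepts in this post, not the schema the client actually expects, so consult the sample configuration files that ship with the client for the real settings.

```python
import json
from typing import Any, Dict

# Hypothetical structure: these keys are NOT the client's real schema.
cluster_definition: Dict[str, Any] = {
    "name": "example-cluster",
    "provider": {"type": "aws", "region": "us-east-1"},
    "instances": {
        "example-template": {"type": "m3.xlarge", "image": "ami-00000000"},
    },
    "deployment": {"instance": "example-template"},
    "cluster": {
        "services": ["HDFS", "YARN"],
        "workers": {"count": 3, "instance": "example-template"},
    },
}


def render_config(definition: Dict[str, Any]) -> str:
    """Serialize a definition as JSON text, which is also valid HOCON."""
    # Sorted keys and fixed indentation keep the output stable between runs,
    # so version-control diffs show only real changes.
    return json.dumps(definition, indent=2, sort_keys=True) + "\n"


config_text = render_config(cluster_definition)
```

Writing `config_text` to a file and committing it gives a plain-text blueprint whose history can be reviewed like any other source file.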
Although the client can be used on its own without the server, it can also ask the server to stand up a new cluster using its bootstrap-remote command. As with the user interface, the client uses the server API for this operation, meaning that the new deployment information is integrated into the server.
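For the script and build-server integrations mentioned earlier, the client can be driven from Python by assembling its command line. In the sketch below only the bootstrap-remote command name comes from this post; the executable name, the position of the configuration file argument, and the file name are assumptions, and any options for pointing the client at a server are omitted because they are not described here.

```python
import shlex
from typing import List


def bootstrap_remote_command(
    config_path: str, executable: str = "cloudera-director"
) -> List[str]:
    """Assemble an argument list for the client's bootstrap-remote command.

    The executable name and argument order are assumptions; check the
    client's own help output for the exact invocation.
    """
    return [executable, "bootstrap-remote", config_path]


# "cluster.conf" is a placeholder file name.
command = bootstrap_remote_command("cluster.conf")
print(shlex.join(command))
```

Passing `command` to `subprocess.run(command, check=True)` would execute it and raise an error on a nonzero exit status, which is usually what a build script wants.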
The diagram below illustrates how components in Cloudera Director interact. While AWS is the cloud provider shown, the picture would look very similar with any other cloud provider.
As you have now learned, Cloudera Director is built on a solid technical foundation for managing the enterprise data hub on your cloud provider. Future releases will build on this foundation with new and expanded features, including:
- Support for more cloud providers
- Expanded automation of typical deployment setups
- More self-service capabilities
- Pre-built API clients for various programming languages
Bill Havanki is a Software Engineer at Cloudera.