How-to: Extend Cloudera Manager with Custom Service Descriptors

Thanks to Jonathan Natkins of WibiData for the post below about how his company extended Cloudera Manager to manage Kiji. Learn more about Kiji and the organizations using it to build real-time HBase applications at Kiji Sessions, happening on May 6, 2014, the day after HBaseCon.

As a partner of Cloudera, WibiData sees Cloudera Manager’s new extensibility framework as one of the most exciting parts of Cloudera Enterprise 5. Cloudera Manager 5.0.0 provides the single-pane view that Apache Hadoop administrators and operators want to effectively manage a cluster of machines. Additionally, Cloudera Manager now offers tight integration for partners to plug into the CDH ecosystem, which benefits Cloudera as well as WibiData.

When I heard the framework was being built, I jumped at the opportunity to integrate the Kiji framework with Cloudera Manager, and to enable operators to manage our components through an interface with which they’re already very familiar. In this article, I’ll explain how others can integrate with Cloudera Manager by developing a custom service descriptor (CSD).

Kiji is a set of primarily client-side libraries that provide a schema management, schema evolution, and real-time predictive model execution system for Apache HBase. It is generally used for developing personalized applications like recommendation systems. For the first version of the Kiji CSD, we enabled users to configure and operate KijiREST servers, which provide a REST interface to Kiji, through Cloudera Manager. Users can create a Kiji service, which allows them to assign KijiREST Server roles to hosts managed by Cloudera Manager, and maintain the servers through the CM UI.

Anatomy of a Cloudera Manager Service

At the core of the Cloudera Manager system is the concept of a service. For example, HDFS, MapReduce, HBase, and so on, are all distinct services. Each service has a set of daemons or servers that can be assigned to host machines in a cluster. In HDFS, this would be the NameNode, Secondary NameNode (if you’re not running in HA mode), DataNodes, and Gateways. All of these are collectively referred to as the roles for the HDFS service.

Both services and roles may have commands and configuration parameters assigned to them. A service command is typically a command that affects every role within a service instance, and a service configuration parameter is a parameter that either affects all the roles or is required by all the roles in a service.

Unsurprisingly, role commands and parameters pertain only to a single role within the service. For example, it might make sense for a Secondary NameNode to have a role command to force a checkpoint, but it makes less sense for that command to be applied to a DataNode. By contrast, HDFS root directory creation is a service command, since it affects the entire service.

Gateways are a special type of role, since they aren’t actually a server or daemon. Rather, the Gateway role denotes that a host will be used by clients to access the service, often by submitting jobs from that node. Assigning a Gateway role to the host will allow client configurations for the service to be deployed to that host.

When Does a CSD Make Sense?

A CSD can be a very useful tool for a Hadoop cluster administrator. Many applications depend on auxiliary systems that run in and around the Hadoop cluster, and there’s a lot of convenience in having a single pane of glass for an entire Hadoop application. This was a major reason why it made sense to build a CSD for Kiji.

CSDs are most useful if you have server components that you want to manage, but they can also be useful for systems that need to designate gateway machines. Using a CSD allows you to start and stop servers, define custom actions where necessary, and monitor status from within Cloudera Manager. By defining a Gateway role, you also gain the ability to place relevant configuration files into a known location, so that end users can access the configs for their own use.

Structure of a CSD

At its core, a CSD is nothing more than a JAR file with a known directory structure:

jubjubbird:tmp natty$ ls
KIJI-1.0.jar
jubjubbird:tmp natty$ jar -xvvf KIJI-1.0.jar
  created: META-INF/
 inflated: META-INF/MANIFEST.MF
  created: descriptor/
  created: images/
  created: scripts/
 inflated: descriptor/service.sdl
 inflated: images/sushi.png
 inflated: scripts/control.sh

 

The META-INF directory is unexciting and packaged for the purposes of the JAR format, but the rest of the directories (descriptor, images, and scripts) define the CSD and how it operates.

The only required directory is descriptor, which holds the service.sdl file. That file contains the JSON defining the service, the roles that the service contains, any configuration parameters, commands that may be used on a service or role, service dependencies, and so on. It also declares which icon Cloudera Manager uses to represent the service; in this CSD, the icon is stored in the images directory.
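Since a CSD is an ordinary JAR, packaging one can be sketched with the standard jar tool. The snippet below is a minimal sketch, not the build process used for the Kiji CSD: the function names check_csd_layout and package_csd, and the idea of a staging directory, are hypothetical. It checks for the one required file, descriptor/service.sdl, before creating the archive.

```shell
#!/bin/bash
# Hypothetical helpers for packaging a CSD; the names are illustrative only.

# Succeeds only if the staging directory contains the required descriptor.
check_csd_layout() {
  local dir="$1"
  if [ ! -f "$dir/descriptor/service.sdl" ]; then
    echo "error: $dir/descriptor/service.sdl not found" >&2
    return 1
  fi
}

# Packages everything under the staging directory into a JAR.
# -C makes the paths inside the archive relative to the staging directory.
package_csd() {
  local dir="$1" out="$2"
  check_csd_layout "$dir" || return 1
  jar -cf "$out" -C "$dir" .
}
```

With a staging directory laid out like the listing above (here assumed to be ./kiji-csd), a call such as `package_csd ./kiji-csd KIJI-1.0.jar` would produce the JAR.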

The scripts directory holds the files and scripts needed to carry out role and service commands. In the Kiji CSD, control.sh is specified as the program for the startRunner of the KijiREST Server role, which defines how a KijiREST Server is started on the cluster.

"startRunner" : {
   "program" : "scripts/control.sh",
   "args" : [
      "start_rest"
   ],
   "environmentVariables" : {
      "REST_PORT" : "${rest_port}",
      "REST_ADMIN_PORT" : "${rest_admin_port}",
      "REST_LOGGING_DIR" : "${log_dir}",
      "FRESHEN" : "${freshen}",
      "FRESHENING_TIMEOUT" : "${freshening_timeout}",
      "KIJI_URI" : "${kiji_cluster_uri}"
   }
},

When KijiREST Server roles are assigned to hosts, the control.sh file will be transferred to those hosts. When a KijiREST Server role is started, the agent on the associated host will run the control.sh script with the configured arguments and environment variables. The script is just an arbitrary shell script, and the same concept can be used to design other service or role commands. For example, the HDFS service defines a command for formatting the filesystem when the service is initially created.
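To make the hand-off concrete, here is a minimal sketch of how a control script could consume the startRunner settings: the argument start_rest arrives as the first positional parameter, and the configured values arrive as environment variables. This is not the actual Kiji control.sh. The helper names are hypothetical, and KIJI_REST_LAUNCHER is an assumed variable standing in for whatever command really launches the server.

```shell
#!/bin/bash
# Hypothetical sketch of a CSD control script; NOT the actual Kiji control.sh.

# Fails if any of the named environment variables is unset or empty.
require_env() {
  local name value
  for name in "$@"; do
    # Look up the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "error: environment variable $name is not set" >&2
      return 1
    fi
  done
}

# Prints the settings received through the startRunner environmentVariables.
describe_settings() {
  echo "port=$REST_PORT admin_port=$REST_ADMIN_PORT uri=$KIJI_URI"
}

# Starts the server. KIJI_REST_LAUNCHER is an assumed variable naming the
# real launch command; it is not part of the descriptor excerpt.
start_rest() {
  require_env REST_PORT REST_ADMIN_PORT REST_LOGGING_DIR KIJI_URI \
    KIJI_REST_LAUNCHER || return 1
  mkdir -p "$REST_LOGGING_DIR" || return 1
  describe_settings
  # exec replaces this shell with the server, so whatever launched the
  # script ends up supervising the server process itself.
  exec "$KIJI_REST_LAUNCHER"
}

# Maps the first argument (taken from "args" in the descriptor) to a function.
main() {
  case "${1:-}" in
    start_rest) start_rest ;;
    *) echo "usage: $0 start_rest" >&2; return 1 ;;
  esac
}

# Dispatch only when arguments were given, so the file can also be sourced.
if [ "$#" -gt 0 ]; then
  main "$@"
fi
```

Dispatching on the first argument means one script can serve several commands: adding another role or service command would only need a new entry in the case statement and a matching args value in the descriptor.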

Looking Forward

The CSD framework opens up a lot of opportunities for organizations to use Cloudera Manager more heavily and more effectively. It opens the door for Cloudera Manager to be a full-stack application monitoring system, rather than just a system that tracks the health of application infrastructure. Personally, I’m looking forward to seeing where Cloudera goes with this framework, and I’m excited about the idea of someday being able to collect metrics for Kiji and chart them in Cloudera Manager, so that I can truly have a single pane of glass for application operations. If you’d like to learn more about how to build your own CSD, I’d strongly recommend looking at the CSD documentation and at the Apache Accumulo, Apache Spark, and Apache Sqoop CSDs from Cloudera.

Jonathan “Natty” Natkins is a Field Engineer at WibiData, working with customers to develop real-time personalization systems using Kiji. He’ll be speaking about building 360º views with Kiji at HBaseCon on May 5.

 
