Capacity Planning with Cloudera Manager


Like many other systems administrators, you may be running a production Hadoop cluster, spec’ing one out, or just starting to investigate bringing Hadoop into your workplace. As anyone in one of these positions will tell you, one of the most important tasks you’ll face is capacity planning. With the release of Cloudera Manager 3.7, we’re bringing you a new set of tools to aid in this process. In this post, we’ll look at how you can use Cloudera Manager to handle some common scenarios that you might run into while planning a Hadoop cluster.

Questions and Patterns

How is my disk usage growing over time?

One very interesting disk usage pattern can be seen in Josh’s recent blog post on his analysis of drug interactions. Josh started with a relatively small data set of about one million records. However, during one stage of his analytic process, the number of records grew from one million to three trillion. Many types of analyses produce very large intermediate data sets, while the final output may be just a fraction of the intermediate data. The consequence is temporary spikes in disk usage, which you need to understand in order to plan a Hadoop deployment appropriately.

Maybe you want to understand the rate at which your data is growing. Perhaps a Flume installation is constantly streaming new files to HDFS, additional business units have expressed an interest in getting data into the cluster, or more users are running more jobs, resulting in more data landing on disk. It’s useful to be able to characterize the growth rate of your data within the cluster.

Using Cloudera Manager, you can view historical disk usage reports. A local maximum like the one Josh experienced is shown below, as well as data growing over time into a global maximum. Being able to visualize this growth makes it easy to determine how long your free disk space will last.

A historical disk usage report generated by Cloudera Manager
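To make the "how long will my free disk space last" question concrete, here is a minimal sketch of the arithmetic behind it. It assumes you have daily disk usage samples (for example, read off a historical report) and that growth is roughly linear. The function name `days_until_full` and the sample numbers are hypothetical and are not part of Cloudera Manager.

```python
def days_until_full(daily_used_tb, capacity_tb):
    """Estimate days until the cluster is full from daily usage samples.

    Fits a least-squares straight line to the samples (one per day) and
    extrapolates from the most recent sample to the given capacity.
    Returns None when usage is flat or shrinking.
    """
    n = len(daily_used_tb)
    if n < 2:
        raise ValueError("need at least two daily samples")

    days = range(n)
    mean_day = sum(days) / n
    mean_used = sum(daily_used_tb) / n

    # Least-squares slope: growth in TB per day.
    covariance = sum((d - mean_day) * (u - mean_used)
                     for d, u in zip(days, daily_used_tb))
    variance = sum((d - mean_day) ** 2 for d in days)
    growth_per_day = covariance / variance

    if growth_per_day <= 0:
        return None  # no growth trend to extrapolate

    remaining_tb = capacity_tb - daily_used_tb[-1]
    return remaining_tb / growth_per_day


# Hypothetical samples: five days of usage on a 100 TB cluster.
estimate = days_until_full([10, 12, 14, 16, 18], capacity_tb=100)
print(estimate)
```

A straight-line fit describes the long-term trend only. Temporary spikes from large intermediate data sets sit on top of that trend, so in practice you would also want to leave headroom for the largest spike you have observed.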

What does my disk usage look like right now?

A simpler, more common question that a sysadmin might ask is: “How much disk are we using, and who’s to blame?” Cloudera Manager provides a set of operational reports on the current state of your Hadoop cluster, showing where disk space is going in easy-to-digest bar charts. All it takes is a click of the mouse to pull up a report on how much disk space each user, group, or directory is using.

Cloudera Manager provides snapshots of the current disk usage for a cluster. Admins can see how much space is being used in terms of bytes in HDFS, raw bytes on physical disk (accounting for replication factors), and file counts. On the right-most chart, we can see that one of the users owns a very large number of files, which could bog down MapReduce jobs. Using this chart, it’s a simple task to identify the largest consumers of disk resources.

A report of current disk usage from Cloudera Manager
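The three measures in the report (bytes in HDFS, raw bytes on physical disk, and file counts) relate to each other in a simple way: each file's raw footprint is its HDFS size multiplied by its replication factor. The sketch below illustrates that relationship with a per-user rollup. The input format, a list of `(owner, size_in_bytes, replication)` tuples, and the function name `usage_by_user` are assumptions made for illustration, not how Cloudera Manager computes or stores its reports.

```python
from collections import defaultdict


def usage_by_user(files):
    """Roll up per-file records into per-user usage totals.

    `files` is an iterable of (owner, size_in_bytes, replication) tuples,
    a hypothetical input format chosen for this example.
    """
    totals = defaultdict(
        lambda: {"hdfs_bytes": 0, "raw_bytes": 0, "file_count": 0}
    )
    for owner, size, replication in files:
        entry = totals[owner]
        entry["hdfs_bytes"] += size                # logical size in HDFS
        entry["raw_bytes"] += size * replication   # space on physical disk
        entry["file_count"] += 1
    return dict(totals)


# Hypothetical data: two users with different replication factors.
report = usage_by_user([
    ("alice", 100, 3),
    ("alice", 50, 3),
    ("bob", 1000, 2),
])
for user, stats in sorted(report.items()):
    print(user, stats)
```

Keeping the file count alongside the byte totals matters because a user with little data spread over very many files can still be a problem, even though the byte charts alone would not show it.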

How is MapReduce being used?

At the other end of the spectrum, it’s important to understand the types of jobs that users are running, and which users are using more than their share of cluster resources for executing jobs. By looking at a chart like the one below, Hadoop administrators can get a quick view of which users are utilizing the cluster, how much reading and writing their jobs are doing, how many map and reduce slots their jobs are using, and how long their jobs have been running on the cluster, to name a few useful metrics.

A MapReduce usage report from Cloudera Manager

Cloudera Manager Will Help You

Using Cloudera Manager, you’ll be able to get snapshots of the file system, identify trends in data growth, and aggregate that information by the users, groups, or directories that interest you, so that you can quickly identify the biggest consumers of resources within the cluster. You’ll be able to discover how individual users are utilizing the cluster by looking at aggregated MapReduce usage statistics. If you need to do further number crunching, all it takes is a click to export your reports to CSV or Excel spreadsheets, making your data portable and easy to manipulate. If you find yourself asking questions like the ones outlined above, Cloudera Manager will help you be successful in building your cluster.
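As an example of the kind of further number crunching an exported report allows, the sketch below ranks the largest consumers in a CSV file. The column names `user` and `bytes`, the sample data, and the function name `top_consumers` are assumptions made for illustration. The actual column layout of an exported report may differ, so adjust the names to match your file.

```python
import csv
import io


def top_consumers(csv_text, key_column="user", value_column="bytes", limit=3):
    """Sum a numeric column per key and return the largest totals first.

    The default column names are hypothetical and should be changed to
    match the header row of the exported report.
    """
    totals = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = row[key_column]
        totals[key] = totals.get(key, 0) + int(row[value_column])
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return ranked[:limit]


# Hypothetical export contents.
sample = "user,bytes\nalice,300\nbob,1200\nalice,500\ncarol,100\n"
for user, total in top_consumers(sample):
    print(user, total)
```

To run this against a real export, read the file's contents into a string first and pass the appropriate column names.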