Hadoop’s NameNode, SecondaryNameNode, DataNode, JobTracker, and TaskTracker daemons all expose runtime metrics. These are handy for monitoring and ad-hoc exploration of the system and provide a goldmine of historical data when debugging.
In this post, I’ll first talk about saving metrics to a file. Then we’ll walk through some of the metrics data. Finally, I’ll show you how to configure sending metrics to other systems and explore them with jconsole.
Dumping metrics to a file
The simplest way to configure Hadoop metrics is to funnel them into a user-configurable file on the machine running the daemon. Metrics are organized into “contexts” (Hadoop currently uses “jvm”, “dfs”, “mapred”, and “rpc”), and each context is independently configured. Set up your conf/hadoop-metrics.properties to use FileContext like so:
# Configuration of the “dfs” context for file
dfs.class=org.apache.hadoop.metrics.file.FileContext
dfs.period=10
# You’ll want to change the path
dfs.fileName=/tmp/dfsmetrics.log

# Configuration of the “mapred” context for file
mapred.class=org.apache.hadoop.metrics.file.FileContext
mapred.period=10
mapred.fileName=/tmp/mrmetrics.log

# Configuration of the “jvm” context for file
jvm.class=org.apache.hadoop.metrics.file.FileContext
jvm.period=10
jvm.fileName=/tmp/jvmmetrics.log

# Configuration of the “rpc” context for file
rpc.class=org.apache.hadoop.metrics.file.FileContext
rpc.period=10
rpc.fileName=/tmp/rpcmetrics.log
With this configuration, Hadoop daemons will dump their metrics to text files at ten-second intervals. Once your daemons are up and running, you’ll start seeing those log files populated.
Exploring the metrics data
Let’s explore some of the data. If you want to dig right in, I’ve uploaded the metrics log files from a Hadoop cluster running locally on my machine.
The “jvm” context contains some basic stats from the JVM: memory usage, thread counts, garbage collection, etc. All Hadoop daemons use this context (the implementation is in the JvmMetrics class). Because I’m running all the daemons locally for this blog post, my “jvmmetrics.log” has lines from all the daemons.
jvm.metrics: hostName=doorstop.local, processName=SecondaryNameNode, sessionId=, gcCount=11, gcTimeMillis=53, logError=0, logFatal=0, logInfo=12, logWarn=3, memHeapCommittedM=7.4375, memHeapUsedM=4.345642, memNonHeapCommittedM=23.1875, memNonHeapUsedM=16.32586, threadsBlocked=0, threadsNew=0, threadsRunnable=5, threadsTerminated=0, threadsTimedWaiting=4, threadsWaiting=2
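Each line in these files follows the same “context.record: key=value, key=value, …” shape, which makes them easy to post-process. Here’s a minimal sketch in Python; the helper name parse_metrics_line is mine, not part of Hadoop:

```python
def parse_metrics_line(line):
    """Parse one Hadoop metrics file line of the form
    'context.recordName: key1=val1, key2=val2, ...' into
    a (record_name, fields_dict) pair. Values stay strings."""
    record, _, body = line.strip().partition(": ")
    fields = {}
    for pair in body.split(", "):
        key, _, value = pair.partition("=")
        fields[key] = value
    return record, fields

record, fields = parse_metrics_line(
    "jvm.metrics: hostName=doorstop.local, processName=SecondaryNameNode, "
    "gcCount=11, gcTimeMillis=53"
)
# record is "jvm.metrics"; fields["gcCount"] is "11"
```

From here it’s a short step to loading a whole log file and, say, plotting gcTimeMillis over time.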
In the “dfs” context, “dfs.FSNamesystem” contains information similar to what you’d expect from the NameNode status page: capacity, number of files, under-replicated blocks, etc.
The DataNode and NameNode daemons also spit out metrics summarizing how many operations they’ve performed in the last interval period.
JobTrackers and TaskTrackers use the “mapred” context to summarize their counters. This data is similar to what you’d find in the JobTracker’s status page.
mapred.tasktracker: hostName=doorstop.local, sessionId=, mapTaskSlots=2, maps_running=0, reduceTaskSlots=2, reduces_running=1, tasks_completed=10, tasks_failed_ping=0, tasks_failed_timeout=0
In this context you’ll also see transient per-job counter data (shuffle stats and job counters). Though most metrics persist as long as the process is running, per-job metrics are cleared out after the job completes.
Configuring other outputs
So far, we’ve been sending metrics data to a file, but, as we’ve seen, you can configure which class handles metrics updates. There aren’t many implementations of MetricsContext available, but a very useful one is GangliaContext. Specifying that class, along with a “servers” configuration option, sends your metrics to Ganglia 3.0. I recommend using real domain names (not “localhost”) in the “servers” field: in a cluster setting, Ganglia would otherwise record 127.0.0.1 as the machine reporting the metrics, which isn’t unique.
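For example, the “mapred” context could be pointed at a Ganglia collector like so in conf/hadoop-metrics.properties (the host name here is a placeholder; 8649 is Ganglia’s default port):

```properties
# Send mapred metrics to Ganglia instead of a file
mapred.class=org.apache.hadoop.metrics.ganglia.GangliaContext
mapred.period=10
# Use a real domain name, not localhost
mapred.servers=ganglia.example.com:8649
```

Each context you want in Ganglia gets the same three-line treatment.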
Accessing metrics via JMX
Some (but not all) of Hadoop’s metrics are also available via Java Management Extensions (JMX). The easiest way to see these is to fire up jconsole (which comes with your Java SDK) and point it to your daemon.
Then, browse the metrics!
NameNodes and DataNodes both implement “MBeans” to report the data to JMX. Hadoop’s RPC servers expose metrics to JMX as well. All the daemons except the SecondaryNameNode have RPC servers.
You’ll find that jconsole is easiest to use if you run it as the same user as the Hadoop daemons (or root), on the same machine as the Java processes you’re interested in. Hadoop’s default hadoop-env.sh already has the “-Dcom.sun.management.jmxremote” option to enable JMX. JMX can also be configured for remote access. Refer to Sun’s documentation for the full details, but, in short, place “monitorRole secretPassword” in a file called “jmxremote.password” (with permissions 600), and modify lines in hadoop-env.sh like so:
export HADOOP_OPTS="-Dcom.sun.management.jmxremote.authenticate=true -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.password.file=$HADOOP_CONF_DIR/jmxremote.password"
export HADOOP_NAMENODE_OPTS="-Dcom.sun.management.jmxremote $HADOOP_NAMENODE_OPTS -Dcom.sun.management.jmxremote.port=8004"
export HADOOP_SECONDARYNAMENODE_OPTS="-Dcom.sun.management.jmxremote $HADOOP_SECONDARYNAMENODE_OPTS -Dcom.sun.management.jmxremote.port=8005"
export HADOOP_DATANODE_OPTS="-Dcom.sun.management.jmxremote $HADOOP_DATANODE_OPTS -Dcom.sun.management.jmxremote.port=8006"
export HADOOP_BALANCER_OPTS="-Dcom.sun.management.jmxremote $HADOOP_BALANCER_OPTS -Dcom.sun.management.jmxremote.port=8007"
export HADOOP_JOBTRACKER_OPTS="-Dcom.sun.management.jmxremote $HADOOP_JOBTRACKER_OPTS -Dcom.sun.management.jmxremote.port=8008"
You’ll now be able to point jconsole to the appropriate port and authenticate.
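The password-file step described above can be scripted; this sketch assumes HADOOP_CONF_DIR is set (falling back to /tmp purely for illustration) and uses the placeholder role and password from the example:

```shell
# Create the JMX password file with the monitorRole entry.
# HADOOP_CONF_DIR is assumed to point at your Hadoop conf directory.
CONF_DIR="${HADOOP_CONF_DIR:-/tmp}"
echo "monitorRole secretPassword" > "$CONF_DIR/jmxremote.password"
# JMX refuses to start unless the file is readable only by its owner.
chmod 600 "$CONF_DIR/jmxremote.password"
```

Restart the daemons after editing hadoop-env.sh so the new options take effect.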
If you want to monitor your Hadoop deployment with Hyperic (plugin here) or Nagios, using JMX is an easy way to bridge the gap between Hadoop and your monitoring software. There are a fair number of command-line oriented JMX tools, including cmdline-jmxclient, Twiddle (part of JBoss), check_jmx (a Nagios plugin), and many others. Hadoop will soon have its own (HADOOP-4756), too!
Note: some metrics are collected by a periodic timer thread, which is disabled by default. To enable these metrics, use org.apache.hadoop.metrics.spi.NullContextWithUpdateThread as the metrics implementation in conf/hadoop-metrics.properties. — June 8, 2009
Behind the Scenes
Hadoop’s metrics are part of the org.apache.hadoop.metrics package. If you want to implement your own metrics handler, you’ll need to implement your own MetricsContext, typically extending AbstractMetricsContext. If you want to add new metrics, see NameNodeMetrics for an example.