Hadoop Metrics

Hadoop’s NameNode, SecondaryNameNode, DataNode, JobTracker, and TaskTracker daemons all expose runtime metrics. These are handy for monitoring and ad-hoc exploration of the system and provide a goldmine of historical data when debugging.

In this post, I’ll first talk about saving metrics to a file.  Then we’ll walk through some of the metrics data.  Finally, I’ll show you how to configure sending metrics to other systems and explore them with jconsole.

Dumping metrics to a file

The simplest way to configure Hadoop metrics is to funnel them into a user-configurable file on the machine running the daemon.  Metrics are organized into “contexts” (Hadoop currently uses “jvm”, “dfs”, “mapred”, and “rpc”), and each context is configured independently.  Set up your conf/hadoop-metrics.properties to use FileContext like so:

# Configuration of the "dfs" context for file
dfs.class=org.apache.hadoop.metrics.file.FileContext
dfs.period=10
# You'll want to change the path
dfs.fileName=/tmp/dfsmetrics.log
# Configuration of the "mapred" context for file
mapred.class=org.apache.hadoop.metrics.file.FileContext
mapred.period=10
mapred.fileName=/tmp/mrmetrics.log
# Configuration of the "jvm" context for file
jvm.class=org.apache.hadoop.metrics.file.FileContext
jvm.period=10
jvm.fileName=/tmp/jvmmetrics.log
# Configuration of the "rpc" context for file
rpc.class=org.apache.hadoop.metrics.file.FileContext
rpc.period=10
rpc.fileName=/tmp/rpcmetrics.log

With this configuration, Hadoop daemons will dump their metrics to text files at ten-second intervals.  Once your daemons are up and running, you’ll start seeing those log files populated.
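Each line FileContext writes is a record name followed by comma-separated key=value pairs, which makes the logs easy to post-process. A minimal parser sketch in Python (the sample line is abbreviated from the jvm.metrics output shown below):

```python
def parse_metrics_line(line):
    """Split a FileContext log line into (record_name, {key: value})."""
    record, _, body = line.partition(": ")
    fields = {}
    for pair in body.split(", "):
        key, _, value = pair.partition("=")
        fields[key] = value
    return record, fields

record, fields = parse_metrics_line(
    "jvm.metrics: hostName=doorstop.local, processName=DataNode, gcCount=15"
)
print(record, fields["processName"], fields["gcCount"])
# → jvm.metrics DataNode 15
```

Note that the values come back as strings; convert to int or float as needed when graphing or aggregating.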

Exploring the metrics data

Let’s explore some of the data.  If you want to dig right in, I’ve uploaded the metrics log files from a Hadoop cluster running locally on my machine.

The “jvm” context contains some basic stats from the JVM: memory usage, thread counts, garbage collection, etc.  All Hadoop daemons use this context (the implementation is in the JvmMetrics class).  Because I’m running all the daemons locally for this blog post, my “jvmmetrics.log” has lines from all the daemons.

jvm.metrics: hostName=doorstop.local, processName=DataNode, sessionId=, gcCount=15, gcTimeMillis=58, logError=0, logFatal=0, logInfo=159, logWarn=0, memHeapCommittedM=7.4375, memHeapUsedM=5.5513763, memNonHeapCommittedM=23.1875, memNonHeapUsedM=16.977356, threadsBlocked=0, threadsNew=0, threadsRunnable=7, threadsTerminated=0, threadsTimedWaiting=8, threadsWaiting=6
jvm.metrics: hostName=doorstop.local, processName=SecondaryNameNode, sessionId=, gcCount=11, gcTimeMillis=53, logError=0, logFatal=0, logInfo=12, logWarn=3, memHeapCommittedM=7.4375, memHeapUsedM=4.345642, memNonHeapCommittedM=23.1875, memNonHeapUsedM=16.32586, threadsBlocked=0, threadsNew=0, threadsRunnable=5, threadsTerminated=0, threadsTimedWaiting=4, threadsWaiting=2
[...]

In the “dfs” context, “dfs.FSNamesystem” contains information similar to what you’d expect from the NameNode status page: capacity, number of files, under-replicated blocks, etc.

dfs.FSNamesystem: hostName=doorstop.local, sessionId=, BlocksTotal=44, CapacityRemainingGB=78, CapacityTotalGB=201, CapacityUsedGB=0, FilesTotal=60, PendingReplicationBlocks=0, ScheduledReplicationBlocks=0, TotalLoad=1, UnderReplicatedBlocks=44
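As an aside, in this sample every block is under-replicated (UnderReplicatedBlocks equals BlocksTotal), which is what you'd expect when a single local DataNode can't satisfy a replication factor greater than one. A quick sanity check over such a record (the field values here are hard-coded from the sample line above):

```python
# Values taken from the sample dfs.FSNamesystem line above.
fields = {"BlocksTotal": "44", "UnderReplicatedBlocks": "44",
          "CapacityRemainingGB": "78", "CapacityTotalGB": "201"}

under_replicated = int(fields["UnderReplicatedBlocks"]) / int(fields["BlocksTotal"])
remaining = int(fields["CapacityRemainingGB"]) / int(fields["CapacityTotalGB"])
print("under-replicated: %.0f%%, capacity remaining: %.0f%%"
      % (100 * under_replicated, 100 * remaining))
# → under-replicated: 100%, capacity remaining: 39%
```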

The DataNode and NameNode daemons also spit out metrics summarizing how many operations they’ve performed in the last interval period.

JobTrackers and TaskTrackers use the “mapred” context to summarize their counters. This data is similar to what you’d find in the JobTracker’s status page.

mapred.jobtracker: hostName=doorstop.local, sessionId=, jobs_completed=0, jobs_submitted=1, maps_completed=10, maps_launched=10, reduces_completed=0, reduces_launched=1
mapred.tasktracker: hostName=doorstop.local, sessionId=, mapTaskSlots=2, maps_running=0, reduceTaskSlots=2, reduces_running=1, tasks_completed=10, tasks_failed_ping=0, tasks_failed_timeout=0

In this context you’ll also see transient per-job counter data (shuffle stats and job counters). Though most metrics persist as long as the process is running, per-job metrics are cleared out after the job completes.

Configuring other outputs

So far we’ve been sending metrics data to a file, but, as we’ve seen, you can configure which class handles metrics updates. There aren’t many implementations of MetricsContext available, but a very useful one is GangliaContext.  Specifying that class, along with a “servers” configuration option, lets you send metrics to Ganglia 3.0.  I recommend using real domain names (not “localhost”) in the “servers” field, because, in a cluster setting, Ganglia will otherwise report 127.0.0.1 as the machine reporting these metrics, which isn’t unique.
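A hadoop-metrics.properties fragment for the “dfs” context might look like the following; the collector hostname is a placeholder for your own Ganglia setup (8649 is gmond's default port):

```
# Send "dfs" metrics to Ganglia 3.0; repeat for mapred, jvm, and rpc.
dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext
dfs.period=10
# Comma-separated host:port pairs of gmond daemons -- use real hostnames.
dfs.servers=ganglia-collector.example.com:8649
```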

Be aware (HADOOP-4675) that GangliaContext is incompatible with Ganglia 3.1 (the latest version).  See the Hadoop wiki page for more information.

Accessing metrics via JMX

Some (but not all) of Hadoop’s metrics are also available via Java Management Extensions (JMX).  The easiest way to see them is to fire up jconsole (which ships with your Java SDK) and point it at your daemon.

Then, browse the metrics!

NameNodes and DataNodes both implement “MBeans” to report the data to JMX. Hadoop’s RPC servers expose metrics to JMX as well.  All the daemons except the SecondaryNameNode have RPC servers.

You’ll find that jconsole is easiest if you run it as the same user as the Hadoop daemons (or root), on the same machine as the Java processes you’re interested in. Hadoop’s default hadoop-env.sh already includes the “-Dcom.sun.management.jmxremote” option to enable JMX.  JMX can also be configured for remote access.  Refer to Sun’s documentation for the full details, but, in short, place “monitorRole secretPassword” in a file called “jmxremote.password” (with permissions 600), and modify lines in hadoop-env.sh like so:

# Extra Java runtime options. Empty by default.
export HADOOP_OPTS="-Dcom.sun.management.jmxremote.authenticate=true -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.password.file=$HADOOP_CONF_DIR/jmxremote.password"
# Command specific options appended to HADOOP_OPTS when specified
export HADOOP_NAMENODE_OPTS="-Dcom.sun.management.jmxremote $HADOOP_NAMENODE_OPTS -Dcom.sun.management.jmxremote.port=8004"
export HADOOP_SECONDARYNAMENODE_OPTS="-Dcom.sun.management.jmxremote $HADOOP_SECONDARYNAMENODE_OPTS -Dcom.sun.management.jmxremote.port=8005"
export HADOOP_DATANODE_OPTS="-Dcom.sun.management.jmxremote $HADOOP_DATANODE_OPTS -Dcom.sun.management.jmxremote.port=8006"
export HADOOP_BALANCER_OPTS="-Dcom.sun.management.jmxremote $HADOOP_BALANCER_OPTS -Dcom.sun.management.jmxremote.port=8007"
export HADOOP_JOBTRACKER_OPTS="-Dcom.sun.management.jmxremote $HADOOP_JOBTRACKER_OPTS -Dcom.sun.management.jmxremote.port=8008"
export HADOOP_TASKTRACKER_OPTS="-Dcom.sun.management.jmxremote $HADOOP_TASKTRACKER_OPTS -Dcom.sun.management.jmxremote.port=8009"

You’ll now be able to point jconsole to the appropriate port and authenticate.

If you want to monitor your Hadoop deployment with Hyperic (plugin here) or Nagios, using JMX is an easy way to bridge the gap between Hadoop and your monitoring software. There are a fair number of command-line oriented JMX tools, including cmdline-jmxclient, Twiddle (part of JBoss), check_jmx (a Nagios plugin), and many others. Hadoop will soon have its own (HADOOP-4756), too!

Note: some metrics are collected by a periodic timer thread, which is disabled by default. To enable these metrics, use org.apache.hadoop.metrics.spi.NullContextWithUpdateThread as the metrics implementation in conf/hadoop-metrics.properties. — June 8, 2009
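For example, to have the “dfs” context collect metrics internally (so they show up via JMX) without emitting them anywhere, a fragment along these lines should work:

```
# Collect "dfs" metrics on a timer without writing them to any output.
dfs.class=org.apache.hadoop.metrics.spi.NullContextWithUpdateThread
dfs.period=10
```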

Behind the Scenes

Hadoop’s metrics are part of the org.apache.hadoop.metrics package.  If you want to implement your own metrics handler, you’ll need to implement your own MetricsContext, typically extending AbstractMetricsContext.  If you want to add new metrics, see NameNodeMetrics for an example.
