(guest blog post by Matei Zaharia)
As Hadoop clusters grow in size and data volume, it becomes increasingly useful to share them among multiple users and to isolate those users from one another. If User 1 is running a ten-hour machine learning job, for example, this should not prevent User 2 from running a two-minute Hive query. In November, I blogged about how Hadoop 0.19 supports pluggable job schedulers,
Configuring a Hadoop cluster is something akin to voodoo. There are a large number of variables in hadoop-default.xml that you can override in hadoop-site.xml. Some specify file paths on your system, but others adjust levers and knobs deep inside Hadoop’s guts. Unfortunately, there’s little or no documentation on how to set them well. Is there a single optimal configuration? Are there some settings that can just be “set to 11?”
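To make the override relationship concrete, here is a minimal sketch of how a value in hadoop-site.xml takes precedence over the same property in hadoop-default.xml. It is not Hadoop's own configuration loader: the helpers `read_properties` and `effective_config` are hypothetical names introduced for illustration, the two XML snippets are tiny stand-ins for the real files, and the sketch ignores details such as properties marked final.

```python
import xml.etree.ElementTree as ET

# Illustrative stand-ins for the two files; real files hold many more properties.
DEFAULT_XML = """<configuration>
  <property><name>dfs.replication</name><value>3</value></property>
  <property><name>io.sort.mb</name><value>100</value></property>
</configuration>"""

SITE_XML = """<configuration>
  <property><name>dfs.replication</name><value>2</value></property>
</configuration>"""


def read_properties(xml_text):
    """Parse <property><name>/<value> pairs into a dict (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    return {p.findtext("name"): p.findtext("value")
            for p in root.findall("property")}


def effective_config(default_xml, site_xml):
    """Start from the defaults, then let site values override them."""
    merged = read_properties(default_xml)
    merged.update(read_properties(site_xml))
    return merged


config = effective_config(DEFAULT_XML, SITE_XML)
for name in sorted(config):
    print(name, "=", config[name])
```

In this sketch only `dfs.replication` is overridden by the site file; `io.sort.mb` keeps its default because the site file does not mention it.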
One of the recurring themes we have heard while working with our customers and the community is that Apache Hadoop configuration and deployment is a pain. Oftentimes, Hadoop is the first truly distributed system that administrators encounter, and the problem is made worse by the lack of standardized packages and deployment tools. And some releases are buggy. And upgrades are hard. And the list goes on.
In order for Hadoop to truly disrupt the enterprise,
Hadoop’s NameNode, SecondaryNameNode, DataNode, JobTracker, and TaskTracker daemons all expose runtime metrics. These are handy for monitoring and ad-hoc exploration of the system and provide a goldmine of historical data when debugging.
In this post, I’ll first talk about saving metrics to a file. Then we’ll walk through some of the metrics data. Finally, I’ll show you how to configure sending metrics to other systems and explore them with jconsole.
Editor’s note (added Nov. 9, 2013): Valuable data in an organization is often stored in relational database systems. To access that data, you could use external APIs as detailed in this blog post below, or you could use Apache Sqoop, an open source tool (packaged inside CDH) that allows users to import data from a relational database into Apache Hadoop for further processing. Sqoop can also export those results back to the database for consumption by other clients.