Tag Archives: java

Configuration Parameters: What can you just ignore?

Categories: General Hadoop HDFS MapReduce

Configuring a Hadoop cluster is something akin to voodoo. There are a large number of variables in hadoop-default.xml that you can override in hadoop-site.xml. Some specify file paths on your system, but others adjust levers and knobs deep inside Hadoop’s guts. Unfortuately, there’s little or no documentation on how to set them well. Is there a single optimal configuration? Are there some settings that can just be “set to 11?”

Nigel's guitar goes to 11, but your cluster might not. At Cloudera,

Read more

Hadoop Metrics

Categories: Hadoop

Hadoop’s NameNode, SecondaryNameNode, DataNode, JobTracker, and TaskTracker daemons all expose runtime metrics. These are handy for monitoring and ad-hoc exploration of the system and provide a goldmine of historical data when debugging.

In this post, I’ll first talk about saving metrics to a file.  Then we’ll walk through some of the metrics data.  Finally, I’ll show you how to configure sending metrics to other systems and explore them with jconsole.

Read more

Database Access with Apache Hadoop

Categories: General Hadoop MapReduce

Editor’s note (added Nov. 9. 2013): Valuable data in an organization is often stored in relational database systems. To access that data, you could use external APIs as detailed in this blog post below, or you could use Apache Sqoop, an open source tool (packaged inside CDH) that allows users to import data from a relational database into Apache Hadoop for further processing. Sqoop can also export those results back to the database for consumption by other clients.

Read more

The Small Files Problem

Categories: General Hadoop

Small files are a big problem in Hadoop — or, at least, they are if the number of questions on the user list on this topic is anything to go by. In this post I’ll look at the problem, and examine some common solutions.

Problems with small files and HDFS

A small file is one which is significantly smaller than the HDFS block size (default 64MB). If you’re storing small files,

Read more