Configuring Eclipse for Apache Hadoop Development (a screencast)

Categories: Data Ingestion General HDFS Training

Update (added 5/15/2013): The information below is dated; see this post for current instructions about configuring Eclipse for Hadoop contributions.

One of the perks of using Java is the availability of functional, cross-platform IDEs.  I use vim for my daily editing needs, but when it comes to navigating, debugging, and coding large Java projects, I fire up Eclipse.

Typically, when you’re developing MapReduce applications,

Read more

Hive and JobTracker Needed Logos…

Categories: Hadoop Hive

While working on a few things here, I wanted to add some links to launch Apache Hive and the Hadoop JobTracker. At first I considered just adding the links, but I found myself wanting a button of some sort: an icon for each of them. I didn’t want to just use the (awesomely cute) Apache Hadoop elephant logo, because these things are related to and part of Hadoop, but they aren’t Hadoop itself…

Read more

Cloudera’s Distribution for Apache Hadoop: Making Hadoop Easier for a Sysadmin

Categories: Hadoop

A few weeks ago we announced Cloudera’s Distribution for Apache Hadoop, and I want to spend some time showing how our distribution makes a sysadmin’s job a little easier.

Perhaps the most useful features in our distribution, at least for sysadmins, are RPM packages and init scripts.  RPMs are the standard way of installing software on Red Hat-family Linux distributions (RHEL, Fedora Core, CentOS).  They give sysadmins a one-command install,

Read more

Upcoming Functionality in “Fair Scheduler 2.0”

Categories: General Hadoop MapReduce

(guest blog post by Matei Zaharia)

As Hadoop clusters grow in size and data volume, it becomes more and more useful to share them between multiple users and to isolate these users from one another. If User 1 is running a ten-hour machine learning job, for example, this should not prevent User 2 from running a two-minute Hive query. In November, I blogged about how Hadoop 0.19 supports pluggable job schedulers,

Read more

Configuration Parameters: What can you just ignore?

Categories: General Hadoop HDFS MapReduce

Configuring a Hadoop cluster is something akin to voodoo. There are a large number of variables in hadoop-default.xml that you can override in hadoop-site.xml. Some specify file paths on your system, but others adjust levers and knobs deep inside Hadoop’s guts. Unfortunately, there’s little or no documentation on how to set them well. Is there a single optimal configuration? Are there some settings that can just be “set to 11”?

Nigel's guitar goes to 11, but your cluster might not. At Cloudera,

Read more