The Definitive "Getting Started" Tutorial for Apache Hadoop + Your Own Demo Cluster

Using this new tutorial alongside Cloudera Live is now the fastest, easiest, and most hands-on way to get started with Hadoop.

At Cloudera, developer enablement is one of our most important objectives. One only has to look at history (Java or SQL, for example) to see that knowledge fuels an ecosystem. That objective drives initiatives such as our community forums, the Cloudera QuickStart VM, and this blog itself.

Cloudera Live

Today, we are providing what we believe is a model for Hadoop developer enablement going forward: a definitive end-to-end tutorial, plus a free, cloud-based demo cluster with sample data for hands-on exercises, available through the Cloudera Live program.

When Cloudera Live was launched in April 2014, it initially contained a read-only environment where users could experiment with CDH, our open source platform containing the Hadoop stack, for a few hours. Today, we are launching a new interactive version (hosted by GoGrid) in which you can use pre-loaded datasets or your own data, and which is available to you for free for two weeks. Furthermore, the environment is available in two other flavors—with Tableau or Zoomdata included—so you can test-drive CDH and Cloudera Manager alongside familiar BI tools, too.

Now, back to that tutorial:

To There and Back

Most Hadoop tutorials take a piecemeal approach: they focus on one or two components, or at best on one segment of the end-to-end process (just data ingestion, just batch processing, or just analytics). Furthermore, few if any provide a business context that makes the exercise practical.

This new tutorial closes both gaps. It takes the reader through the complete Hadoop data lifecycle, from data ingestion through interactive data discovery, and keeps the underlying business questions in view throughout: Which products do customers view on the Web, which do they like to buy, and is there a relationship between the two?

Organizations with traditional infrastructure have been answering questions like these for years. However, those that have adopted Hadoop do the same thing at greater scale, at lower cost, and on the same storage substrate (that is, with no ETL) on which many other types of analysis can be done.

To learn how to do that, in this tutorial (and assuming you are using our sample dataset) you will:

  • Load relational and clickstream data into HDFS (via Apache Sqoop and Apache Flume respectively)
  • Use Apache Avro to serialize/prepare that data for analysis
  • Create Apache Hive tables
  • Query those tables using Hive or Impala (via the Hue GUI)
  • Index the clickstream data using Flume, Cloudera Search, and Morphlines, and expose a search GUI for business users/analysts
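
The comparison these steps build toward can be pictured at toy scale in plain Python. This is only a sketch under stated assumptions: the order tuples, the click lines, and the /product/<name> URL pattern are invented for illustration and are not the tutorial's sample dataset, and nothing here calls Sqoop, Flume, Hive, or Impala. It counts views per product from click lines, counts purchases per product from order records, and lines the two up side by side.

```python
import re
from collections import Counter

# Hypothetical stand-ins for the two data sources (NOT the tutorial's dataset).
# Each order is (order_id, product_name); each click is a simplified log line.
ORDERS = [
    ("order-1", "tent"),
    ("order-2", "tent"),
    ("order-3", "kayak"),
]
CLICKS = [
    "GET /product/tent",
    "GET /product/kayak",
    "GET /product/kayak",
    "GET /product/kayak",
    "GET /home",
]

# Assumed URL layout for product pages; real clickstream logs will differ.
PRODUCT_PATH = re.compile(r"^GET /product/(?P<name>[\w-]+)$")


def count_views(click_lines):
    """Count product page views, ignoring lines that are not product pages."""
    views = Counter()
    for line in click_lines:
        match = PRODUCT_PATH.match(line)
        if match:
            views[match.group("name")] += 1
    return views


def count_purchases(orders):
    """Count how many orders mention each product."""
    return Counter(product for _order_id, product in orders)


def views_vs_purchases(click_lines, orders):
    """Return (product, views, purchases) rows, sorted by product name."""
    views = count_views(click_lines)
    purchases = count_purchases(orders)
    products = sorted(set(views) | set(purchases))
    # Counter returns 0 for products missing from one side.
    return [(p, views[p], purchases[p]) for p in products]


if __name__ == "__main__":
    for product, viewed, bought in views_vs_purchases(CLICKS, ORDERS):
        print(f"{product}: {viewed} views, {bought} purchases")
```

In the tutorial itself, the same question is asked with queries over Hive tables rather than in-memory lists; the sketch only illustrates the shape of the result, where a product that is viewed often but bought rarely stands out.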

Go Live

We think that even on its own, this tutorial will be a huge help to developers of all skill levels—and with Cloudera Live in the mix as a demo backend for doing the hands-on exercises, it’s almost irresistible.

If you have any comments or encounter a roadblock, let us know about it in this discussion forum.

Justin Kestelyn is Cloudera’s developer outreach director.