The Top 10 Posts of 2014 from the Cloudera Engineering Blog


Our “Top 10” list of blog posts published during a calendar year is a crowd favorite (see the 2013 version here), in particular because it serves as informal, crowdsourced research about popular interests. Page views don’t lie, although skew for publishing date has to be taken into account: posts published earlier in the year clearly have pole position.

In 2014, strong interest in new components that bring real-time or near-real-time capabilities to the Apache Hadoop ecosystem is apparent. And we’re particularly proud that the most popular post was authored by a non-employee.

  1. How-to: Create a Simple Hadoop Cluster with VirtualBox
    by Christian Javet
    Explains how to set up a CDH-based Hadoop cluster in less than an hour using VirtualBox and Cloudera Manager.
  2. Why Apache Spark is a Crossover Hit for Data Scientists
    by Sean Owen
    An explanation of why Spark is a compelling multi-purpose platform for use cases that span investigative, as well as operational, analytics.
  3. How-to: Run a Simple Spark App in CDH 5
    by Sandy Ryza
    Helps you get started with Spark using a simple example.
  4. New SQL Choices in the Apache Hadoop Ecosystem: Why Impala Continues to Lead
    by Justin Erickson, Marcel Kornacker & Dileep Kumar
    Open benchmark testing of Impala 1.3 demonstrates performance leadership compared to alternatives (by 950% or more), while providing greater query throughput with a far smaller CPU footprint.
  5. Apache Kafka for Beginners
    by Gwen Shapira & Jeff Holoman
    When used in the right way and for the right use case, Kafka has unique attributes that make it a highly attractive option for data integration.
  6. Apache Hadoop YARN: Avoiding 6 Time-Consuming “Gotchas”
    by Jeff Bean
    Understanding some key differences between MR1 and MR2/YARN will make your migration much easier.
  7. Impala Performance Update: Now Reaching DBMS-Class Speed
    by Justin Erickson, Greg Rahn, Marcel Kornacker & Yanpei Chen
    As of release 1.1.1, Impala’s speed beat the fastest SQL-on-Hadoop alternatives, including a popular analytic DBMS running on its own proprietary data store.
  8. The Truth About MapReduce Performance on SSDs
    by Karthik Kambatla & Yanpei Chen
    It turns out that cost-per-performance, not cost-per-capacity, is the better metric for evaluating the true value of SSDs. (See the session on this topic at Strata+Hadoop World San Jose in Feb. 2015!)
  9. How-to: Translate from MapReduce to Spark
    by Sean Owen
    The key to getting the most out of Spark is to understand the differences between its RDD API and the original Mapper and Reducer API.
  10. How-to: Write and Run Apache Giraph Jobs on Hadoop
    by Mirko Kämpf
    Explains how to create a test environment for writing and testing Giraph jobs, or just for playing around with Giraph and small sample datasets.

Based on the above, a significant number of you are at least exploring Apache Spark as an eventual replacement for MapReduce, as well as tracking Impala’s progress as the standard analytic database for Apache Hadoop. What will next year bring, do you think? 

Justin Kestelyn is Cloudera’s developer outreach director.
