Top 10 Blog Posts of 2013

From Python, to ZooKeeper, to Impala, to Parquet, blog readers in 2013 were interested in a variety of topics.

Clouderans and guest authors from across the ecosystem (LinkedIn, Netflix, Concurrent, Etsy, Stripe, Databricks, Oracle, Tableau, Alteryx, Talend, Twitter, Dell, SFDC, Endgame, MicroStrategy, Hazy Research, Wibidata, StackIQ, ZoomData, Damballa, Mu Sigma) published prolifically on the Cloudera Developer blog in 2013, with more than 250 new posts, averaging roughly one per business day.

These were the most popular ones published in 2013:

  1. A Guide to Python Frameworks for Apache Hadoop (by Uri Laserson)
    Uri wrote the definitive guide on this subject, if its ongoing popularity is any indication (and it is).
  2. Algorithms Every Data Scientist Should Know: Reservoir Sampling (by Josh Wills)
    Few people are more likely to Know What Every Data Scientist Should Know than Josh.
  3. How-to: Create a CDH Cluster on Amazon EC2 via Cloudera Manager (by Emanuel Buzek)
    Deploying CDH to the AWS cloud is getting easier and easier – and now, it’s a supported platform for production workloads.
  4. How-to: Configure Eclipse for Hadoop Contributions (by Karthik Kambatla)
    Pragmatic guidance for a pragmatic topic.
  5. How-to: Use Apache ZooKeeper to Build Distributed Apps (and Why) (by Sean Mackrory)
    ZooKeeper is rapidly emerging from relative obscurity as distributed systems become the norm.
  6. Cloudera Impala 1.0: It’s Here, It’s Real, It’s Already the Standard for SQL on Hadoop (by Justin Erickson & Marcel Kornacker)
    The Impala roadmap is so bright, we’re all wearing shades. Since this post was published, yet more significant milestones have been met.
  7. How-to: Select the Right Hardware for Your New Hadoop Cluster (by Kevin O’Dell)
    Some of the most common questions we get involve hardware sizing. This post offers the state of the art in guidance.
  8. How-to: Analyze Twitter Data with Hue (by Romain Rigaux)
    Hue, the open source Web UI for Hadoop, used to be just a diamond in the rough — today, it’s a real gem.
  9. Cloudera ML: New Open Source Libraries and Tools for Data Scientists (by Josh Wills)
    Josh Wills describes the precursor project to Oryx. Still a good guide.
  10. Introducing Parquet: Efficient Columnar Storage for Hadoop (by multiple authors from Cloudera, Criteo, and Twitter)
    Parquet is bringing the performance benefits of columnar data representation to all Hadoop ecosystem projects. It all started with this.

We’re just getting started. Looks like 2014 will be even more exciting!

Justin Kestelyn is Cloudera’s developer outreach director.

2 Responses
  • Kirankumar / January 05, 2014 / 12:43 PM

    I need some explanation of the scenario below.
    Scenario:
    Suppose there is a file of 1,000 PB containing the complete records of every human being in the world. We transfer that file into HDFS (let's say replication_factor = 9 and block_size = 128 MB), and the file is divided into 'n' blocks.

    Suppose a client asks us to search for a particular person using some unique constraint (a key). Let's assume that the person's data is in the nth block.

    My question is: how will MapReduce work in this case? Will it read the nth block directly, or will it read from the first block through the nth? (See the sketch after this thread for the block arithmetic.)

  • Justin Kestelyn (@kestelyn) / January 06, 2014 / 9:09 AM

    Kirankumar,

    I suggest you post this question at community.cloudera.com.
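
One footnote on Kirankumar's scenario, since this question comes up often: roughly speaking, MapReduce creates one input split per HDFS block and runs map tasks over all splits in parallel, so without an index (as HBase provides, for example) a job scans every block rather than seeking directly to the nth one. For a sense of scale, here is a quick back-of-the-envelope sketch in Python; all figures are the commenter's hypotheticals, not real cluster-sizing guidance:

    # Back-of-the-envelope arithmetic for the hypothetical scenario above.
    # All figures come from the comment; none of this is real sizing advice.

    FILE_SIZE_BYTES = 1000 * 2**50       # 1,000 PB
    BLOCK_SIZE_BYTES = 128 * 2**20       # 128 MB HDFS block size
    REPLICATION_FACTOR = 9               # the scenario's replication factor

    # Number of HDFS blocks (ceiling division); by default, MapReduce
    # schedules roughly one map task per block.
    n_blocks = -(-FILE_SIZE_BYTES // BLOCK_SIZE_BYTES)

    # Raw disk footprint once every block is stored 9 times.
    raw_pb = FILE_SIZE_BYTES * REPLICATION_FACTOR / 2**50

    print("blocks (~ default map tasks): {:,}".format(n_blocks))      # 8,388,608,000
    print("raw storage with replication: {:,.0f} PB".format(raw_pb))  # 9,000 PB

In other words, a naive scan at that scale means billions of parallel map tasks; a lookup by a unique key is really a job for an indexed store, not a full MapReduce pass.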
