Schedule This! Strata + Hadoop World Speakers from Cloudera

Strata Conference + Hadoop World 2012 is just over a month away, so it’s time to start planning your schedule. Consider adding the tutorials, sessions, and keynotes below to your calendar! (Start times are always subject to change, of course.)

The ones listed below are led or co-led by Clouderans, but there is certainly a wide range of attractive choices beyond what you see here. We just want to ensure that you put these particular ones high on your consideration list.

Meet the Engineer: Jon Natkins

In this installment of “Meet the Engineers”, meet Jonathan Natkins, known as “Natty” to his friends and colleagues.

What do you do at Cloudera, and in which Apache project are you involved?

Exploring Compression for Hadoop: One DBA’s Story

This guest post comes to us courtesy of Gwen Shapira (@gwenshap), a database consultant for The Pythian Group (and an Oracle ACE Director).

Most western countries use street names and numbers to navigate inside cities. But in Japan, where I live now, very few streets have them.

What Do Real-Life Apache Hadoop Workloads Look Like?

Organizations in diverse industries have adopted Apache Hadoop-based systems for large-scale data processing. As a leading force in Hadoop development with customers in half of the Fortune 50 companies, Cloudera is in a unique position to characterize and compare real-life Hadoop workloads. Such insights are essential as developers, data scientists, and decision makers reflect on current use cases to anticipate technology trends.

Recently we collaborated with researchers at UC Berkeley to collect and analyze a set of Hadoop traces. These traces come from Cloudera customers in e-commerce, telecommunications, media, and retail (Table 1). Here I will explain a subset of the observations, and the thoughts they triggered about challenges and opportunities in the Hadoop ecosystem, both present and in the future.

Meet the Engineer: Aaron T. Myers

As I mentioned in my inaugural post last week, it’s important to shine a spotlight on the Cloudera engineers who have a hand in making the Hadoop projects run. It’s an obvious point, and yet an overlooked one, that a community is an aggregation of individual personalities who have diverse backgrounds and interests yet a shared passion for the group and its goals. As Jono Bacon puts it in his seminal 2009 book The Art of Community, “The building blocks of a community are its teams, and the material that makes these blocks are people.”

Cloudera Software Engineer Eli Collins on Apache Hadoop and CDH4

In June 2012, Eli Collins (@elicollins), from Cloudera’s Platforms team, led a session at QCon New York 2012 on the subject “Introducing Apache Hadoop: The Modern Data Operating System.” During the conference, the QCon team had an opportunity to interview Eli about several topics, including important things to know about CDH4, main differences between MapReduce 1.0 and 2.0, Hadoop use cases, and more. It’s a great primer for people who are relatively new to Hadoop.

You can catch the full interview (video and transcript versions) here.

CDH3 update 5 is now available

We are happy to announce the general availability of CDH3 update 5. This update is a maintenance release of the CDH3 platform and provides a considerable number of bug fixes and stability enhancements. Alongside these fixes, we have also included a few new features, the most notable of which are the following:

HttpFS for CDH3 – The Apache Hadoop FileSystem over HTTP

HttpFS is an HTTP gateway/proxy for Apache Hadoop FileSystem implementations. HttpFS comes with CDH4 and replaces HdfsProxy (which only provided read access). Its REST API is compatible with WebHDFS (which is included in CDH4 and the upcoming CDH3u5).

HttpFS is a proxy, so unlike WebHDFS it does not require that clients be able to access every machine in the cluster. This allows clients to access a cluster that is behind a firewall via the WebHDFS REST API. HttpFS also allows clients to access CDH3u4 clusters via the WebHDFS REST API.
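To make the "single gateway" point concrete, here is a minimal sketch of how a client might build a WebHDFS-style request URL aimed at an HttpFS server. The helper name `webhdfs_url`, the host `httpfs.example.com`, and the user `alice` are hypothetical; port 14000 is assumed because it is the usual HttpFS default, and your deployment may differ. The sketch only constructs the URL and makes no network call.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(host, port, path, op, user=None, **params):
    """Build a WebHDFS-style REST URL (hypothetical helper, for illustration).

    Because HttpFS exposes the same /webhdfs/v1 URL layout, only the host
    and port need to point at the gateway rather than at the NameNode.
    """
    query = {"op": op}
    if user is not None:
        # Pseudo-authentication parameter used by the WebHDFS REST API.
        query["user.name"] = user
    query.update(params)
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{urlencode(query)}"


# Assumed gateway address; the client talks to this one host only.
url = webhdfs_url("httpfs.example.com", 14000, "/user/alice", "LISTSTATUS", user="alice")
print(url)
```

A client behind a firewall would send an HTTP GET to this URL with any HTTP library; since every request goes through the gateway, no DataNode addresses need to be reachable from the client.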

NameNode Recovery Tools for the Hadoop Distributed File System

Warning: The procedure described below can cause data loss. Contact Cloudera Support before attempting it.

Most system administrators have had to deal with a bad hard disk at some point. One moment, the hard disk is a mechanical marvel; the next, it is an expensive paperweight.

Meet the Presenters: Aaron Myers from Cloudera and Suresh Srinivas from Hortonworks

This was originally posted on the Hadoop Summit 2012 blog.

Today’s “Meet the Presenters” interview features two speakers: Aaron Myers from Cloudera and Suresh Srinivas from Hortonworks. Aaron and Suresh will be presenting on HDFS NameNode High Availability, one of the hottest topics in the Apache Hadoop space today.

Question: Tell us about your current role and how you interact with Apache Hadoop?

