Cloudera Engineering Blog · Community Posts

Congrats to OSCON 2013 Speakers!

Cloudera will be a proud exhibitor at O’Reilly OSCON 2013 (July 22-26 in Portland, OR), which in our opinion is a shining light in the open source community. So be sure to look for us at Booth #420!

We Honor the Champions of Big Data!

In the technology business, building a thriving and progressive user ecosystem around a platform is about as Mom-and-apple-pie as you can get. We all intuitively acknowledge that it’s one of the metrics for success.

Seven Thoughts on Hadoop’s Seventh Birthday

On this special April 1 – the seven-year anniversary of the Apache Hadoop project’s first release – Hadoop founder Doug Cutting (also Cloudera’s chief architect and the Apache Software Foundation chair) offers seven thoughts on Hadoop:

  1. Open source accelerates adoption.

    If Hadoop had been created as proprietary software it would not have spread as rapidly. We’ve seen incredible growth in the use of Hadoop. Partly that’s because it’s useful. But many would have been cautious to make a vendor-controlled platform part of their infrastructure, useful or not.

  2. Apache builds collaborative communities.

Cloudera is the Top Big Data Influencer in Social Media

Thanks to our friends at KDNuggets for pointing out that Cloudera is the top influencer in the “Big Data” area, according to social media measurement service Klout – with a Klout Score of 81. (Klout is also a CDH user!)

Meet the HBaseCon 2013 Program Committee

With HBaseCon 2013 (Early Bird registration now open!) preparations in full swing, you may be interested in learning a bit about the personalities behind the Program Committee, who are tasked with formulating a compelling, community-focused agenda. 

Recently I had a chance to ask committee members Gary Helmling (Twitter), Lars Hofhansl, Jon Hsieh (Cloudera), Doug Meil (Explorys), Andrew Purtell (Intel), Enis Söztutar (Hortonworks), Michael Stack (Cloudera), and Liyin Tang (Facebook) a few questions:

Meet the Engineer: Mark Grover

In this installment, meet Cloudera Software Engineer/Apache Bigtop Committer Mark Grover (@mark_grover).

Cloudera ML: New Open Source Libraries and Tools for Data Scientists

Editor’s note (12/19/2013): Cloudera ML has been merged into the Oryx project. The information below is still valid though.

Last month, Apache Crunch became the fifth project (along with Sqoop, Flume, Bigtop, and MRUnit) to go from Cloudera’s github repository through the Apache Incubator and on to graduate as a top-level project within the Apache Software Foundation. As the founder of the project and a newly minted Apache VP, I wanted to take this opportunity to express my gratitude to the Crunch community, who have taught me that leadership in the Apache Way means service, humility, and investing more time in building a community than I spend writing code. Working with you all on our shared vision is the highlight of every work week.

Creating Analytical Applications with Crunch: Cloudera ML

Cloudera Speakers at Hadoop Summit Europe

Hadoop Summit Europe is coming up in Amsterdam next week, so this is an appropriate time to make you aware of the Cloudera speaker program there (all three talks on Thursday, March 21):

What the Hack! The Story of the Cloudera Hackathon

Every growing, dynamic engineering culture needs a hackathon every once in a while. 

Earlier this week, Cloudera put that thought into action with a two-day, around-the-clock “What the Hack!” internal hackathon in our Palo Alto offices, with our friends from Accel Partners underwriting the omnipresent food and beverage (thanks!). The carrot: “Fun surprise awards, and most important, the rights to brag about your cool hacking ideas.”

Open Source, Flattery, and The Platform for Big Data

It has been a busy time for announcements coinciding with this week’s Strata conference. There’s no corner of the technology world that has not embraced Apache Hadoop as the new platform for big data.  Apache Hadoop began as a telegram from the future from Google, turned into real software by Doug Cutting while on a freelance assignment. While Hadoop’s origins are surprising, its ongoing popularity is not – open source has been a major contributing factor to Hadoop’s current ubiquity. Easy to trial, fast to evolve, inexpensive to own: open source makes a compelling case for itself.

From the founding of the company, Cloudera recognized the importance of Apache open source to Hadoop’s continued evolution. We’re now entering our fifth year of shipping a 100% open source platform. Every significant advance we have added to the platform has stayed consistent with our open source strategy. In the process Cloudera has now sponsored the development of seven new open source projects: Apache Flume, Apache Sqoop, Apache Bigtop, Apache MRUnit, Cloudera Hue, Apache Crunch, and most recently, Cloudera Impala. Acknowledging the maxim “innovation happens elsewhere,” we’ve also managed to convince the founders and/or PMC chairs of Apache Hadoop, Apache Oozie, Apache ZooKeeper, and Apache HBase to come join Cloudera.

Apache Pig: It Goes to 0.11

This blog post was originally published elsewhere and is republished here for your convenience by permission of its author, Pig Committer Dmitriy Ryaboy.

After months of work, we are happy to announce the 0.11 release of Apache Pig. In this blog post, we highlight some of the major new features and performance improvements that were contributed to this release. A large chunk of the new features was created by Google Summer of Code (GSoC) students with supervision from the Apache Pig PMC, while the core Pig team focused on performance improvements, usability issues, and bug fixes. We encourage CS students to consider applying for GSoC in 2013 – it’s a great way to contribute to open source software.

Call for Speakers and Early Bird Registration: HBaseCon 2013

(Added Feb. 25 2013: Early Bird registration is now open – closes April 23, 2013!)

HBaseCon 2012 was such a stunning success – blowing past all expectations about attendance – that we want to double down on the joy in 2013: The HBaseCon 2013 Call for Speakers is now open!

Where to Find Cloudera Tech Talks in Early 2013

Clouderans are traveling the United States (and beyond) in droves during the first quarter of 2013 to present at developer meetups and conferences. If you’re interested in attending one near you, we’ve listed them below – see links for specific topics (but note that some of the sites involved may not reflect complete event details yet; check back later for updates).

Let me point out in particular that:

Data Hacking Day with Cloudera (Feb. 25, Palo Alto)

(Update 2/6/2013 – Sorry, this event is sold out!)

With Strata Conference 2013 coming to town (Feb. 26-28, in Santa Clara, Calif.), we thought it would be a great opportunity to open our Palo Alto office’s doors for a pre-conference “Data Hacking Day” on Monday, Feb. 25!

Cloudera Speakers at ApacheCon NA 2013

Our hearty congratulations to the Cloudera engineers who have been accepted as ApacheCon NA 2013 (Feb. 26-28 in Portland, OR) speakers for these talks:

How to Contribute to Apache Hadoop Projects, in 24 Minutes

So, you want to report a bug, propose a new feature, or contribute code or doc to Apache Hadoop (or a related project), but you don’t know what to do and where to start? Don’t worry, you’re not alone.

Let us help: in this 24-minute screencast, Clouderan Jeff Bean (@jwfbean) offers a step-by-step tutorial that explains why and how to contribute. Apache JIRA ninjas need not view, but anyone else with slight (or less) familiarity with that curious beast will find this information very helpful. 

Apache HBase AssignmentManager Improvements

AssignmentManager is a module in the Apache HBase Master that manages the assignment of regions to RegionServers. (See HBase architecture for more information.) It ensures that all regions are assigned and that each region is assigned to just one RegionServer.

Although the AssignmentManager generally does a good job, the existing implementation does not handle assignments as well as it could. For example, a region could be assigned to two or more RegionServers, some regions could get stuck in transition and never be assigned, or unknown-region exceptions could be thrown when moving a region from one RegionServer to another.
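
The invariant described above (every region assigned, and each to exactly one RegionServer) can be illustrated with a small consistency check. This is only a minimal sketch in Python, not HBase's actual AssignmentManager code (HBase itself is written in Java); the function name `find_assignment_violations` and the region and server names are hypothetical.

```python
from collections import defaultdict


def find_assignment_violations(regions, server_to_regions):
    """Return (unassigned, multiply_assigned) for a cluster snapshot.

    regions: iterable of all region names that should be online.
    server_to_regions: dict mapping a server name to the regions it hosts.
    Both structures are hypothetical stand-ins for real cluster state.
    """
    # Invert the mapping: for each region, collect the servers hosting it.
    hosts = defaultdict(set)
    for server, hosted in server_to_regions.items():
        for region in hosted:
            hosts[region].add(server)

    # A region with no host violates "all regions are assigned".
    unassigned = sorted(r for r in regions if r not in hosts)
    # A region with several hosts violates "just one RegionServer".
    multiply_assigned = {
        r: sorted(servers) for r, servers in hosts.items() if len(servers) > 1
    }
    return unassigned, multiply_assigned


# Example snapshot: region-b is double-assigned, region-c is unassigned.
snapshot = {"rs1": ["region-a", "region-b"], "rs2": ["region-b"]}
unassigned, doubled = find_assignment_violations(
    ["region-a", "region-b", "region-c"], snapshot
)
print(unassigned)
print(doubled)
```

A check like this only detects violations in a static snapshot; it says nothing about regions that are legitimately in transition, which is the harder part of the problem the post goes on to discuss.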

Apache ZooKeeper 3.4.5 Has Been Released

Apache ZooKeeper release 3.4.5 is now available. This is a bug fix release covering 3 issues, one of which was considered critical. These issues were:

Dive Into Cloudera Impala at a Meetup Near You

Since the Cloudera Impala announcement a few weeks ago, we’ve been busy partnering up with Hadoop meetups around the country (and beyond) to bring Impala tech talks directly to the community. Here’s the list for the remainder of 2012, thus far:

Mike Olson at FutureBI Meetup (Berkeley, Nov. 6)

The FutureBI meetup is excitedly preparing to host Cloudera CEO Mike Olson at its upcoming meetup on Nov. 6 at the Berkeley School of Information, where he’ll be joined by SiSense VP of Marketing Bruno Aziza in a conversation titled “The Future of Big Data: We depend on you!” The event will be open to the public as well as to Berkeley students and faculty.

Questions are encouraged during this informal and interactive session, which will cover topics ranging from the evolution of open source software to the changing entrepreneurial ecosystem in the Bay Area and the likely future of information management. Founders, geeks, and tech industry professionals are all welcome to attend and join the discussion!

Your Guide to Cloudera @ Strata + Hadoop World This Week

Cloudera is co-presenting the sold-out Strata Conference + Hadoop World in New York this week, and if you’re an attendee, you have a great week ahead!

Here’s a quick guide to where you can find Clouderans during the conference. There are of course many other great activities planned as well that are not covered here.


Apache Hadoop 2.0.2-alpha Released

Earlier this month the Apache Hadoop PMC released Apache Hadoop 2.0.2-alpha, which fixes over 600 issues since the previous release in the 2.0 series, 2.0.1-alpha, back in July. This is a tremendous rate of development, of which all contributors to the project should feel proud.

Some of the more noteworthy changes in this release include:

HBase at ApacheCon Europe 2012

Apache HBase will have a notable profile at ApacheCon Europe next month. Clouderan and HBase committer Lars George has two sessions on the schedule:

Meet the Engineer: Todd Lipcon


In this installment of “Meet the Engineers”, meet Todd Lipcon (@tlipcon), PMC member/committer for the Hadoop, HBase, and Thrift projects.

New Additions to the Apache HBase Team

StumbleUpon (SU) and Cloudera have signed a technology collaboration agreement. Cloudera will support the SU clusters, and in exchange, Cloudera will have access to a variety of production deploys on which to study and try out beta software.

As part of the agreement, the StumbleUpon Apache HBase+Apache Hadoop team — Jean-Daniel Cryans, Elliott Clark and I — have joined Cloudera. From our new perch up in the Cloudera San Francisco office — 10 blocks north and 11 floors up — we will continue as first-level support for SU clusters, tending and optimizing them as we have always done. The rest of our time will be spent helping develop Apache HBase as the newest additions to Cloudera’s HBase team.

Apache Hadoop Wins Duke’s Choice Award, is a Java Ecosystem “MVP”

For those of you new to it, the Duke’s Choice Awards program was initiated by Sun Microsystems in 2002 in an effort to “celebrate extreme innovation in the world of Java technology” – in essence, it’s the “MVP” of the Java ecosystem. Since it acquired Sun in 2009, Oracle has continued the tradition of bestowing the award, and in fact has made the process more community-oriented by accepting nominations from the public and involving Java User Groups in the judging effort.

For the 2012 awards, I’m happy to report that Apache Hadoop is among the awardees – which also include the United Nations High Commissioner for Refugees, Liquid Robotics, and Java cloud company Jelastic Inc., among others.

Schedule This! Strata + Hadoop World Speakers from Cloudera

We’re getting really close to Strata Conference + Hadoop World 2012 (just over a month away), schedule planning-wise. So you may want to consider adding the tutorials, sessions, and keynotes below to your calendar! (Start times are always subject to change of course.)

The ones listed below are led or co-led by Clouderans, but there is certainly a wide range of attractive choices beyond what you see here. We just want to ensure that you put these particular ones high on your consideration list.

Community Meetups at Strata + Hadoop World 2012

Strata Conference + Hadoop World (Oct. 23-25 in New York City) is a bonanza for Hadoop and big data enthusiasts – but not only because of the technical sessions and tutorials. It’s also an important gathering place for the developer community, most of whom are eager to share info from their experiences in the “trenches”.

Just to make that process easier, Cloudera is teaming up with local meetups during that week to organize a series of meetings on a variety of topics. (If for no other reason, stop into one of these meetups for a chance to grab a coveted Cloudera t-shirt.)

The Action on "HBase in Action"

Apache HBase junkies, this one’s for you: I had an opportunity recently for a quick chat with the authors of HBase in Action (Manning Publications – download sample chapter PDF), by Nick Dimiduk and Cloudera’s Amandeep Khurana.

Meet the Engineer: Aaron T. Myers

As I mentioned in my inaugural post last week, it’s important to shine a spotlight on the Cloudera engineers who have a hand in making the Hadoop projects run. It’s an obvious point, and yet an overlooked one, that a community is an aggregation of individual personalities who have diverse backgrounds and interests yet a shared passion for the group and its goals. As Jono Bacon puts it in his seminal 2009 book The Art of Community, “The building blocks of a community are its teams, and the material that makes these blocks are people.”

Process a Million Songs with Apache Pig

The following is a guest post kindly offered by Adam Kawa, a 26-year-old Hadoop developer from Warsaw, Poland. This post was originally published in a slightly different form at his blog, Hakuna MapData!

Recently I found an interesting dataset called the Million Song Dataset (MSD), which contains detailed acoustic and contextual data about a million songs. For each song we can find information like title, hotness, tempo, duration, danceability, and loudness as well as artist name, popularity, location (latitude and longitude pair), and many other things. There are no music files included here, but the links to MP3 song previews can be easily constructed from the data.

Developer Community Outreach from Cloudera: Better, Faster, More

Hello World: This is my first post as the new guy facilitating and coordinating developer community outreach for Cloudera. I am extremely excited to become a new node in the Apache Hadoop ecosystem and involved in the global, thriving community of developers coding against Hadoop and its related projects.

My most recent experience involves deep interest in the Java ecosystem – which achieved ubiquity not only for technical reasons, but also thanks to a vast group of passionate users who committed themselves to Java’s adoption. There are many lessons there for those of us who want to see the Hadoop edition meet that standard.

CDH3 update 5 is now available

We are happy to announce the general availability of CDH3 update 5. This update is a maintenance release of the CDH3 platform and provides a considerable number of bug fixes and stability enhancements. Alongside these fixes, we have also included a few new features, the most notable of which are the following:

Watching the Clock: Cloudera’s Response to Leap Second Troubles

At 5 pm PDT on June 30, a leap second was added to Coordinated Universal Time (UTC). Within an hour, Cloudera Support started receiving reports of systems running at 100% CPU utilization. The Support Team worked quickly to understand and diagnose the problem and soon published a solution. Bugs due to the leap second coupled with the Amazon Web Services outage would make this Cloudera’s busiest support weekend to date.

Since Hadoop is written in Java and closely interoperates with the underlying OS, Cloudera Support troubleshoots not only all 17 components in the Hadoop ecosystem, but also any underlying Linux and Java bugs. Last weekend many of our customers were affected by the now infamous “leap second” bugs. Initially, many assumed that Java and Linux would process the leap second gracefully. However, we soon discovered that this wasn’t the case, and depending on the version of Linux being used, we observed several distinct issues.


The Apache Hadoop Ecosystem, Visualized in Datameer

This is a guest re-post from Datameer’s Director of Marketing, Rich Taylor. The original post can be found on the Datameer blog.

Datameer uses D3.js to power our Business Infographic™ designer. I thought I would show how we visualized the Apache Hadoop ecosystem connections, first using only D3.js and then using Datameer 2.0.

A Big Thank You to All Who Participated In Making HBaseCon and the HBase Hack-a-thon A Success

HBaseCon 2012 summation provided by Michael Stack, PMC Chair of the Apache HBase Project. HBase Hack-a-thon summation provided by David Wang, Engineering Manager for the Cloudera HBase team.

HBaseCon 2012 Summation

The Elephant in the Enterprise

On Tuesday, June 12th, The Churchill Club of Silicon Valley hosted a panel discussion on Hadoop’s evolution from an open-source project to a standard component of today’s enterprise computing fabric. The lively and dynamic discussion was moderated by Cade Metz, Editor, Wired Enterprise.

Panelists included:

Michael Driscoll, CEO, Metamarkets
Andrew Mendelsohn, SVP, Oracle Server Technologies
Mike Olson, CEO, Cloudera
Jay Parikh, VP Infrastructure Engineering, Facebook
John Schroeder, CEO, MapR

Meet the Presenter: Todd Lipcon

Today’s interview features Todd Lipcon, software engineer for Cloudera. Todd will be presenting Optimizing MapReduce Job Performance at Hadoop Summit.

Question: Tell us about your current role and how you interact with Apache Hadoop?

Todd: I’m a software engineer on Cloudera’s platform engineering team, where I spend most of my time contributing code to open source projects like Apache Hadoop and Apache HBase. Most recently I’ve been implementing the automatic HA failover feature in Hadoop 2.0, but I’ve also spent a lot of time working on understanding and improving performance of the Hadoop stack.

Question: Tell us about your Hadoop Summit presentation?

Apache MRUnit 0.9.0-incubating has been released!

This post was originally published on the Apache Software Foundation’s blog.

We (the Apache MRUnit team) have just released Apache MRUnit 0.9.0-incubating (tarball, nexus, javadoc). Apache MRUnit, an Apache Incubator project, is a Java library that helps developers unit test Apache Hadoop MapReduce jobs. Unit testing is a technique for improving project quality and reducing overall costs by writing a small amount of code that automatically verifies that the software you write performs as intended. This is considered a best practice in software development since it helps identify defects early, before they’re deployed to a production system.
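
To make the idea concrete, here is a minimal sketch of unit testing a map function in isolation. It is written in plain Python rather than Java and does not use the MRUnit API; the `word_count_map` function and its test are hypothetical, and are meant only to illustrate the style of test that a library like MRUnit supports for real Hadoop mappers.

```python
def word_count_map(offset, line):
    """Hypothetical mapper: emit a (word, 1) pair for every word in a line."""
    for word in line.lower().split():
        yield word, 1


def test_word_count_map():
    # Feed one known input record and compare against the exact expected output.
    emitted = list(word_count_map(0, "Hadoop loves hadoop"))
    assert emitted == [("hadoop", 1), ("loves", 1), ("hadoop", 1)]


# Running the test needs no cluster: the mapper is an ordinary function.
test_word_count_map()
```

The point of the pattern is that the map logic is exercised with a single in-memory record, so a defect shows up in seconds on a developer machine rather than after a job has been submitted.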

HBaseCon 2012: A Glimpse into the Operations Track

HBaseCon 2012 is only a month away! The conference takes place May 22 in San Francisco, California and the event is poised to sell out.

HBaseCon 2012: A Glimpse into the Development Track

HBaseCon 2012 is nearly a month away, and if the conference agenda and attendee registration numbers are good indicators, this will be an annual event you won’t want to miss.

Apache HBase is an open source software project that provides users with the ability to do real-time random read/write access to their data in Apache Hadoop. This means that when you want to use Hadoop for real-time data processing, HBase is the project you are looking for. The HBase developer community includes contributors from many organizations such as StumbleUpon, Facebook, TrendMicro, eBay, Explorys, Huawei and Cloudera. In fact, the HBaseCon Program Committee, constructors of the HBaseCon 2012 agenda, are all committers and PMC members of the Apache HBase project.

HBaseCon 2012: A Glimpse into the Applications Track

Apache Bigtop 0.3.0 (incubating) has been released

Apache Bigtop 0.3.0 (incubating) is now available. This is the first fully integrated, community-driven, 100% Apache Big Data management distribution based on Apache Hadoop 1.0. In addition to a major change in the Hadoop version, all of the Hadoop ecosystem components have been upgraded to the latest stable versions and thoroughly tested:

High Availability for the Hadoop Distributed File System (HDFS)


Apache Hadoop consists of two primary components: HDFS and MapReduce. HDFS, the Hadoop Distributed File System, is the primary storage system of Hadoop, and is responsible for storing and serving all data stored in Hadoop. MapReduce is a distributed processing framework designed to operate on data stored in HDFS.

HDFS has long been considered a highly reliable file system.  An empirical study done at Yahoo! concluded that across Yahoo!’s 20,000 nodes running Apache Hadoop in 10 different clusters in 2009, HDFS lost only 650 blocks out of 329 million total blocks. The vast majority of these lost blocks were due to a handful of bugs which have long since been fixed.
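
For a sense of scale, the loss rate implied by those figures can be worked out directly. The sketch below is only illustrative arithmetic; it treats “329 million” as an exact count, which is an assumption, since the figure quoted above is rounded.

```python
# Figures quoted from the Yahoo! study; 329 million is treated as exact here.
lost_blocks = 650
total_blocks = 329_000_000

loss_fraction = lost_blocks / total_blocks
blocks_per_loss = total_blocks / lost_blocks

print(f"fraction of blocks lost: {loss_fraction:.2e}")
print(f"about one block lost per {blocks_per_loss:,.0f} blocks")
```

Under that assumption the result is roughly two lost blocks per million, or about one lost block for every half a million stored.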

January 2012 Bay Area HBase User Group meetup summary + HBaseCon announcement

More than 150 people attended the San Francisco Bay Area HBase User Group meetup last Thursday, January 19th, at eBay headquarters in San Jose, California.  Presenters from StumbleUpon, Facebook, eBay and MapR shared a wealth of information about Apache HBase operations and optimizations, gleaned from their experience running HBase in production environments.

One special item of note: Michael Stack announced HBaseCon 2012, taking place this spring in the Bay Area.  This inaugural conference will focus on the growth and education of the HBase community.  While details of the event are not yet published, the call for speakers is currently open.  Submit your abstract here.

Hadoop World 2011 Videos and Slides Available

Last November in New York City, Hadoop World, the largest conference of Apache Hadoop practitioners, developers, business executives, industry luminaries and innovative companies, took place. The enthusiasm for the possibilities in Big Data management and analytics with Hadoop was palpable across the conference. Cloudera CEO Mike Olson summarizes Hadoop World 2011 in these final remarks.

Those who attended Hadoop World know how difficult navigating a route between two days of five parallel tracks of compelling content can be—particularly since Hadoop World 2011 consisted of sixty-five informative sessions about Hadoop. Understanding that it is nearly impossible to obtain and/or retain all the valuable information shared live at the event, we have compiled all the Hadoop World presentation slides and videos for perusing, sharing and for reference at your convenience. You can turn to these resources for technical Hadoop help and real-world production Hadoop examples, as well as information about advanced data science analytics.

Apache Sqoop: Highlights of Sqoop 2

This post was originally published on the Apache Blog.

Apache Sqoop (incubating) was created to efficiently transfer bulk data between Hadoop and external structured datastores, such as RDBMSs and data warehouses, because databases are not easily accessible by Hadoop. Sqoop is currently undergoing incubation at The Apache Software Foundation.

Cloudera Manager – Thank You Customers!

Bala Venkatrao is the Director of Product Management at Cloudera.

As many of you know, we recently launched Cloudera Enterprise 3.7. This release marked a transition from Cloudera Management Suite (CMS) to Cloudera Manager (CM), the industry’s first and most comprehensive management application for Apache Hadoop. Over the last month we have received very positive feedback from our customers. I want to thank again all the Clouderans who spent countless hours bringing this product to market. I also want to take this opportunity to thank our customers for helping us get here, as many of them helped us to prioritize the key features for this release. Several customers have also shared the challenges/use cases from their Hadoop deployments and the need for specific features (more later) in Cloudera Manager. Many customers were actively involved in usability testing sessions for Cloudera Manager, which were immensely helpful!

Oracle selects CDH and Cloudera Manager as the Apache Hadoop Platform for the Oracle Big Data Appliance

Cloudera users gain more choice, tighter Oracle integration. Cloudera partners gain increased validation of their platform choice.

Ed Albanese
Ed leads business development for Cloudera. He is responsible for identifying new markets, revenue opportunities and strategic alliances for the company.

Apache Hadoop in 2011

2011 was a breakthrough year for Apache Hadoop as many more mainstream organizations large and small turned to Hadoop to manage and process Big Data, while enterprise software and hardware vendors have also made Hadoop a prominent part of their offerings. Big Data and Hadoop became synonymous in much of the enterprise discourse, and Big Data interest is not restricted to Big Companies.

Apache Hadoop Releases

Hadoop had three major releases in 2011: 1.0 (AKA 0.20.205.x), 0.22, and 0.23.
