Cloudera Blog · Community Posts
In the technology business, building a thriving and progressive user ecosystem around a platform is about as Mom-and-apple-pie as you can get. We all intuitively acknowledge that it’s one of the metrics for success.
Perhaps the most under-appreciated aspect of any platform ecosystem is the recognition that it is fundamentally built by real people. Without enthusiastic users of a platform engaging as evangelists on its behalf, the growth of the ecosystem around it will eventually slow to a crawl.
On this special April 1 – the seventh anniversary of the Apache Hadoop project’s first release – Hadoop founder Doug Cutting (also Cloudera’s chief architect and the Apache Software Foundation chair) offers seven thoughts on Hadoop:
- Open source accelerates adoption.
If Hadoop had been created as proprietary software it would not have spread as rapidly. We’ve seen incredible growth in the use of Hadoop. Partly that’s because it’s useful. But many would have been cautious to make a vendor-controlled platform part of their infrastructure, useful or not.
- Apache builds collaborative communities.
The Hadoop ecosystem has hundreds of developers working for tens of organizations. Competitors productively collaborate on a daily basis, improving the software we all share. The Apache Software Foundation gives us the methodology that enables this. (Thanks, Apache!)
- The timing is right.
Thanks to our friends at KDnuggets for pointing out that Cloudera is the top influencer in the “Big Data” area, according to social media measurement service Klout – with a Klout Score of 81. (Klout is also a CDH user!)
Klout, by the way, defines “influence” as “the ability to drive action, such as sharing a picture that triggers comments and likes, or tweeting about a great restaurant and causing your followers to go try it for themselves.”
With HBaseCon 2013 (Early Bird registration now open!) preparations in full swing, you may be interested in learning a bit about the personalities behind the Program Committee, who are tasked with formulating a compelling, community-focused agenda.
Recently I had a chance to ask committee members Gary Helmling (Twitter), Lars Hofhansl (Salesforce.com), Jon Hsieh (Cloudera), Doug Meil (Explorys), Andrew Purtell (Intel), Enis Söztutar (Hortonworks), Michael Stack (Cloudera), and Liyin Tang (Facebook) a few questions:
How did you get involved in the HBase community?
In this installment, meet Cloudera Software Engineer Mark Grover (@mark_grover).
What do you do at Cloudera and in which Apache project are you involved?
I’m a Software Engineer at Cloudera, involved mostly with Apache Bigtop, an open source project aimed at building a community around packaging and interoperability testing of projects in the Apache Hadoop ecosystem. In addition, I contribute to Apache Hive, a data warehousing system built on top of Apache Hadoop that allows users to structure and query their Hadoop data using familiar SQL-like syntax. I have also written a section in O’Reilly’s book on Hive, Programming Hive.
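To make the “familiar SQL-like syntax” concrete, here is a minimal HiveQL sketch; the table name, columns, and HDFS path are hypothetical examples, not taken from the post:

```sql
-- Hypothetical example: expose tab-delimited files already in HDFS
-- as a queryable Hive table without moving the data.
CREATE EXTERNAL TABLE page_views (
  view_time TIMESTAMP,
  user_id   BIGINT,
  url       STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/page_views';

-- Hive compiles a query like this into one or more MapReduce jobs.
SELECT url, COUNT(*) AS views
FROM page_views
GROUP BY url
ORDER BY views DESC
LIMIT 10;
```

The appeal for analysts is that the query above reads like ordinary SQL, while Hive handles translating it into distributed jobs over the underlying Hadoop data.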
Last month, Apache Crunch became the fifth project (along with Sqoop, Flume, Bigtop, and MRUnit) to go from Cloudera’s GitHub repository through the Apache Incubator and on to graduate as a top-level project within the Apache Software Foundation. As the founder of the project and a newly minted Apache VP, I wanted to take this opportunity to express my gratitude to the Crunch community, who have taught me that leadership in the Apache Way means service, humility, and investing more time in building a community than I spend writing code. Working with you all on our shared vision is the highlight of every work week.
Creating Analytical Applications with Crunch: Cloudera ML
The Crunch Java libraries operate at a lower level of abstraction than other tools for creating MapReduce pipelines, like Apache Pig, Apache Hive, or Cascading. Crunch does not make any assumptions about the data model in your pipeline, which makes it easy to create data pipelines over non-relational data sources such as time series, Avro records, and Mahout Vectors. In fact, I originally wrote Crunch while I was working on Seismic Hadoop, a command line tool for processing time series of seismic measurements on Hadoop.
When the data science team sat down with our training team to begin planning our next data science course, we quickly discovered that there weren’t any open-source tools in the Hadoop ecosystem that would allow students to perform the data preparation and model evaluation techniques that we wanted them to learn. For example, it wasn’t possible to quickly summarize a CSV file of numerical and categorical variables via a single MapReduce job, and then use that summary to convert the CSV file into the distributed matrix format that is used as input to many of Mahout’s algorithms. We were also concerned that there wasn’t a lot of guidance as to how to choose values for many of the parameters that Mahout’s algorithms require, and that this might discourage new data scientists from using these models effectively.
Hadoop Summit Europe is coming up in Amsterdam next week, so this is an appropriate time to make you aware of the Cloudera speaker program there (all three talks on Thursday, March 21):
Every growing, dynamic engineering culture needs a hackathon every once in a while.
Earlier this week, Cloudera put that thought into action with a two-day, around-the-clock “What the Hack!” internal hackathon in our Palo Alto offices, with our friends from Accel Partners underwriting the omnipresent food and beverage (thanks!). The carrot: “Fun surprise awards, and most important, the rights to brag about your cool hacking ideas.”
The morning began with a warm and festive welcome:
It has been a busy time for announcements coinciding with this week’s Strata conference. There’s no corner of the technology world that has not embraced Apache Hadoop as the new platform for big data. Apache Hadoop began as a telegram from the future from Google, turned into real software by Doug Cutting while on a freelance assignment. While Hadoop’s origins are surprising, its ongoing popularity is not – open source has been a major contributing factor to Hadoop’s current ubiquity. Easy to trial, fast to evolve, inexpensive to own: open source makes a compelling case for itself.
From the founding of the company, Cloudera recognized the importance of Apache open source to Hadoop’s continued evolution. We’re now entering our fifth year of shipping a 100% open source platform. Every significant advance we have added to the platform has stayed consistent with our open source strategy. In the process, Cloudera has now sponsored the development of seven new open source projects: Apache Flume, Apache Sqoop, Apache Bigtop, Apache MRUnit, Cloudera Hue, Apache Crunch, and most recently, Cloudera Impala. Acknowledging the maxim “innovation happens elsewhere,” we’ve also managed to convince the founders and/or PMC chairs of Apache Hadoop, Apache Oozie, Apache ZooKeeper, and Apache HBase to come join Cloudera.
Our investment in open source is not altruistic — we think it is good business. Today, Cloudera employees contribute more patches to the Apache Hadoop ecosystem than every other software vendor combined. Meanwhile, more enterprises have adopted our open source platform than every other Hadoop distribution combined. We do not think it is a coincidence that these two things are simultaneously true.
This blog was originally published at blog.apache.org/pig and is republished here for your convenience by permission of its author, Pig Committer Dmitriy Ryaboy.
After months of work, we are happy to announce the 0.11 release of Apache Pig. In this blog post, we highlight some of the major new features and performance improvements that were contributed to this release. A large chunk of the new features was created by Google Summer of Code (GSoC) students with supervision from the Apache Pig PMC, while the core Pig team focused on performance improvements, usability issues, and bug fixes. We encourage CS students to consider applying for GSoC in 2013 – it’s a great way to contribute to open source software.
This blog post covers only some highlights of the release. Pig users may also find helpful a presentation by Daniel Dai that includes code and output samples for the new operators.