Cloudera Blog · Community Posts
Earlier this month the Apache Hadoop PMC released Apache Hadoop 2.0.2-alpha, which fixes over 600 issues since the previous release in the 2.0 series, 2.0.1-alpha, back in July. This is a tremendous rate of development, of which all contributors to the project should feel proud.
Some of the more noteworthy changes in this release include:
Apache HBase will have a notable profile at ApacheCon Europe next month. Clouderan and HBase committer Lars George has two sessions on the schedule:
In this installment of “Meet the Engineers”, meet Todd Lipcon (@tlipcon), PMC member/committer for the Hadoop, HBase, and Thrift projects.
What do you do at Cloudera, and in which Apache project are you involved?
StumbleUpon (SU) and Cloudera have signed a technology collaboration agreement. Cloudera will support the SU clusters, and in exchange, Cloudera will have access to a variety of production deploys on which to study and try out beta software.
As part of the agreement, the StumbleUpon Apache HBase+Apache Hadoop team — Jean-Daniel Cryans, Elliott Clark and I — have joined Cloudera. From our new perch up in the Cloudera San Francisco office — 10 blocks north and 11 floors up — we will continue as first-level support for SU clusters, tending and optimizing them as we have always done. The rest of our time will be spent helping develop Apache HBase as the newest additions to Cloudera’s HBase team.
We do not foresee this transition disrupting our roles as contributors to HBase. If anything, we look forward to contributing even more than in the past.
For those of you new to it, the Duke’s Choice Awards program was initiated by Sun Microsystems in 2002 in an effort to “celebrate extreme innovation in the world of Java technology” – in essence, it’s the “MVP” of the Java ecosystem. Since it acquired Sun in 2009, Oracle has continued the tradition of bestowing the award, and in fact has made the process more community-oriented by accepting nominations from the public and involving Java User Groups in the judging effort.
For the 2012 awards, I’m happy to report that Apache Hadoop is among the awardees, which also include the United Nations High Commissioner for Refugees, Liquid Robotics, and Java cloud company Jelastic Inc., among others.
As Doug Cutting, the Hadoop project’s founder, current ASF chairman, and Cloudera’s chief architect, explains in the Java Magazine writeup about the award, “Java is the primary language of the Hadoop ecosystem…and Hadoop is the de facto standard operating system for big data. So, as the big data trend spreads, Java spreads too.”
Strata Conference + Hadoop World 2012 is just over a month away, so it’s time to start planning your schedule. You may want to consider adding the tutorials, sessions, and keynotes below to your calendar! (Start times are always subject to change, of course.)
The ones listed below are led or co-led by Clouderans, but there is certainly a wide range of attractive choices beyond what you see here. We just want to ensure that you put these particular ones high on your consideration list.
If you’re interested in community meetups as well, refer to my post from last week on that subject – several are planned.
|Session|Presenter(s)|Date|Start time|
|---|---|---|---|
|An Introduction to Hadoop|Mark Fei|Tues., Oct. 23|9am|
|Using HBase|Amandeep Khurana, Matteo Bertozzi|Tues., Oct. 23|9am|
|Testing Hadoop Applications|Tom Wheeler|Tues., Oct. 23|9am|
|Building a Large-scale Data Collection System Using Flume NG|Hari Shreedharan, Will McQueen, Arvind Prabhakar, Prasad Mujumdar, Mike Percy|Tues., Oct. 23|1:30pm|
|Given Enough Monkeys – Some Thoughts on Randomness|Jesse Anderson|Tues., Oct. 23|3:20pm|
|Keynote: Big Answers|Mike Olson|Weds., Oct. 24|8:55am|
|Large Scale ETL with Hadoop|Eric Sammer|Weds., Oct. 24|11:40am|
|HDFS – What is New and Future|Todd Lipcon (co-presenter)|Weds., Oct. 24|4:10pm|
|High Availability for the HDFS NameNode: Phase 2|Aaron Myers, Todd Lipcon|Weds., Oct. 24|5pm|
|Plenary Session: Beyond Batch|Doug Cutting|Thurs., Oct. 25|9:20am|
|Upcoming Enterprise Features in Apache HBase 0.96|Jon Hsieh|Thurs., Oct. 25|11:40am|
|Data Science on Hadoop: What’s There and What’s Missing|Justin Erickson|Thurs., Oct. 25|1:40pm|
|Taming the Elephant – Learn How Monsanto Manages Their Hadoop Cluster to Enable Genome/Sequence Processing|Bala Venkatrao, Aparna Ramani (with others)|Thurs., Oct. 25|4:10pm|
|Knitting Boar|Josh Patterson, Michael Katzenellenbogen|Thurs., Oct. 25|4:10pm|
Strata Conference + Hadoop World (Oct. 23-25 in New York City) is a bonanza for Hadoop and big data enthusiasts – but not only because of the technical sessions and tutorials. It’s also an important gathering place for the developer community, most of whom are eager to share info from their experiences in the “trenches”.
Just to make that process easier, Cloudera is teaming up with local meetups during that week to organize a series of meetings on a variety of topics. (If for no other reason, stop into one of these meetups for a chance to grab a coveted Cloudera t-shirt.)
As you can see, these meetups are highly parallel, so you will either have to make careful choices or have very quick feet. The good news is: there’s something for everybody.
Apache HBase junkies, this one’s for you: I had an opportunity recently for a quick chat with the authors of HBase in Action (Manning Publications – download sample chapter PDF), by Nick Dimiduk and Cloudera’s Amandeep Khurana.
Why did you write HBase in Action?
As I mentioned in my inaugural post last week, it’s important to shine a spotlight on the Cloudera engineers who have a hand in making the Hadoop projects run. It’s an obvious point, and yet an overlooked one, that a community is an aggregation of individual personalities who have diverse backgrounds and interests yet a shared passion for the group and its goals. As Jono Bacon puts it in his seminal 2009 book The Art of Community, “The building blocks of a community are its teams, and the material that makes these blocks are people.”
Thus, welcome to the first installment of our “Meet the Engineers” series, in which we will briefly introduce you to some of the engineer-individuals helping to build the foundations of Hadoop. Today, it’s Aaron T. Myers, aka ATM!
The following is a guest post kindly offered by Adam Kawa, a 26-year-old Hadoop developer from Warsaw, Poland. This post was originally published in a slightly different form at his blog, Hakuna MapData!
I recently found an interesting dataset, the Million Song Dataset (MSD), which contains detailed acoustic and contextual data about a million songs. For each song we can find information such as title, hotness, tempo, duration, danceability, and loudness, as well as artist name, popularity, location (a latitude and longitude pair), and many other things. No music files are included, but links to MP3 song previews at 7digital.com can easily be constructed from the data.
The dataset consists of 339 tab-separated text files. Each file contains about 3,000 songs, and each song is represented as one separate line of text. The dataset is publicly available, and you can find it at Infochimps or Amazon S3. Since the data totals around 218GB, processing it on a single machine may take a very long time.
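To make the file layout concrete, here is a minimal Python sketch of reading such a file: one song per line, fields separated by tabs. The function names and the three-field sample rows are hypothetical and only for illustration; the real MSD rows carry many more fields, and their column order is not given here.

```python
import io
from typing import Iterable, Iterator, List


def parse_songs(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield one list of string fields per song (one song per line)."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        yield line.split("\t")


def count_songs(lines: Iterable[str]) -> int:
    """Count the song records in an iterable of lines."""
    return sum(1 for _ in parse_songs(lines))


# Hypothetical two-song sample standing in for one of the 339 files.
sample = io.StringIO("Song A\t120.5\t210.0\nSong B\t98.0\t185.3\n")
records = list(parse_songs(sample))
```

The same generator works on a real file by passing an open file handle instead of the in-memory sample. At about 3,000 lines in each of 339 files, a full pass touches roughly a million records.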