Cloudera Engineering Blog · Community Posts
With HBaseCon 2013 (Early Bird registration now open!) preparations in full swing, you may be interested in learning a bit about the personalities behind the Program Committee, who are tasked with formulating a compelling, community-focused agenda.
Recently I had a chance to ask committee members Gary Helmling (Twitter), Lars Hofhansl (Salesforce.com), Jon Hsieh (Cloudera), Doug Meil (Explorys), Andrew Purtell (Intel), Enis Söztutar (Hortonworks), Michael Stack (Cloudera), and Liyin Tang (Facebook) a few questions:
In this installment, meet Cloudera Software Engineer/Apache Bigtop Committer Mark Grover (@mark_grover).
Editor’s note (12/19/2013): Cloudera ML has been merged into the Oryx project. The information below is still valid though.
Last month, Apache Crunch became the fifth project (along with Sqoop, Flume, Bigtop, and MRUnit) to go from Cloudera’s github repository through the Apache Incubator and on to graduate as a top-level project within the Apache Software Foundation. As the founder of the project and a newly minted Apache VP, I wanted to take this opportunity to express my gratitude to the Crunch community, who have taught me that leadership in the Apache Way means service, humility, and investing more time in building a community than I spend writing code. Working with you all on our shared vision is the highlight of every work week.
Creating Analytical Applications with Crunch: Cloudera ML
Hadoop Summit Europe is coming up in Amsterdam next week, so this is an appropriate time to make you aware of the Cloudera speaker program there (all three talks on Thursday, March 21):
Every growing, dynamic engineering culture needs a hackathon every once in a while.
Earlier this week, Cloudera put that thought into action with a two-day, around-the-clock “What the Hack!” internal hackathon in our Palo Alto offices, with our friends from Accel Partners underwriting the omnipresent food and beverage (thanks!). The carrot: “Fun surprise awards, and most important, the rights to brag about your cool hacking ideas.”
It has been a busy time for announcements coinciding with this week’s Strata conference. There’s no corner of the technology world that has not embraced Apache Hadoop as the new platform for big data. Apache Hadoop began as a telegram from the future from Google, turned into real software by Doug Cutting while on a freelance assignment. While Hadoop’s origins are surprising, its ongoing popularity is not – open source has been a major contributing factor to Hadoop’s current ubiquity. Easy to trial, fast to evolve, inexpensive to own: open source makes a compelling case for itself.
From the founding of the company, Cloudera recognized the importance of Apache open source to Hadoop’s continued evolution. We’re now entering our fifth year of shipping a 100% open source platform. Every significant advance we have added to the platform has stayed consistent to our open source strategy. In the process Cloudera has now sponsored the development of seven new open source projects including Apache Flume, Apache Sqoop, Apache Bigtop, Apache MRUnit, Cloudera Hue, Apache Crunch, and most recently, Cloudera Impala. Acknowledging the maxim “innovation happens elsewhere,” we’ve also managed to convince the founders and/or PMC chairs of Apache Hadoop, Apache Oozie, Apache Zookeeper, and Apache HBase to come join Cloudera.
This blog was originally published at blog.apache.org/pig and is republished here for your convenience by permission of its author, Pig Committer Dmitriy Ryaboy.
After months of work, we are happy to announce the 0.11 release of Apache Pig. In this blog post, we highlight some of the major new features and performance improvements that were contributed to this release. A large chunk of the new features was created by Google Summer of Code (GSoC) students with supervision from the Apache Pig PMC, while the core Pig team focused on performance improvements, usability issues, and bug fixes. We encourage CS students to consider applying for GSOC in 2013 – it’s a great way to contribute to open source software.
(Added Feb. 25 2013: Early Bird registration is now open – closes April 23, 2013!)
Clouderans are traveling the United States (and beyond) in droves during the first quarter of 2013 to present at developer meetups and conferences. If you’re interested in attending one near you, we’ve listed them below – see links for specific topics (but note that some of the sites involved may not reflect complete event details yet; check back later for updates).
Let me point out in particular that:
(Update 2/6/2013 – Sorry, this event is sold out!)
With Strata Conference 2013 coming to town (Feb. 26-28, in Santa Clara, Calif.), we thought it would be a great opportunity to open our Palo Alto office’s doors for a pre-conference “Data Hacking Day” on Monday, Feb. 25!
Our hearty congratulations to the Cloudera engineers who have been accepted as ApacheCon NA 2013 (Feb. 26-28 in Portland, OR) speakers for these talks:
So, you want to report a bug, propose a new feature, or contribute code or doc to Apache Hadoop (or a related project), but you don’t know what to do and where to start? Don’t worry, you’re not alone.
Let us help: in this 24-minute screencast, Clouderan Jeff Bean (@jwfbean) offers a step-by-step tutorial that explains why and how to contribute. Apache JIRA ninjas need not view, but anyone else with slight (or less) familiarity with that curious beast will find this information very helpful.
AssignmentManager is a module in the Apache HBase Master that manages regions to RegionServers assignment. (See HBase architecture for more information.) It ensures that all regions are assigned and each region is assigned to just one RegionServer.
Although the AssignmentManager generally does a good job, the existing implementation does not handle assignments as well as it could. For example, if a region was assigned to two or more RegionServers, some regions were stuck in transition and never got assigned, or unknown region exceptions were thrown in moving a region from one RegionServer to another.
Since the Cloudera Impala announcement of a few weeks ago, we’ve been busy partnering-up with Hadoop meetups around the country (and beyond) to bring Impala tech talks directly to the community. Here’s the list for the remainder of 2012, thus far:
The FutureBI meetup is excitedly preparing to host Cloudera CEO Mike Olson at its upcoming meetup on Nov. 6 at the Berkeley School of Information, where he’ll be joined by SiSenseVP of Marketing Bruno Aziza in a conversation titled “The Future of Big Data: We depend on you!” The event will be open to the public as well as to Berkeley students and faculty.
Questions are encouraged during this informal and interactive session, which will cover topics ranging from the evolution of open source software, the changing entrepreneurial ecosystem in the Bay Area, and the likely future of information management. Founders, geeks, and tech industry professionals are all welcome to attend and join the discussion!
Cloudera is co-presenting the sold-out Strata Conference + Hadoop World in New York this week, and if you’re an attendee, you have a great week ahead!
Here’s a quick guide to where you can find Clouderans during the conference. There are of course many other great activities planned as well that are not covered here.
Earlier this month the Apache Hadoop PMC released Apache Hadoop 2.0.2-alpha, which fixes over 600 issues since the previous release in the 2.0 series, 2.0.1-alpha, back in July. This is a tremendous rate of development, of which all contributors to the project should feel proud.
Some of the more noteworthy changes in this release include:
Apache HBase will have a notable profile at ApacheCon Europenext month. Clouderan and HBase committer Lars George has two sessions on the schedule:
In this installment of “Meet the Engineers”, meet Todd Lipcon (@tlipcon), PMC member/committer for the Hadoop, HBase, and Thrift projects.
StumbleUpon (SU) and Cloudera have signed a technology collaboration agreement. Cloudera will support the SU clusters, and in exchange, Cloudera will have access to a variety of production deploys on which to study and try out beta software.
As part of the agreement, the StumbleUpon Apache HBase+Apache Hadoop team — Jean-Daniel Cryans, Elliott Clark and I — have joined Cloudera. From our new perch up in the Cloudera San Francisco office — 10 blocks north and 11 floors up — we will continue as first-level support for SU clusters, tending and optimizing them as we have always done. The rest of our time will be spent helping develop Apache HBase as the newest additions to Cloudera’s HBase team.
For those of you new to it, the Duke’s Choice Awards program was initiated by Sun Microsystems in 2002 in an effort to “celebrate extreme innovation in the world of Java technology” – in essence, it’s the “MVP” of the Java ecosystem. Since it acquired Sun in 2009, Oracle has continued the tradition of bestowing the award, and in fact has made the process more community-oriented by accepting nominations from the public and involving Java User Groups in the judging effort.
For the 2012 awards, I’m happy to report that Apache Hadoop is among the awardees - which also include the United Nations High Commission for Refugees, Liquid Robotics, and Java cloud company Jelastic Inc., among others.
We’re getting really close to Strata Conference + Hadoop World 2012 (just over a month away), schedule planning-wise. So you may want to consider adding the tutorials, sessions, and keynotes below to your calendar! (Start times are always subject to change of course.)
The ones listed below are led or co-led by Clouderans, but there is certainly a wide range of attractive choices beyond what you see here. We just want to ensure that you put these particular ones high on your consideration list.
Strata Conference + Hadoop World (Oct. 23-25 in New York City) is a bonanza for Hadoop and big data enthusiasts – but not only because of the technical sessions and tutorials. It’s also an important gathering place for the developer community, most of whom are eager to share info from their experiences in the “trenches”.
Just to make that process easier, Cloudera is teaming up with local meetups during that week to organize a series of meetings on a variety of topics. (If for no other reason, stop into one of these meetups for a chance to grab a coveted Cloudera t-shirt.)
Apache HBase junkies, this one’s for you: I had an opportunity recently for a quick chat with the authors of HBase in Action (Manning Publications – download sample chapter PDF), by Nick Dimiduk and Cloudera’s Amandeep Khurana.
As I mentioned in my inaugural post last week, it’s important to shine a spotlight on the Cloudera engineers who have a hand in making the Hadoop projects run. It’s an obvious point, and yet an overlooked one, that a community is an aggregation of individual personalities who have diverse backgrounds and interests yet a shared passion for the group and its goals. As Jono Bacon puts it in his seminal 2009 book The Art of Community, “The building blocks of a community are its teams, and the material that makes these blocks are people.”
The following is a guest post kindly offered by Adam Kawa, a 26-year old Hadoop developer from Warsaw, Poland. This post was originally published in a slightly different form at his blog, Hakuna MapData!
Recently I have found an interesting dataset, called Million Song Dataset (MSD), which contains detailed acoustic and contextual data about a million songs. For each song we can find information like title, hotness, tempo, duration, danceability, and loudness as well as artist name, popularity, localization (latitude and longitude pair), and many other things. There are no music files included here, but the links to MP3 song previews at 7digital.com can be easily constructed from the data.
Hello World: This is my first post as the new guy facilitating and coordinating developer community outreach for Cloudera. I am extremely excited to become a new node in the Apache Hadoop ecosystem and involved in the global, thriving community of developers coding against Hadoop and its related projects.
My most recent experience involves deep interest in the Java ecosystem – which achieved ubiquity not only for technical reasons, but also thanks to a vast group of passionate users who committed themselves to Java’s adoption. There are many lessons there for those of us who want to see the Hadoop edition meet that standard.
We are happy to announce the general availability of CDH3 update 5. This update is a maintenance release of CDH3 platform and provides a considerable amount of bug-fixes and stability enhancements. Alongside these fixes, we have also included a few new features, most notable of which are the following:
At 5 pm PDT on June 30, a leap second was added to the Universal Coordinated Time (UTC). Within an hour, Cloudera Support started receiving reports of systems running at 100% CPU utilization. The Support Team worked quickly to understand and diagnose the problem and soon published a solution. Bugs due to the leap second coupled with the Amazon Web Services outage would make this Cloudera’s busiest support weekend to date.
Since Hadoop is written in Java and closely interoperates with the underlying OS, Cloudera Support troubleshoots not only all 17 components in the Hadoop ecosystem, but also any underlying Linux and Java bugs. Last weekend many of our customers were affected by the now infamous “leap second” bugs. Initially, many assumed that Java and Linux would process the leap second gracefully. However, we soon discovered that this wasn’t the case and depending on the version of Linux being used, several distinct issues were observed.
This is a guest re-post from Datameer’s Director of Marketing, Rich Taylor. The original post can be found on the Datameer blog.
Datameer uses D3.js to power our Business Infographic™ designer. I thought I would show how we visualized the Apache Hadoop ecosystem connections. First using only D3.js, and second using Datameer 2.0.
HBaseCon 2012 summation provided by Michael Stack, PMC Chair of the Apache HBase Project. HBase Hack-a-thon summation provided by David Wang, Engineering Manager for the Cloudera HBase team.
HBaseCon 2012 Summation
On Tuesday, June 12th The Churchill Club of Silicon Valley hosted a panel discussion on Hadoop’s evolution from an open-source project to becoming a standard component of today’s enterprise computing fabric. The lively and dynamic discussion was moderated by Cade Metz, Editor, Wired Enterprise.
Michael Driscoll, CEO, Metamarkets
Andrew Mendelsohn, SVP, Oracle Server Technologies
Mike Olson, CEO, Cloudera
Jay Parikh, VP Infrastructure Engineering, Facebook
John Schroeder, CEO, MapR
Today’s interview features Todd Lipcon, software engineer for Cloudera. Todd will be presenting Optimizing MapReduce Job Performance at Hadoop Summit.
Question: Tell us about your current role and how you interact with Apache Hadoop?
Todd: I’m a software engineer on Cloudera’s platform engineering team, where I spend most of my time contributing code to open source projects like Apache Hadoop and Apache HBase. Most recently I’ve been implementing the automatic HA failover feature in Hadoop 2.0, but I’ve also spent a lot of time working on understanding and improving performance of the Hadoop stack.
Question: Tell us about your Hadoop Summit presentation?
This post was originally posted on the Apache Software Foundation’s blog.
We (the Apache MRUnit team) have just released Apache MRUnit 0.9.0-incubating (tarball, nexus, javadoc). Apache MRUnit is an Apache Incubator project that is a Java library which helps developers unit test Apache Hadoop MapReduce jobs. Unit testing is a technique for improving project quality and reducing overall costs by writing a small amount of code that can automatically verify the software you write performs as intended. This is considered a best practice in software development since it helps identify defects early, before they’re deployed to a production system.
HBaseCon 2012 is only a month away! The conference takes place May 22 in San Francisco, California and the event is poised to sell out.
Apache HBase is an open source software project that provides users with the ability to do real-time random read/write access to their data in Apache Hadoop. This means that when you want to use Hadoop for real-time data processing, HBase is the project you are looking for. The HBase developer community includes contributors from many organizations such as StumbleUpon, Facebook, Salesforce.com, TrendMicro, eBay, Explorys, Huawei and Cloudera. In fact, the HBaseCon Program Committee, constructors of the HBaseCon 2012 agenda, are all committers and PMC members of the Apache HBase project.
Apache Bigtop 0.3.0 (incubating) is now available. This is the first fully integrated, community-driven, 100% Apache Big Data management distribution based on Apache Hadoop 1.0. In addition to a major change in the Hadoop version, all of the Hadoop ecosystem components have been upgraded to the latest stable versions and thoroughly tested:
Apache Hadoop consists of two primary components: HDFS and MapReduce. HDFS, the Hadoop Distributed File System, is the primary storage system of Hadoop, and is responsible for storing and serving all data stored in Hadoop. MapReduce is a distributed processing framework designed to operate on data stored in HDFS.
HDFS has long been considered a highly reliable file system. An empirical study done at Yahoo! concluded that across Yahoo!’s 20,000 nodes running Apache Hadoop in 10 different clusters in 2009, HDFS lost only 650 blocks out of 329 million total blocks. The vast majority of these lost blocks were due to a handful of bugs which have long since been fixed.
More than 150 people attended the San Francisco Bay Area HBase User Group meetup last Thursday, January 19th, at eBay headquarters in San Jose, California. Presenters from StumbleUpon, Facebook, eBay and MapR shared a wealth of information about Apache HBase operations and optimizations, gleaned from their experience running HBase in production environments.
One special item of note: Michael Stack announced HBaseCon 2012, taking place this spring in the Bay Area. This inaugural conference will focus on the growth and education of the HBase community. While details of the event are not yet published, the call for speakers is currently open. Submit your abstract here.
Last November in New York City, Hadoop World, the largest conference of Apache Hadoop practitioners, developers, business executives, industry luminaries and innovative companies took place. The enthusiasm for the possibilities in Big Data management and analytics with Hadoop was palpable across the conference. Cloudera CEO, Mike Olson, summarizes Hadoop World 2011 in these final remarks.
Those who attended Hadoop World know how difficult navigating a route between two days of five parallel tracks of compelling content can be—particularly since Hadoop World 2011 consisted of sixty-five informative sessions about Hadoop. Understanding that it is nearly impossible to obtain and/or retain all the valuable information shared live at the event, we have compiled all the Hadoop World presentation slides and videos for perusing, sharing and for reference at your convenience. You can turn to these resources for technical Hadoop help and real-world production Hadoop examples, as well as information about advanced data science analytics.
This blog was originally posted on the Apache Blog: https://blogs.apache.org/sqoop/entry/apache_sqoop_highlights_of_sqoop
Apache Sqoop (incubating) was created to efficiently transfer bulk data between Hadoop and external structured datastores, such as RDBMS and data warehouses, because databases are not easily accessible by Hadoop. Sqoop is currently undergoing incubation at The Apache Software Foundation. More information on this project can be found at http://incubator.apache.org/sqoop.
Bala Venkatrao is the Director of Product Management at Cloudera.
As many of you know, we recently launched Cloudera Enterprise 3.7. Here’s the link to the press release This release marked a transition from Cloudera Management Suite (CMS) to Cloudera Manager (CM), the industry’s first and most comprehensive management application for Apache Hadoop. Over the last month we have received very positive feedback from our customers. I want to thank again all the Clouderans who spent countless hours bringing this product to market. I also want to take this opportunity to thank our customers for helping us get here, as many of them helped us to prioritize the key features for this release. Several customers have also shared the challenges/use cases from their Hadoop deployments and the need for specific features (more later) in Cloudera Manager. Many customers were actively involved in usability testing sessions for Cloudera Manager, which were immensely helpful!
Oracle selects CDH and Cloudera Manager as the Apache Hadoop Platform for the Oracle Big Data Appliance
Cloudera users gain more choice, tighter Oracle integration. Cloudera partners gain increased validation of their platform choice.
Ed leads business development for Cloudera. He is responsible for identifying new markets, revenue opportunities and strategic alliances for the company.
2011 was a breakthrough year for Apache Hadoop as many more mainstream organizations large and small turned to Hadoop to manage and process Big Data, while enterprise software and hardware vendors have also made Hadoop a prominent part of their offerings. Big Data and Hadoop became synonymous in much of the enterprise discourse, and Big Data interest is not restricted to Big Companies.
Apache Hadoop Releases
Hadoop had three major releases in 2011: 1.0 (AKA 0.20.205.x), 0.22, and 0.23.
This was my summer internship project at Cloudera, and I’m very thankful for the level of support and mentorship I’ve received from the Apache HBase community. I started off in June with a very limited knowledge of both HBase and distributed systems in general, and by September, managed to get this patch committed to HBase trunk. I couldn’t have done this without a phenomenal amount of help from Cloudera and the greater HBase community.
The amount of memory available on a commodity server has increased drastically in tune with Moore’s law. Today, its very feasible to have up to 96 gigabytes of RAM on a mid-end, commodity server. This extra memory is good for databases such as HBase which rely on in memory caching to boost read performance.
This is a guest post contributed by Loren Siebert. Loren is a San Francisco entrepreneur and software developer, and is currently the technical lead for the USASearch program.
A year ago I rolled my first Apache Hadoop system into production. Since then, I’ve spoken to quite a few people who are eager to try Hadoop themselves in order to solve their own big data problems. Despite having similar backgrounds and data problems, few of these people have sunk their teeth into Hadoop. When I go to Hadoop Meetups in San Francisco, I often meet new people who are evaluating Hadoop and have yet to launch a cluster. Based on my own background and experience, I have some ideas on why this is the case.
This is a guest post from RichRelevance Principal Architect and Apache Avro PMC Chair Scott Carey.
In Early 2010 at RichRelevance, we were searching for a new way to store our long lived data that was compact, efficient, and maintainable over time. We had been using Hadoop for about a year, and started with the basics – text formats and SequenceFiles. Neither of these were sufficient. Text formats are not compact enough, and can be painful to maintain over time. A basic binary format may be more compact, but it has the same maintenance issues as text. Furthermore, we needed rich data types including lists and nested records.
David joined us as part of our intern program, and built the prototype for the distributed log search functionality that’s available as part of Cloudera Manager 3.7. He did an awesome job, and wrote the following blog post which, now that CM3.7 has been released, we’re pleased to publish.
My intern project was to build a log searching tool, specialized for Apache Hadoop. My mini-app allows Hadoop cluster admins and operators to search their error logs across many machines, filter by time range, text in the log message, and find the namenode machine, for example. The results are then ordered by time, and shown to the user.