Cloudera Engineering Blog · General Posts
Today’s interview features Todd Lipcon, software engineer for Cloudera. Todd will be presenting Optimizing MapReduce Job Performance at Hadoop Summit.
Question: Tell us about your current role and how you interact with Apache Hadoop?
Todd: I’m a software engineer on Cloudera’s platform engineering team, where I spend most of my time contributing code to open source projects like Apache Hadoop and Apache HBase. Most recently I’ve been implementing the automatic HA failover feature in Hadoop 2.0, but I’ve also spent a lot of time working on understanding and improving performance of the Hadoop stack.
Question: Tell us about your Hadoop Summit presentation?
We’re happy to announce the Beta release of Cloudera Manager 4.0.
We are happy to officially announce the general availability of CDH3 update 4. This update consists primarily of reliability enhancements as well as a number of minor improvements.
First, there have been a few notable HBase updates. In this release, we’ve upgraded Apache HBase to upstream version 0.90.6, improving system robustness and availability. Also, some of the recent hbck changes were incorporated to better detect and handle various types of corruptions. Lastly, HDFS append support is now disabled by default in this release as it is no longer needed for HBase. Please see the CDH3 Known Issues and Workarounds page for details.
I’m pleased to inform our users and customers that we have released the Cloudera’s Distribution Including Apache Hadoop version 4 (CDH4) 2nd and final beta today. We received great feedback from the community from the first beta and this release incorporates that feedback as well as a number of new enhancements.
CDH4 has a great many enhancements compared to CDH3.
Apache HBase is an open source software project that provides users with the ability to do real-time random read/write access to their data in Apache Hadoop. This means that when you want to use Hadoop for real-time data processing, HBase is the project you are looking for. The HBase developer community includes contributors from many organizations such as StumbleUpon, Facebook, Salesforce.com, TrendMicro, eBay, Explorys, Huawei and Cloudera. In fact, the HBaseCon Program Committee, constructors of the HBaseCon 2012 agenda, are all committers and PMC members of the Apache HBase project.
San Francisco seems to be having an unusually high number of flu cases/searches this April, and the Cloudera Data Science Team has been hit pretty hard. Our normal activities (working on Crunch, speaking at conferences, finagling a job with the San Francisco Giants) have taken a back seat to bed rest, throat lozenges, and consuming massive quantities of orange juice. But this bit of downtime also gave us an opportunity to focus on solving a large-scale data science problem that helps some of the people who help humanity the most: epidemiologists.
A case-control study is a type of observational study in which a researcher attempts to identify the factors that contribute to a medical condition by comparing a set of subjects who have that condition (the ‘cases’) to a set of subjects who do not have the condition, but otherwise resemble the case subjects (the ‘controls’). They are useful for exploratory analysis because they are relatively cheap to perform, and have led to many important discoveries- most famously, the link between smoking and lung cancer.
This blog was originally posted on the Apache Blog:
Cloudera hosted the Apache Sqoop Meetup last week at Cloudera HQ in Palo Alto. About 20 of the Meetup attendees had not used Sqoop before, but were interested enough to participate in the Meetup on April 4th. We believe this healthy interest in Sqoop will contribute to its wide adoption.
Cloudera will be hosting an Apache HBase hackathon on May 23rd, 2012, the day after HBaseCon 2012. The overall theme of the event will be 0.96 stabilization. If you are in the area for HBaseCon, please come down to our offices in Palo Alto the next day to attend the hackathon. This is a great opportunity to contribute some code towards the project and hang out with other HBasers.
More details are on the hackathon’s Meetup page. Please RSVP so we can better plan lunch, room size, and other logistics for the event. See you there!
Apache Bigtop 0.3.0 (incubating) is now available. This is the first fully integrated, community-driven, 100% Apache Big Data management distribution based on Apache Hadoop 1.0. In addition to a major change in the Hadoop version, all of the Hadoop ecosystem components have been upgraded to the latest stable versions and thoroughly tested:
This blog was originally posted on the Apache Blog: https://blogs.apache.org/sqoop/entry/apache_sqoop_graduates_from_incubator
Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases. You can use Sqoop to import data from external structured datastores into Hadoop Distributed File System or related systems like Hive and HBase. Conversely, Sqoop can be used to extract data from Hadoop and export it to external structured datastores such as relational databases and enterprise data warehouses.
A few months ago, my colleague Charles Zedlewski wrote a great piece explaining Apache Hadoop version numbering. The post can be summed up with the following diagram:
The Bay Area HBase User Group March 2012 meetup was held at the StumbleUpon offices in San Francisco, California. 80 interested Apache HBasers were in attendance to mingle and listen to the scheduled presentations.
Michael Stack started the meetup by reminding folks to register for HBaseCon 2012 in San Francisco on May 22nd. Nick Dimiduk and Cloudera’s Amandeep Khurana then announced an early access program for their upcoming book, HBase In Action. Interested folks can get a discount for the program by using the code “hbase38.”
One of the more confusing topics in Hadoop is how authorization and authentication work in the system. The first and most important thing to recognize is the subtle, yet extremely important, differentiation between authorization and authentication, so let’s define these terms first:
Authentication is the process of determining whether someone is who they claim to be.
This blog was originally posted on the Apache Software Foundation MRUnit’s blog.
We (the Apache MRUnit team) have just released Apache MRUnit 0.8.1-incubating. Apache MRUnit is an Apache Incubator project. MRUnit is a Java library that helps developers unit test Apache Hadoop MapReduce jobs. Unit testing is a technique for improving project quality and reducing overall costs by writing a small amount of code that can automatically verify the software you write performs as intended. This is considered a best practice in software development since it helps identify defects early, before they’re deployed to a production system.
Apache Hadoop consists of two primary components: HDFS and MapReduce. HDFS, the Hadoop Distributed File System, is the primary storage system of Hadoop, and is responsible for storing and serving all data stored in Hadoop. MapReduce is a distributed processing framework designed to operate on data stored in HDFS.
HDFS has long been considered a highly reliable file system. An empirical study done at Yahoo! concluded that across Yahoo!’s 20,000 nodes running Apache Hadoop in 10 different clusters in 2009, HDFS lost only 650 blocks out of 329 million total blocks. The vast majority of these lost blocks were due to a handful of bugs which have long since been fixed.
In Building and Deploying MR2 we presented a brief introduction to MapReduce in Apache Hadoop 0.23 and focused on the steps to set up a single-node cluster. This blog provides developers with architectural details of the new MapReduce design.
Apache Hadoop 0.23 has major improvements over previous releases. Here are a few highlights on the MapReduce front; note that there are also major HDFS improvements, which are out of scope of this post.
MapReduce 2.0 (a.k.a. MRv2 or YARN):
ZooKeeper 3.4 is incorporated into CDH4 and now available in beta 1!
I’m pleased to inform our users and customers that Cloudera has released its 4th version of Cloudera’s Distribution Including Apache Hadoop (CDH) into beta today. This release combines the input from our enterprise customers, partners and users with the hard work of Cloudera engineering and the larger Apache open source community to create what we believe is a compelling advance for this widely adopted platform.
There are a great many improvements and new capabilities in CDH4 compared to CDH3. Here is a high level list of what’s available for you to test in this first beta release:
Keeping with our release policy for Cloudera’s Distribution Including Apache Hadoop (CDH) I’m pleased to announce the availability of update 3 for CDH3. As a reminder, we ship updates for our most recent GA distribution every 3 months. Updates primarily include bug fixes but when possible we will also include features from our mid-term roadmap. We’ll only include new features when they do not introduce instability or break compatibility. As always, users have the option to skip updates without incurring any future upgrade cost.
Update 3 contains a number of new improvements. Several improvements positively impact performance. Enhancements were made to HDFS and to HBase which will result in 15-150% improvements in performance compared to CDH3 update 2 depending on the workload. Users should see performance gains in a wide range of workloads from MapReduce over HDFS style workloads to HBase scan style workloads to HBase random read / write workloads. Todd Lipcon’s talk at Hadoop World on performance outlines a number of these improvements that have made it to update 3. Some of these performance improvements require users to select specific configuration settings so please consult the documentation.
More than 150 people attended the San Francisco Bay Area HBase User Group meetup last Thursday, January 19th, at eBay headquarters in San Jose, California. Presenters from StumbleUpon, Facebook, eBay and MapR shared a wealth of information about Apache HBase operations and optimizations, gleaned from their experience running HBase in production environments.
One special item of note: Michael Stack announced HBaseCon 2012, taking place this spring in the Bay Area. This inaugural conference will focus on the growth and education of the HBase community. While details of the event are not yet published, the call for speakers is currently open. Submit your abstract here.
When most people first hear about data science, it’s usually in the context of how prominent web companies work with very large data sets in order to predict clickthrough rates, make personalized recommendations, or analyze UI experiments. The solutions to these problems require expertise with statistics and machine learning, and so there is a general perception that data science is intimately tied to these fields. However, in my conversations at academic conferences and with Cloudera customers, I have found that many kinds of scientists– such as astronomers, geneticists, and geophysicists– are working with very large data sets in order to build models that do not involve statistics or machine learning, and that these scientists encounter data challenges that would be familiar to data scientists at Facebook, Twitter, and LinkedIn.
The Practice of Data Science
The term “data science” has been subject to criticism on the grounds that it doesn’t mean anything, e.g., “What science doesn’t involve data?” or “Isn’t data science a rebranding of statistics?” The source of this criticism could be that data science is not a solitary discipline, but rather a set of techniques used by many scientists to solve problems across a wide array of scientific fields. As DJ Patil wrote in his excellent overview of building data science teams, the key trait of all data scientists is the understanding “that the heavy lifting of [data] cleanup and preparation isn’t something that gets in the way of solving the problem: it is the problem.”
Today the Apache HBase community has proudly released Apache HBase 0.92.0, a major new version of the scalable distributed data store inspired by Google’s BigTable. Over 670 issues were addressed, so in this post I’ll highlight some of the major features and enhancements and describe what they mean for HBase users, admins, and developers.
While the most visible change to the project is the new project logo, the most important changes for users are the performance and robustness improvements to HBase’s core functionality. On the performance side, there are a few major highlights:
This blog was originally posted on the Apache Blog: https://blogs.apache.org/sqoop/entry/apache_sqoop_highlights_of_sqoop
Apache Sqoop (incubating) was created to efficiently transfer bulk data between Hadoop and external structured datastores, such as RDBMS and data warehouses, because databases are not easily accessible by Hadoop. Sqoop is currently undergoing incubation at The Apache Software Foundation. More information on this project can be found at http://incubator.apache.org/sqoop.
If you’re like a myriad of other systems administrators out there, you may be running a production Hadoop cluster, spec’ing one out, or just starting to investigate the possibility of bringing Hadoop into your workplace. As any of these folks will be able to tell you, one of the most important tasks you’ll encounter is capacity planning. With the release of Cloudera Manager 3.7, we’re bringing you a new set of tools to aid you in this process. In this post, we’ll take a look at how you can leverage Cloudera Manager to deal with some common scenarios that you might run into while planning out a Hadoop cluster.
Questions and Patterns
How is my disk usage growing over time?
Oracle selects CDH and Cloudera Manager as the Apache Hadoop Platform for the Oracle Big Data Appliance
Cloudera users gain more choice, tighter Oracle integration. Cloudera partners gain increased validation of their platform choice.
Ed leads business development for Cloudera. He is responsible for identifying new markets, revenue opportunities and strategic alliances for the company.
Great news! The InfoWorld Tech Center has chosen Apache Hadoop for a 2012 Technology of the Year Award. Judged by InfoWorld Test Center editors and reviewers, the annual awards identify the best and most innovative products on the IT landscape. Winners are drawn from all of the products tested during the past year, with the final selections made by InfoWorld’s Test Center staff. All products reviewed by the Test Center are eligible to win, and we at Cloudera are very excited that Hadoop was named among the finalists.
I joined Cloudera in 2011 and it’s been very exciting for me to join the Hadoop community and participate in what, by all accounts, was a landmark year. It’s been fantastic to see how Hadoop has empowered companies of every size and in every industry to do new and interesting things with their data. And this is just the beginning. 2012 promises to bring even more innovation and great use cases for this game-changing platform.
2011 was a breakthrough year for Apache Hadoop as many more mainstream organizations large and small turned to Hadoop to manage and process Big Data, while enterprise software and hardware vendors have also made Hadoop a prominent part of their offerings. Big Data and Hadoop became synonymous in much of the enterprise discourse, and Big Data interest is not restricted to Big Companies.
Apache Hadoop Releases
Hadoop had three major releases in 2011: 1.0 (AKA 0.20.205.x), 0.22, and 0.23.
Some users & customers have asked about the most recent release of Apache Hadoop, v1.0: what’s in it, what it followed and what it preceded. To explain this we should start with some basics of how Apache projects release software:
By and large, in Apache projects new features are developed on a main codeline known as “trunk.” Occasionally very large features are developed on their own branches with the expectation they’ll later merge into trunk. While new features usually land in trunk before they reach a release, there is not much expectation of quality or stability. Periodically, candidate releases are branched from trunk. Once a candidate release is branched it usually stops getting new features. Bugs are fixed and after a vote, a release is declared for that particular branch. Any member of the community can create a branch for a release and name it whatever they like.
This was my summer internship project at Cloudera, and I’m very thankful for the level of support and mentorship I’ve received from the Apache HBase community. I started off in June with a very limited knowledge of both HBase and distributed systems in general, and by September, managed to get this patch committed to HBase trunk. I couldn’t have done this without a phenomenal amount of help from Cloudera and the greater HBase community.
The amount of memory available on a commodity server has increased drastically in tune with Moore’s law. Today, its very feasible to have up to 96 gigabytes of RAM on a mid-end, commodity server. This extra memory is good for databases such as HBase which rely on in memory caching to boost read performance.
Part 1 of this post covered how to convert and store email messages for archival purposes using Apache Hadoop, and outlined how to perform a rudimentary search through those archives. But, let’s face it: for search to be of any real value, you need robust features and a fast response time. To accomplish this we use Solr/Lucene-type indexing capabilities on top of HDFS and MapReduce.
Before getting into indexing within Hadoop, let us review the features of Lucene and Solr:
Apache Lucene and Apache Solr
Apache HBase 0.90.5 is now available. This release of the scalable distributed data store inspired by Google’s BigTable is a fix release that covers 81 issues, including 5 considered blockers, and 11 considered critical. The release addresses several robustness and resource leakage issues, fixes rare data-loss scenarios having to do with splits and replication, and improves the atomicity of bulk loads. This version includes some new supporting features including improvements to hbck and an offline meta-rebuild disaster recovery mechanism.
The 0.90.5 release is backward compatible with 0.90.4. Many of the fixes in this release will be included as part of CDH3u3.
Apache Whirr release 0.7.0 is now available. It includes changes covering over 50 issues, four of which were considered blockers. Whirr is a tool for quickly starting and managing clusters running on cloud services like Amazon EC2. This is the first Whirr release as a top level Apache project (previously releases were under the auspices of the Incubator). In addition to improving overall stability some of the highlights are described below:
Aparna Ramani is the Director of Engineering for Cloudera Enterprise.
Cloudera Manager 3.7, a major new version of Cloudera’s Management applications for Apache Hadoop, is now available. Cloudera Manager Free Edition is a free download, and the Enterprise edition of Cloudera Manager is available as part of the Cloudera Enterprise subscription.
This blog was originally posted on the Apache Blog: https://blogs.apache.org/flume/entry/flume_ng_architecture
Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating and moving large amounts of log data from many different sources to a centralized data store. Flume is currently undergoing incubation at The Apache Software Foundation. More information on this project can be found at http://incubator.apache.org/flume. Flume NG is work related to new major revision of Flume and is the subject of this post.
This guide is intended to be an introduction to Crunch.
Crunch is used for processing data. Crunch builds on top of Apache Hadoop to provide a simpler interface for Java programmers to process data. In Crunch you create pipelines, not unlike Unix pipelines, such as the command below:
San Francisco, Salesforce.com HQ - Recently there was an Apache HBase Pow-wow where project contributors gathered to discuss the directions of future releases of HBase in person. This group included a quorum of the core committers from Facebook, StumbleUpon, Salesforce, eBay, and Cloudera as well as many contributors and users from other companies. This was an open discussion, and in compliance with Apache Software Foundation policies, the agenda and detailed minutes were shared with the community at large so that everyone can chime in before any final decisions are made.
We summarize some of the high-level discussion topics:
The amount of information we are exposed to on a daily basis is far outstripping our ability to consume it, leaving many of us overwhelmed by the amount of new content available. Ideally we’d like machines and algorithms to help find the more interesting—to individual preference—items so we can easily focus our attention on these items of relevance.
Have you ever been recommended a friend on Facebook? Or an item you might be interested in on Amazon? If so then you’ve benefitted from the value of recommendation systems. Recommendation systems apply knowledge discovery techniques to the problem of making recommendations that are personalized for each user. Recommendation systems are one way we can use algorithms to help us sort through the masses of information to find the “good stuff” in a very personalized way.
Apache ZooKeeper release 3.4.0 is now available: it includes changes covering over 150 issues, 27 of which were considered blockers. ZooKeeper 3.3.3 clients are compatible with 3.4.0 servers, enabling a seamless upgrade path (3.4.0 clients with 3.3.3 servers has also been tested successfully). In addition to improving overall stability some of the highlights are described below:
Things of interest to developers implementing ZooKeeper clients:
This blog was originally posted on the Apache Blog:
Over 30 people attended the inaugural Sqoop Meetup on the eve of Hadoop World in NYC. Faces were put to names, troubleshooting tips were swapped, and stories were topped – with the table-to-end-all-tables weighing in at 28 billion rows.
The Apache Hive team is hard at work putting the finishing touches on the 0.8.0 release. While the release hasn’t reached the GA milestone yet, I think now would be a good time to start highlighting some of the new features and improvements that users can expect to find in this important update:
The infrastructure required to support table indexes was originally added in the 0.7.0 release, but at the time no viable indexing plugin was provided. Project contributors have remedied this situation in the 0.8.0 release with the inclusion of support for bitmap indexes. This is a very important addition to Hive since it promises to significantly increase the performance of queries on indexed tables. More information about Hive Table Indexes can be found in the original design document, as well as in the comments that accompany the Bitmap Index JIRA ticket.
Last month at the Web 2.0 Summit in San Francisco, Cloudera CEO Mike Olson presented some work the Cloudera Data Science Team did to analyze adverse drug events. We decided to share more detail about this project because it demonstrates how to use a variety of open-source tools – R, Gephi, and Cloudera’s Distribution Including Apache Hadoop (CDH) – to solve an old problem in a new way.
Background: Adverse Drug Events
An adverse drug event (ADE) is an unwanted or unintended reaction that results from the normal use of one or more medications. The consequences of ADEs range from mild allergic reactions to death, with one study estimating that 9.7% of adverse drug events lead to permanent disability. Another study showed that each patient who experiences an ADE remains hospitalized for an additional 1-5 days and costs the hospital up to $9,000.
The third annual Hadoop World conference has come and gone. The two days of conference keynotes and sessions were surrounded by receptions, meetups and training classes, and marked by plenty of time for networking in hallways and the exhibit hall. The energy exhibited at the conference was infectious and exchange of ideas outstanding. Nearly 1,500 people – almost double last year’s number – attended. They came from 580 companies, 27 countries and 40 of the United States. Big data is clearly a global phenomenon.
A number of architectural changes have been added to Hadoop MapReduce. The new MapReduce system is called MR2 (AKA MR.next). The first release version to include these changes will be Apache Hadoop 0.23.
A key change in the new architecture is the disappearance of the centralized JobTracker service. Previously, the JobTracker was responsible for provisioning the resources across the whole cluster, in addition to managing the life cycle of all submitted MapReduce applications; this typically included starting, monitoring and retrying the applications individual tasks. Throughout the years and from a practical perspective, the Hadoop community has acknowledged the problems that inherently exist in this functionally aggregated design (See MAPREDUCE-279).
The Apache Hadoop PMC has voted to release Apache Hadoop 0.23.0. This release is significant since it is the first major release of Hadoop in over a year, and incorporates many new features and improvements over the 0.20 release series. The biggest new features are HDFS federation, and a new MapReduce framework. There is also a new build system (Maven), Kerberos HTTP SPNEGO support, as well as some significant performance improvements which we’ll be covering in future posts. Note, however, that 0.23.0 is not a production release, so please don’t install it on your production cluster.
HDFS federation improves HDFS scalability by allowing multiple independent namenodes, each managing a portion of the namespace. Each datanode in the cluster can provide storage to all the namenodes (which means datanodes do not, for example, belong to a single namenode). Note that HDFS federation is not to be confused with HDFS High Availability, which will be coming in a future 0.23 release.
Cloudera believes that the flexibility and power of Apache Mahout (http://mahout.apache.org/) in conjunction with Hadoop is invaluable. Therefore, we have packaged the most recent stable release of Mahout (0.5) into CDH3u2, and we are very excited to work with the Mahout community becoming much more involved with the project as both Mahout & Hadoop continue to grow. You can test our CDH with Mahout integration by downloading our most recent release: https://ccp.cloudera.com/display/DOC/Downloading+CDH+Releases
Why we are packing Mahout with Hadoop?
Machine learning is an entire field devoted to Information Retrieval, Statistics, Linear Algebra, Analysis of Algorithms, and many other subjects. This field allows us to examine things such as recommendation engines involving new friends, love interests, and new products. We can do incredibly advanced analysis around genetic sequencing and examination, distributed search and frequency pattern matching, as well mathematical analysis with vectors, matrices, and singular value decomposition (SVD).
Several meetups for Apache Hadoop and Hadoop-related projects are scheduled for the evenings surrounding Hadoop World 2011. Make the most of your week in New York City by attending one or more of these meetups focusing on the Apache projects Hadoop, HBase, Sqoop, Hive and Flume. Food and beverages will be provided at each meetup. Join us to relax, get informed and network with your fellow conference attendees.