Cloudera Engineering Blog · CDH Posts
Cloudera recently announced the general availability of CDH4.1, an update to our open-source, enterprise-ready distribution of Apache Hadoop and related projects. Among various components, Apache Mahout is a relatively recent addition to CDH (first added to CDH3u2 in 2011), but is already attracting increasing interest out in the field.
Mahout started as a sub-project of Apache Lucene to provide machine-learning libraries in the area of clustering and classification. It later evolved into a top-level Apache project with much broader coverage of machine-learning techniques (clustering, classification, recommendation, frequent itemset mining etc.).
A few weeks back, Cloudera announced CDH 4.1, the latest update release to Cloudera’s Distribution including Apache Hadoop. This is the first release to introduce truly standalone High Availability for the HDFS NameNode, with no dependence on special hardware or external software. This post explains the inner workings of this new feature from a developer’s standpoint. If, instead, you are seeking information on configuring and operating this feature, please refer to the CDH4 High Availability Guide.
Since the beginning of the project, HDFS has been designed around a very simple architecture: a master daemon, called the NameNode, stores filesystem metadata, while slave daemons, called DataNodes, store the filesystem data. The NameNode is highly reliable and efficient, and the simple architecture is what has allowed HDFS to reliably store petabytes of production-critical data in thousands of clusters for many years; however, for quite some time, the NameNode was also a single point of failure (SPOF) for an HDFS cluster. Since the first beta release of CDH4 in February, this issue has been addressed by the introduction of a Standby NameNode, which provides automatic hot failover capability to a backup. For a detailed discussion of the design of the HA NameNode, please refer to the earlier post by my colleague Aaron Myers.
Limitations of NameNode HA in Previous Versions
I am very pleased to announce the availability of Cloudera Manager 4.1. This release adds support for the Cloudera Impala beta release, and management and monitoring of key CDH features.
Here are the highlights of Cloudera Manager 4.1:
After a long period of intense engineering effort and user feedback, we are very pleased, and proud, to announce the Cloudera Impala project. This technology is a revolutionary one for Hadoop users, and we do not take that claim lightly.
When Google published its Dremel paper in 2010, we were as inspired as the rest of the community by the technical vision to bring real-time, ad hoc query capability to Apache Hadoop, complementing traditional MapReduce batch processing. Today, we are announcing a fully functional, open-sourced codebase that delivers on that vision – and, we believe, a bit more – which we call Cloudera Impala. An Impala binary is now available in public beta form, but if you would prefer to test-drive Impala via a pre-baked VM, we have one of those for you, too. (Links to all downloads and documentation are here.) You can also review the source code and testing harness at Github right now.
Today we’re proud to announce a new addition to the Apache Hadoop ecosystem: Cloudera Impala, a parallel SQL engine that runs natively on Hadoop storage. The salient points are:
This is a guest post by Oliver Guinan, VP Ground Software, at Skybox Imaging. Oliver is a 15-year veteran of the internet industry and is responsible for all ground system design, architecture and implementation at Skybox.
One of the great promises of the big data movement is using networks of ubiquitous sensors to deliver insights about the world around us. Skybox Imaging is attempting to do just that for millions of locations across our planet.
Hue is a Web-based interface that makes it easier to use Apache Hadoop. Hue 2.1 (included in CDH4.1) provides a new application on top of Apache Oozie (a workflow scheduler system for Apache Hadoop) for creating workflows and scheduling them repetitively. For example, Hue makes it easy to group a set of MapReduce jobs and Hive scripts and run them every day of the week.
In this post, we’re going to focus on the Workflow component of the new application.
Apache Pig is a platform for analyzing large data sets that provides a high-level language called Pig Latin. Pig users can write complex data analysis programs in an intuitive and compact manner using Pig Latin.
Among many other enhancements, CDH4.1, the newest release of Cloudera’s open-source Hadoop distro, upgrades Pig from version 0.9 to version 0.10. This post provides a summary of the top seven new features introduced in CDH4.1 Pig.
Boolean Data Type
This is the second article in a series about analyzing Twitter data using some of the components of the Hadoop ecosystem available in CDH, Cloudera’s open-source distribution of Apache Hadoop and related projects. In the first article, you learned how to pull CDH components together into a single cohesive application, but to really appreciate the flexibility of each of these components, we need to dive deeper.
Every story has a beginning, and every data pipeline has a source. So, to build Hadoop applications, we need to get data from a source into HDFS.
Note (added July 8, 2013): The information below is deprecated; we suggest that you refer to this post for current instructions.
Today we bring you one user’s experience using Apache Whirr to spin up a CDH cluster in the cloud. This post was originally published here by George London (@rogueleaderr) based on his personal experiences; he has graciously allowed us to bring it to you here as well in a condensed form. (Note: the configuration described here is intended for learning/testing purposes only.)
Metrics are collections of information about Hadoop daemons, events and measurements; for example, data nodes collect metrics such as the number of blocks replicated, number of read requests from clients, and so on. For that reason, metrics are an invaluable resource for monitoring Apache Hadoop services and an indispensable tool for debugging system problems.
This blog post focuses on the features and use of the Metrics2 system for Hadoop, which allows multiple metrics output plugins to be used in parallel, supports dynamic reconfiguration of metrics plugins, provides metrics filtering, and allows all metrics to be exported via JMX.
Metrics vs. MapReduce Counters
Update time! As a reminder, Cloudera releases major versions of CDH, our 100% open source distribution of Apache Hadoop and related projects, annually and then updates to CDH every three months. Updates primarily comprise bug fixes but we will also add enhancements. We only include fixes or enhancements in updates that maintain compatibility, improve system stability and still allow customers and users to skip updates as they see fit.
We’re pleased to announce the availability of CDH4.1. We’ve seen excellent adoption of CDH4.0 since it went GA at the end of June and a number of exciting use cases have moved to production. CDH4.1 is an update that has a number of fixes but also a number of useful enhancements. Among them:
We’re getting really close to Strata Conference + Hadoop World 2012 (just over a month away), schedule planning-wise. So you may want to consider adding the tutorials, sessions, and keynotes below to your calendar! (Start times are always subject to change of course.)
The ones listed below are led or co-led by Clouderans, but there is certainly a wide range of attractive choices beyond what you see here. We just want to ensure that you put these particular ones high on your consideration list.
Social media has gained immense popularity with marketing teams, and Twitter is an effective tool for a company to get people excited about its products. Twitter makes it easy to engage users and communicate directly with them, and in turn, users can provide word-of-mouth marketing for companies by discussing the products. Given limited resources, and knowing we may not be able to talk to everyone we want to target directly, marketing departments can be more efficient by being selective about whom we reach out to.
In this post, we’ll learn how we can use Apache Flume, Apache HDFS, Apache Oozie, and Apache Hive to design an end-to-end data pipeline that will enable us to analyze Twitter data. This will be the first post in a series. The posts to follow to will describe, in more depth, how each component is involved and how the custom code operates. All the code and instructions necessary to reproduce this pipeline is available on the Cloudera Github.
Who is Influential?
For those new to it, Cloudera Manager is the first and market-leading management platform for CDH (Cloudera’s Distribution Including Apache Hadoop). Enterprise customers are coming to expect an end-to-end tool that manages the entire lifecycle of their Hadoop operations. In fact, in a recent Cloudera customer survey, an overwhelming 95% emphasized the need for this approach.
In this installment of “Meet the Engineer”, we meet with Eric Sammer (invariably known as just plain “Sammer”), Apache committer and author of the upcoming O’Reilly book, Hadoop Operations.
What do you do at Cloudera, and in which Apache project are you involved?
Organizations in diverse industries have adopted Apache Hadoop-based systems for large-scale data processing. As a leading force in Hadoop development with customers in half of the Fortune 50 companies, Cloudera is in a unique position to characterize and compare real-life Hadoop workloads. Such insights are essential as developers, data scientists, and decision makers reflect on current use cases to anticipate technology trends.
Recently we collaborated with researchers at UC Berkeley to collect and analyze a set of Hadoop traces. These traces come from Cloudera customers in e-commerce, telecommunications, media, and retail (Table 1). Here I will explain a subset of the observations, and the thoughts they triggered about challenges and opportunities in the Hadoop ecosystem, both present and in the future.
Learn how to configure a basic Maven project that will be able to build applications against CDH
Apache Maven is a build automation tool that can be used for Java projects. Since nearly all the Apache Hadoop ecosystem is written in Java, Maven is a great tool for managing projects that build on top of the Hadoop APIs. In this post, we’ll configure a basic Maven project that will be able to build applications against CDH (Cloudera’s Distribution Including Apache Hadoop) binaries.
Today ZDNet has very helpfully published a guide to downloading, configuring, and using Cloudera’s Demo VM for CDH4 (available in three flavors, but in this case the VMware version). As the author, Andrew Brust, explains, the VM contains a “pre-built, training-appropriate, 1-node Apache Hadoop cluster” (on top of CentOS). Perhaps most important for boot-strappers, it’s free.
The following is a guest post kindly offered by Adam Kawa, a 26-year old Hadoop developer from Warsaw, Poland. This post was originally published in a slightly different form at his blog, Hakuna MapData!
Recently I have found an interesting dataset, called Million Song Dataset (MSD), which contains detailed acoustic and contextual data about a million songs. For each song we can find information like title, hotness, tempo, duration, danceability, and loudness as well as artist name, popularity, localization (latitude and longitude pair), and many other things. There are no music files included here, but the links to MP3 song previews at 7digital.com can be easily constructed from the data.
In June 2012, Eli Collins (@elicollins), from Cloudera’s Platforms team, led a session at QCon New York 2012 on the subject “Introducing Apache Hadoop: The Modern Data Operating System.” During the conference, the QCon team had an opportunity to interview Eli about several topics, including important things to know about CDH4, main differences between MapReduce 1.0 and 2.0, Hadoop use cases, and more. It’s a great primer for people who are relatively new to Hadoop.
You can catch the full interview (video and transcript versions) here.
We are happy to announce the general availability of CDH3 update 5. This update is a maintenance release of CDH3 platform and provides a considerable amount of bug-fixes and stability enhancements. Alongside these fixes, we have also included a few new features, most notable of which are the following:
HttpFS is an HTTP gateway/proxy for Apache Hadoop FileSystem implementations. HttpFS comes with CDH4 and replaces HdfsProxy (which only provided read access). Its REST API is compatible with WebHDFS (which is included in CDH4 and the upcoming CDH3u5).
HttpFs is a proxy so, unlike WebHDFS, it does not require clients be able to access every machine in the cluster. This allows clients to to access a cluster that is behind a firewall via the WebHDFS REST API. HttpFS also allows clients to access CDH3u4 clusters via the WebHDFS REST API.
In Building and Deploying MR2, we presented a brief introduction to MapReduce in Apache Hadoop 0.23 and focused on the steps to setup a single-node cluster. In MapReduce 2.0 in Hadoop 0.23, we discussed the new architectural aspects of the MapReduce 2.0 design. This blog post highlights the main issues to consider when migrating from MapReduce 1.0 to MapReduce 2.0. Note that both MapReduce 1.0 and MapReduce 2.0 are included in CDH4.
It is important to note that, at the time of writing this blog post, MapReduce 2.0 is still Alpha, and it is not recommended to use it in production.
At 5 pm PDT on June 30, a leap second was added to the Universal Coordinated Time (UTC). Within an hour, Cloudera Support started receiving reports of systems running at 100% CPU utilization. The Support Team worked quickly to understand and diagnose the problem and soon published a solution. Bugs due to the leap second coupled with the Amazon Web Services outage would make this Cloudera’s busiest support weekend to date.
Since Hadoop is written in Java and closely interoperates with the underlying OS, Cloudera Support troubleshoots not only all 17 components in the Hadoop ecosystem, but also any underlying Linux and Java bugs. Last weekend many of our customers were affected by the now infamous “leap second” bugs. Initially, many assumed that Java and Linux would process the leap second gracefully. However, we soon discovered that this wasn’t the case and depending on the version of Linux being used, several distinct issues were observed.
Apache HBase is the Hadoop database, and is based on the Hadoop Distributed File System (HDFS). HBase makes it possible to randomly access and update data stored in HDFS, but files in HDFS can only be appended to and are immutable after they are created. So you may ask, how does HBase provide low-latency reads and writes? In this blog post, we explain this by describing the write path of HBase — how data is updated in HBase.
The write path is how an HBase completes put or delete operations. This path begins at a client, moves to a region server, and ends when data eventually is written to an HBase data file called an HFile. Included in the design of the write path are features that HBase uses to prevent data loss in the event of a region server failure. Therefore understanding the write path can provide insight into HBase’s native data loss prevention mechanism.
On Tuesday, June 12th The Churchill Club of Silicon Valley hosted a panel discussion on Hadoop’s evolution from an open-source project to becoming a standard component of today’s enterprise computing fabric. The lively and dynamic discussion was moderated by Cade Metz, Editor, Wired Enterprise.
Michael Driscoll, CEO, Metamarkets
Andrew Mendelsohn, SVP, Oracle Server Technologies
Mike Olson, CEO, Cloudera
Jay Parikh, VP Infrastructure Engineering, Facebook
John Schroeder, CEO, MapR
I’m very pleased to announce the immediate General Availability of CDH4 and Cloudera Manager 4 (part of the Cloudera Enterprise 4.0 subscription). These releases are an exciting milestone for Cloudera customers, Cloudera users and the open source community as a whole.
Both CDH4 and Cloudera Manager 4 are chock full of new features. Many new features will appeal to enterprises looking to move more important workloads onto the Apache Hadoop platform. CDH4 includes high availability for the filesystem, ability to support multiple namespaces, Apache HBase table and column level security, improved performance, HBase replication and greatly improved usability and browser support for the Hue web interface. Cloudera Manager 4 includes multi-cluster and multi-version support, automation for high availability and MapReduce2, multi-namespace support, cluster-wide heatmaps, host monitoring and automated client configurations.
We are happy to officially announce the general availability of CDH3 update 4. This update consists primarily of reliability enhancements as well as a number of minor improvements.
First, there have been a few notable HBase updates. In this release, we’ve upgraded Apache HBase to upstream version 0.90.6, improving system robustness and availability. Also, some of the recent hbck changes were incorporated to better detect and handle various types of corruptions. Lastly, HDFS append support is now disabled by default in this release as it is no longer needed for HBase. Please see the CDH3 Known Issues and Workarounds page for details.
This is a guest post by Assaf Yardeni, Head of R&D for Treato, an online social healthcare solution, headquartered in Israel.
Three years ago I joined Treato, a social healthcare analysis firm to help treato.com scale up to its present capability. Treato is a new source for healthcare information where health-related user generated content (UGC) from the Internet is aggregated and organized into usable insights for patients, physicians and other healthcare professionals. With oceans of patient-written health-related information available on the Web, and more being published each day, Treato needs to be able to collect and process vast amounts of data – Treato is Big Data par excellence, and my job has been to bring Treato to this stage.
Before the Hadoop era
I’m pleased to inform our users and customers that we have released the Cloudera’s Distribution Including Apache Hadoop version 4 (CDH4) 2nd and final beta today. We received great feedback from the community from the first beta and this release incorporates that feedback as well as a number of new enhancements.
CDH4 has a great many enhancements compared to CDH3.
Apache Hadoop consists of two primary components: HDFS and MapReduce. HDFS, the Hadoop Distributed File System, is the primary storage system of Hadoop, and is responsible for storing and serving all data stored in Hadoop. MapReduce is a distributed processing framework designed to operate on data stored in HDFS.
HDFS has long been considered a highly reliable file system. An empirical study done at Yahoo! concluded that across Yahoo!’s 20,000 nodes running Apache Hadoop in 10 different clusters in 2009, HDFS lost only 650 blocks out of 329 million total blocks. The vast majority of these lost blocks were due to a handful of bugs which have long since been fixed.
Cloudera and Cisco jointly announced a reference architecture for running Cloudera’s Distribution Including Apache Hadoop (CDH) and Cloudera Manager on Cisco’s Unified Computing System (UCS) last November. It was the first Apache Hadoop reference architecture assembled by Cisco, and is proudly certified by Cloudera.
I bring a different perspective on the Cloudera-Cisco relationship, as I worked for over five years in Cisco on the software powering the Nexus 5000 series switches and the Cisco Virtual Interface Card. I now work at Cloudera on the HBase team, and can fully appreciate the synergies that the Cloudera and Cisco reference architecture brings to the table.
Several weeks ago, I set about to demonstrate the ease with which Solr and Map/Reduce can be integrated. I was unable to find a simple, yet comprehensive, primer on integrating the two technologies. So I set about to write one.
What follows is my bare-bones tutorial on getting Solr up and running to index each word of the complete works of Shakespeare. Note: Special thanks to Sematext for looking over the Solr bits and making sure they are sane. Check them out if you’re going to be doing a lot of work with Solr, ElasticSearch, or search in general and want to bring in the experts.
First things first
In Building and Deploying MR2 we presented a brief introduction to MapReduce in Apache Hadoop 0.23 and focused on the steps to set up a single-node cluster. This blog provides developers with architectural details of the new MapReduce design.
Apache Hadoop 0.23 has major improvements over previous releases. Here are a few highlights on the MapReduce front; note that there are also major HDFS improvements, which are out of scope of this post.
MapReduce 2.0 (a.k.a. MRv2 or YARN):
I’m pleased to inform our users and customers that Cloudera has released its 4th version of Cloudera’s Distribution Including Apache Hadoop (CDH) into beta today. This release combines the input from our enterprise customers, partners and users with the hard work of Cloudera engineering and the larger Apache open source community to create what we believe is a compelling advance for this widely adopted platform.
There are a great many improvements and new capabilities in CDH4 compared to CDH3. Here is a high level list of what’s available for you to test in this first beta release:
Earlier today, Cloudera proudly released the Cloudera Connector for Tableau. The availability of this connector serves both Tableau users who are looking to expand the volume of datasets they manipulate and Hadoop users who want to enable analysts like Tableau users to make the data within Hadoop more meaningful. Enterprises can now extract the full value of big data and allow a new class of power users to interact with Hadoop data in ways they priorly could not.
The Cloudera Connector for Tableau is a free ODBC Driver that enables Tableau Desktop 7.0 to connect to Apache Hive. Tableau users can thus leverage Hive, Hadoop’s data warehouse system, as a data source for all the maps, charts, dashboards and other artifacts typically generated within Tableau.
Keeping with our release policy for Cloudera’s Distribution Including Apache Hadoop (CDH) I’m pleased to announce the availability of update 3 for CDH3. As a reminder, we ship updates for our most recent GA distribution every 3 months. Updates primarily include bug fixes but when possible we will also include features from our mid-term roadmap. We’ll only include new features when they do not introduce instability or break compatibility. As always, users have the option to skip updates without incurring any future upgrade cost.
Update 3 contains a number of new improvements. Several improvements positively impact performance. Enhancements were made to HDFS and to HBase which will result in 15-150% improvements in performance compared to CDH3 update 2 depending on the workload. Users should see performance gains in a wide range of workloads from MapReduce over HDFS style workloads to HBase scan style workloads to HBase random read / write workloads. Todd Lipcon’s talk at Hadoop World on performance outlines a number of these improvements that have made it to update 3. Some of these performance improvements require users to select specific configuration settings so please consult the documentation.
When most people first hear about data science, it’s usually in the context of how prominent web companies work with very large data sets in order to predict clickthrough rates, make personalized recommendations, or analyze UI experiments. The solutions to these problems require expertise with statistics and machine learning, and so there is a general perception that data science is intimately tied to these fields. However, in my conversations at academic conferences and with Cloudera customers, I have found that many kinds of scientists– such as astronomers, geneticists, and geophysicists– are working with very large data sets in order to build models that do not involve statistics or machine learning, and that these scientists encounter data challenges that would be familiar to data scientists at Facebook, Twitter, and LinkedIn.
The Practice of Data Science
The term “data science” has been subject to criticism on the grounds that it doesn’t mean anything, e.g., “What science doesn’t involve data?” or “Isn’t data science a rebranding of statistics?” The source of this criticism could be that data science is not a solitary discipline, but rather a set of techniques used by many scientists to solve problems across a wide array of scientific fields. As DJ Patil wrote in his excellent overview of building data science teams, the key trait of all data scientists is the understanding “that the heavy lifting of [data] cleanup and preparation isn’t something that gets in the way of solving the problem: it is the problem.”
Oracle selects CDH and Cloudera Manager as the Apache Hadoop Platform for the Oracle Big Data Appliance
Cloudera users gain more choice, tighter Oracle integration. Cloudera partners gain increased validation of their platform choice.
Ed leads business development for Cloudera. He is responsible for identifying new markets, revenue opportunities and strategic alliances for the company.
2011 was a breakthrough year for Apache Hadoop as many more mainstream organizations large and small turned to Hadoop to manage and process Big Data, while enterprise software and hardware vendors have also made Hadoop a prominent part of their offerings. Big Data and Hadoop became synonymous in much of the enterprise discourse, and Big Data interest is not restricted to Big Companies.
Apache Hadoop Releases
Hadoop had three major releases in 2011: 1.0 (AKA 0.20.205.x), 0.22, and 0.23.
This guest blog post is from Alex Loddengaard, creator of FoneDoktor, an Android app that monitors phone usage and recommends performance and battery life improvements. FoneDoktor uses WibiData, a data platform built on Apache HBase from Cloudera’s Distribution including Apache Hadoop, to store and analyze Android usage data. In this post, Alex will discuss FoneDoktor’s implementation and discuss why WibiData was a good data solution. A version of this post originally appeared at the WibiData blog.
At last month’s Hadoop World, one of the sessions spotlighted FoneDoktor, an Android app that collects data about device performance and app resource usage to offer personalized battery and performance improvement recommendations directly to users. In this post, I’ll talk about how I used WibiData — a system built on Apache HBase from CDH — as FoneDoktor’s primary data storage, access, and analysis system.
The Development track at Hadoop World is a technical deep dive dedicated to discussion about Apache Hadoop and application development for Apache Hadoop. You will hear committers, contributors and expert users from various Hadoop projects discuss the finer points of building applications with Hadoop and the related ecosystem. The sessions will touch on foundational topics such as HDFS, HBase, Pig, Hive, Flume and other related technologies. In addition, speakers will address key development areas including tools, performance, bringing the stack together and testing the stack. Sessions in this track are for developers of all levels who want to learn more about upcoming features and enhancements, new tools, advanced techniques and best practices.
This blog was originally posted on the Apache Blog: https://blogs.apache.org/sqoop/entry/apache_sqoop_overview
Using Hadoop for analytics and data processing requires loading data into clusters and processing it in conjunction with other data that often resides in production databases across the enterprise. Loading bulk data into Hadoop from production systems or accessing it from map reduce applications running on large clusters can be a challenging task. Users must consider details like ensuring consistency of data, the consumption of production system resources, data preparation for provisioning downstream pipeline. Transferring data using scripts is inefficient and time consuming. Directly accessing data residing on external systems from within the map reduce applications complicates applications and exposes the production system to the risk of excessive load originating from cluster nodes.
Ari Rabkin is a summer intern at Cloudera, working with the engineering team to help make Hadoop more usable and simpler to configure. The rest of the year, Ari is a PhD student at UC Berkeley. He’s applying the results of recent research to automatically find and document configuration options for Hadoop.
Hadoop has a key-value style of configuration, where each configuration option has a name and a value. There is no central list of options, and it’s easy for developers to add new configuration options as needed. Unfortunately, this opens the way for bugs and for erroneous documentation. Not all documented options exist and many options are undocumented. Options can have different default values in the code and the configuration files. (In which case, the config-file default will win.)
Continuing with our practice from Cloudera’s Distribution Including Apache Hadoop v2 (CDH2), our goal is to provide regular (quarterly), predictable updates to the generally available release of our open source distribution. For CDH3 the first such update is available today, approximately 3 months from when CDH3 went GA.
For those of you who are recent Cloudera users, here is a refresh on our update policy:
Pero works on research and development in new technologies for online advertising at Aol Advertising R&D in Palo Alto. Over the past 4 years he has been the Chief Architect of R&D distributed ecosystem comprising more than thousand nodes in multiple data centers. He also led large-scale contextual analysis, segmentation and machine learning efforts at AOL, Yahoo and Cadence Design Systems and published patents and research papers in these areas.
A critical premise for success of online advertising networks is to successfully collect, organize, analyze and use large volumes of data for decision making. Given the nature of their online orientation and dynamics, it is critical that these processes be automated to the largest extent possible.
Philip Zeyliger is a software engineer at Cloudera and started the SCM
Two weeks ago, at Hadoop Summit, we released our Service and Configuration Manager (SCM) Express. It’s a dramatically simpler and faster way to get started with Cloudera’s Distribution including Apache Hadoop (CDH). In a previous blog post, we talked in some detail about SCM Express and what it can do for you.
Phil Langdale is a software engineer at Cloudera and the technical lead for Cloudera’s SCM Express product.