Cloudera Blog · CDH Posts
Cloudera Impala has many exciting features, but one of the most impressive is the ability to analyze data in multiple formats, with no ETL needed, in HDFS and Apache HBase. Furthermore, you can use multiple frameworks, such as MapReduce and Impala, to analyze that same data. Consequently, Impala will often run side-by-side with MapReduce on the same physical hardware, with both supporting business-critical workloads. For such multi-tenant clusters, Impala and MapReduce both need to perform well despite potentially conflicting demands for cluster resources.
In this post, we’ll share our experiences configuring Impala and MapReduce for optimal multi-tenant performance. Our goal is to help users understand how to tune their multi-tenant clusters to meet production service level objectives (SLOs), and to contribute to the community some test methods and performance models that can be helpful beyond Cloudera.
Defining Realistic Test Scenarios
Cloudera’s broad and diverse customer base makes it a top concern to do testing for real-world scenarios. Realistic tests based on common use cases offer meaningful guidance, whereas guidance based on contrived, unrealistic testing often fails to translate to real-life deployments.
As you may know, Apache HBase has a vibrant community and gets a lot of contributions from developers worldwide. The collaborative development effort is so active, in fact, that a new point-release comes out about every six weeks (with the current stable branch being 0.94).
At Cloudera, we’re committed to ensuring that CDH, our open source distribution of Apache Hadoop and related projects (including HBase), ships with the results of this steady progress. Thus, CDH 4.2 was rebased on 0.94.2, as compared to its predecessor CDH 4.1, which was based on 0.92.1. CDH 4.3 has moved one step further and is rebased on 0.94.6.1.
Apart from the rebase, CDH 4.3 also has some important bug fixes backported from later versions of 0.94 and trunk. Following are some of the important features and improvements that now ship in CDH 4.2/CDH 4.3:
Yesterday we announced the availability of Cloudera Manager 4.6. As part of this release, the Free Edition of Cloudera Manager (now a part of Cloudera Standard) has been enhanced significantly to include many features formerly only available with a subscription license:
Today is a big day: Cloudera is not only urging our customers to “Unaccept the Status Quo” (the continued and accelerating spending on data warehousing, expensive data storage, and associated software licenses), but we also announced that Cloudera Search has entered public beta. Now anyone who knows how to do a Google search can query data stored in Cloudera’s Platform for Big Data.
In this post, however, I’d like to explain the new, simpler product naming/packaging structure that will make adopting and deploying Cloudera more straightforward.
Introducing Cloudera Standard
From now on, in addition to CDH, our 100% open source distribution of Apache Hadoop and related projects that is always available to whoever wants to try it, we will offer customers two options that also include Cloudera Manager, our management automation software:
One of the unexpected pleasures of open source development is the way that technologies adapt and evolve for uses you never originally anticipated.
Seven years ago, Apache Hadoop sprang from a project based on Apache Lucene, aiming to solve a search problem: how to scalably store and index the internet. Today, it’s my pleasure to announce Cloudera Search, which uses Lucene (among other things) to make search solve a Hadoop problem: how to let non-technical users interactively explore and analyze data in Hadoop.
Cloudera Search is released to public beta, as of today. (See a demo here; get installation instructions here.) Powered by Apache Solr 4.3, Cloudera Search allows hundreds of users to search petabytes of Hadoop data interactively.
I’m pleased to announce that CDH 4.3 is released and available for download. This is the third quarterly update to our GA shipping CDH 4 line and the 17th significant release of our 100% open source Apache Hadoop distribution.
CDH 4.3 is primarily focused on maintenance. There are more than 400 bug fixes included in this release across the components of the CDH stack. This represents a great step forward in quality, security, and performance.
There are also a few new features in this release. One new feature is the ability of HDFS to rebalance within a datanode. This is a great (configurable) way to help prevent drive failure and maintain performance without having to run more disruptive cluster-wide rebalances. Hue has also received a number of new features, including a Pig editor and support for using the HDFS trash bin.
This week I’d like to highlight King.com, a European social gaming giant that recently claimed the throne for having the most daily active users (more than 66 million). King.com has methodically and successfully expanded its reach beyond mainstream social gaming to dominate the mobile gaming market — it offers a streamlined experience that allows gamers to pick up their gaming session from wherever they left off, in any game and on any device. King.com’s top games include “Candy Crush Saga” and “Bubble Saga”.
And — you guessed it — King.com runs on CDH.
With a business model that offers all games for free, King.com relies advertising and in-game products like boosters and extra lives to generate revenue. In other words, it has to be smart in every communication with customers in order to create value for both the gamer and the advertiser.
Have you ever wished you could upgrade to the latest CDH minor release with just a few mouse clicks, and even without taking any downtime on your cluster? Well, with Cloudera Manager 4.5 and its new “Parcel” feature, you can!
That release introduced many new features and capabilities related to parcels, and in this FAQ-oriented post, you will learn about most of them.
What are parcels?
Parcel is an alternative binary distribution format supported for the first time in Cloudera Manager 4.5. There are a few notable differences between parcels and traditional CDH rpm/deb packages:
At Cloudera, we have the privilege of helping thousands of developers learn Apache Hadoop, as well as build and deploy systems and applications on top of Hadoop. While we (and many of you) believe that platform is fast becoming a staple system in the data center, we’re also acutely aware of its complexities. In fact, this is the entire motivation behind Cloudera Manager: to make the Hadoop platform easy for operations staff to deploy and manage.
So, we’ve made Hadoop much easier to “consume” for admins and other operators — but what about for developers, whether working for ISVs, SIs, or users? Until now, they’ve largely been on their own.
That’s why we’re really excited to announce the Cloudera Developer Kit (CDK), a new open source project designed to help developers get up and running to build applications on CDH, Cloudera’s open source distribution including Hadoop, faster and easier than before. The CDK is a collection of libraries, tools, examples, and documentation engineered to simplify the most common tasks when working with the platform. Just like CDH, the CDK is 100% free, open source, and licensed under the same permissive Apache License v2, so you can use the code any way you choose in your existing commercial code base or open source project.
This week, the Cloudera Sessions head to Washington, DC, and Columbus, Ohio, where attendees will hear from AOL, Explorys, and Skybox Imaging about the ways Apache Hadoop can be used to optimize digital content, to improve the delivery of healthcare, and to generate high-resolution images of the entire globe that provide value to retailers, farmers, government organizations and more.
I’d like to take this opportunity to shine a spotlight on Skybox Imaging, an innovative company that is putting Hadoop to work to help us see the world more clearly, literally.
Skybox’s vice president of ground software, Ollie Guinan, recently posted a guest blog to Cloudera.com to give readers a glimpse into their Hadoop use case, which I’d like to promote again here. I would encourage anyone in the DC area to meet Ollie (who is also a Champion of Big Data) in person at the Cloudera Sessions event in DC this Tuesday to learn more about Skybox and its fascinating use case.