Cloudera Engineering Blog · Cloudera Manager Posts
The best data protection strategy is to remove sensitive information from every place it’s not needed.
Have you ever wondered what sort of “sensitive” information might wind up in Apache Hadoop log files? For example, if you’re storing credit card numbers inside HDFS, might they ever “leak” into a log file outside of HDFS? What about SQL queries? If you have a query like
select * from table where creditcard = '1234-5678-9012-3456', where is that query information ultimately stored?
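To make the concern concrete, here is a minimal sketch of masking card-like numbers in a log line before it is written. It is an illustration only, not Cloudera's redaction mechanism: the pattern `CARD_PATTERN`, the function `redact`, and the mask string are all assumptions made for this example.

```python
import re

# Hypothetical pattern: four groups of four digits separated by dashes or spaces.
# Real card formats vary, so treat this as an illustrative rule only.
CARD_PATTERN = re.compile(r"\b\d{4}[- ]\d{4}[- ]\d{4}[- ]\d{4}\b")


def redact(line: str, mask: str = "XXXX-XXXX-XXXX-XXXX") -> str:
    """Replace anything that looks like a card number with a fixed mask."""
    return CARD_PATTERN.sub(mask, line)


query = "select * from table where creditcard = '1234-5678-9012-3456'"
print(redact(query))
```

A rule like this only helps if it is applied at every point where the query text can be logged, which is why it matters to know where that information ultimately ends up.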
We’re pleased to announce the release of Cloudera Enterprise 5.4 (comprising CDH 5.4, Cloudera Manager 5.4, and Cloudera Navigator 2.3).
Cloudera Enterprise 5.4 (Release Notes) reflects critical investments in a production-ready customer experience through governance, security, performance and deployment flexibility in cloud environments. It also includes support for a significant number of updated open standard components–including Apache Spark 1.3, Impala 2.2, and Apache HBase 1.0 (as well as unsupported beta releases of Hive-on-Spark data processing and OpenStack deployments).
Following these best practices can make your upgrade path to CDH 5 relatively free of obstacles.
Upgrading the software that powers mission-critical workloads can be challenging in any circumstance. In the case of CDH, however, Cloudera Manager makes upgrades easy, and the built-in Upgrade Wizard, available with Cloudera Manager 5, further simplifies the upgrade process. The wizard performs service-specific upgrade steps that you previously had to run manually, and also features a rolling restart capability that reduces downtime for minor and maintenance version upgrades. (Please refer to this blog post or webinar to learn more about the Upgrade Wizard.)
Cloudera Director 1.1 introduces new features and improvements that provide more options for creating and managing cloud deployments of Apache Hadoop. Here are details about how they work.
Cloudera Director, which was released in October of 2014, delivers production-ready, self-service interaction with Apache Hadoop clusters in cloud environments. You can find background information about Cloudera Director’s purpose and fundamental features in our earlier introductory blog post and technical overview blog post.
An improved upgrade wizard in Cloudera Manager 5.3 makes it easy to upgrade CDH on your clusters.
Upgrades can be hard, and any downtime to mission-critical workloads can have a direct impact on revenue. Upgrading the software that powers these workloads can often be an overwhelming and uncertain task that can create unpredictable issues. Apache Hadoop can be especially complex as it consists of dozens of components running across multiple machines. That’s why an enterprise-grade administration tool is necessary for running Hadoop in production, and is especially important when taking the upgrade plunge.
We’re pleased to announce the release of Cloudera Enterprise 5.3 (comprising CDH 5.3, Cloudera Manager 5.3, and Cloudera Navigator 2.2).
This release continues the drumbeat for security functionality in particular, with HDFS encryption (jointly developed with Intel under Project Rhino) now recommended for production use. This feature alone should justify upgrades for security-minded users (and an improved CDH upgrade wizard makes that process easier).
A significant vulnerability affecting the entire Apache Hadoop ecosystem has now been patched. What was involved?
By now, you may have heard about the POODLE (Padding Oracle On Downgraded Legacy Encryption) attack on TLS (Transport Layer Security). This attack combines a cryptographic flaw in the obsolete SSLv3 protocol with the ability of an attacker to downgrade TLS connections to use that protocol. The result is that an active attacker on the same network as the victim can potentially decrypt parts of an otherwise encrypted channel. The only immediately workable fix has been to disable the SSLv3 protocol entirely.
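As an illustration of what "disabling SSLv3" looks like from a client's point of view, the sketch below uses Python's standard `ssl` module to refuse any protocol older than TLS 1.2. This is not the Hadoop patch itself; the helper names `strict_tls_context` and `may_negotiate_sslv3` are assumptions made for this example.

```python
import ssl


def strict_tls_context() -> ssl.SSLContext:
    """Build a client context that cannot be downgraded to SSLv3."""
    ctx = ssl.create_default_context()
    # Raising the protocol floor removes SSLv3 (and TLS 1.0/1.1) from negotiation,
    # so a downgrade attempt fails the handshake instead of succeeding quietly.
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    return ctx


def may_negotiate_sslv3(ctx: ssl.SSLContext) -> bool:
    """Conservative check: True if the context's floor is at or below SSLv3."""
    return ctx.minimum_version <= ssl.TLSVersion.SSLv3


ctx = strict_tls_context()
print(ctx.minimum_version.name, may_negotiate_sslv3(ctx))
```

Setting a floor on the protocol version, rather than patching around the padding flaw, mirrors the point made above: the obsolete protocol is removed from the negotiation entirely.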
Installing CDH on newer unsupported operating systems (such as Ubuntu 13.04 and later) can lead to conflicts. These guidelines will help you avoid them.
Some of the more recently released operating systems that bundle portions of the Apache Hadoop stack in their respective distro repositories can conflict with software from Cloudera repositories. Consequently, when you set up CDH for installation on such an OS, you may end up picking up packages with the same name from the OS’s distribution instead of Cloudera’s distribution. Package installation may succeed, but using the installed packages may lead to unforeseen errors.
With Cloudera Director, cloud deployments of Apache Hadoop are now as enterprise-ready as on-premise ones. Here’s the technology behind it.
As part of the recent Cloudera Enterprise 5.2 release, we unveiled Cloudera Director, a new product that delivers enterprise-class, self-service interaction with Hadoop clusters in cloud environments. (Cloudera Director is free to download and use, but commercial support requires a Cloudera Enterprise subscription.) It provides a centralized administrative view for cloud deployments and lets end users provision and scale clusters themselves using automated, repeatable, managed processes. To summarize, the same enterprise-grade capabilities that are available with on-premise deployments are now also available for cloud deployments. (For an overview of and motivation for Cloudera Director, please check out this blog post.)
Cloudera Enterprise’s newest release contains important new security and performance features, and offers support for the latest innovations in the open source platform.
We’re pleased to announce the release of Cloudera Enterprise 5.1 (comprising CDH 5.1, Cloudera Manager 5.1, and Cloudera Navigator 2.0).
It’s been a while since we provided a how-to for this purpose. Thanks, Daan Debie (@DaanDebie), for allowing us to re-publish the instructions below (for CDH 5)!
I recently started as a Big Data Engineer at The New Motion. While researching our best options for running an Apache Hadoop cluster, I wanted to try out some of the features available in the newest version of Cloudera’s Hadoop distribution: CDH 5. Of course I could’ve downloaded the QuickStart VM, but I wanted to run a virtual cluster instead, making use of the 16 GB of RAM my shiny new 15″ Retina MacBook Pro has ;)
Unique across all options, Cloudera Manager makes it easy to do what would otherwise be a disruptive operation for operators and users.
For the increasing number of customers that rely on enterprise data hubs (EDHs) for business-critical applications, it is imperative to minimize or eliminate downtime — thus, Cloudera has focused intently on making software upgrades a routine, non-disruptive operation for EDH administrators and users.
Thanks to Bill Podell, VP Big Data and BI Practice, MBI Solutions, for the guest post below.
Capacity planning has long been a critical component of successful implementations for production systems. Today, Big Data calls for a particularly deep understanding of capacity management – because resource utilization explodes as business users, analysts, and data scientists jump onboard to analyze and use newly found data. The resource impact can escalate very quickly, causing poor loading and/or response times. The usual result is throwing more hardware at the issue without any understanding of what impact the new hardware will have on it. It is far better to be proactive and know about the problem before it even occurs!
Thanks to Jonathan Natkins of WibiData for the post below about how his company extended Cloudera Manager to manage Kiji. Learn more about Kiji and the organizations using it to build real-time HBase applications at Kiji Sessions, happening on May 6, 2014, the day after HBaseCon.
As a partner of Cloudera, WibiData sees Cloudera Manager’s new extensibility framework as one of the most exciting parts of Cloudera Enterprise 5. Cloudera Manager 5.0.0 provides the single-pane view that Apache Hadoop administrators and operators want to effectively manage a cluster of machines. Additionally, Cloudera Manager now offers tight integration for partners to plug into the CDH ecosystem, which benefits Cloudera as well as WibiData.
In this installment of “Meet the Engineer”, our subject is Andrei Savu!
What do you do at Cloudera?
The GA release of Cloudera Enterprise 5 signifies the evolution of the platform from a mere Apache Hadoop distribution into an enterprise data hub.
We are thrilled to announce the GA release of Cloudera Enterprise 5 (comprising CDH 5.0 and Cloudera Manager 5.0).
Cloudera has released the Beta 2 version of Cloudera Enterprise 5 (comprising CDH 5.0.0 and Cloudera Manager 5.0.0).
This release (download) contains a number of new features and component versions including the ones below:
More and more customers are using automation/configuration management frameworks alongside Cloudera Manager.
As Apache Hadoop clusters continue to grow in size, complexity, and business importance as the foundational infrastructure for an Enterprise Data Hub, the use cases for a robust and mature management console expand.
Learn the new features and enhancements in Cloudera Manager 5, including support for YARN, management of third-party apps and frameworks, and more.
The response to the Oct. 2013 release of Cloudera Enterprise 5 Beta has been overwhelming, and Cloudera is working closely with several customers to incorporate their feedback.
Some things for which we are thankful, the 2013 edition (not listed in order):
1. The entire Apache Hadoop community for its constant and hard work to Make the Platform Better,
Cloudera Manager 4.7 added support for managing Cloudera Search 1.0. Thus Cloudera Manager users can easily deploy all components of Cloudera Search (including Apache Solr) and manage all related services, just like every other service included in CDH (Cloudera’s distribution of Apache Hadoop and related projects).
In this how-to, you will learn the steps involved in adding Cloudera Search to a Cloudera Enterprise (CDH + Cloudera Manager) cluster.
Installing the SOLR Parcel
We are pleased to announce the beta release of Cloudera Enterprise 5 (CDH 5 and Cloudera Manager 5). This release has both Cloudera Impala and Cloudera Search integrated into CDH. It also includes many new features and updated component versions including the ones below:
I’ve always held a strong bias that education is most effective when the student learns by doing. As a developer of technical curricula, my goal is to have training participants engage with real and relevant problems as much as possible through hands-on exercises. The high rate at which Apache Hadoop is changing, both as a technology and as an ecosystem, makes developing Cloudera training courses not only demanding but also seriously fun and rewarding.
I recently undertook the challenge of upgrading the Cloudera Administrator Training for Apache Hadoop. I more than quadrupled the number of hands-on exercises from the previous version, adding a full day to the course. At four days, it’s now the most thorough training for Hadoop administrators and truly the best way to start building expertise.
Cloudera Manager 4.7 is an update to Cloudera Manager 4 and contains a number of bug fixes and usability improvements. Furthermore, we have introduced new features such as:
StackIQ takes a “software defined infrastructure” approach to provision and manage cluster infrastructure that sits below Big Data platforms such as Apache Hadoop. In the guest post below, StackIQ co-founder and VP Engineering Greg Bruno explains how to install Cloudera Enterprise on top of StackIQ’s management system so they can work together.
The hardware used for this deployment is a small cluster: one node (i.e. one server) for the StackIQ Cluster Manager and four nodes as backend/data nodes. Each node has two disks and all nodes are connected via 1Gb Ethernet on a Private Network. The Cluster Manager node is also connected to a Public Network using its second NIC. (StackIQ Cluster Manager is used in similar deployments between two nodes and 4,000+ nodes in size.)
The following guest post is re-published here courtesy of Gerd König, a System Engineer with YMC AG. Thanks, Gerd!
Cloudera Manager is a great tool to orchestrate your CDH-based Apache Hadoop cluster. You can use it for everything from cluster installation, deploying configurations, and restarting daemons to monitoring each cluster component. Starting with version 4.6, Cloudera Manager supports the integration of Cloudera Search, which is currently in beta. In this post I’ll show you the required steps to set up a Hadoop cluster via Cloudera Manager and how to integrate Cloudera Search.
Cloudera’s new Parcels installation format has been released, and I’m excited to highlight just how useful (and mind-blowingly cool) it is to system administrators and anyone responsible for maintaining a CDH cluster.
If you haven’t read about or played with Parcels, they make components of the distribution significantly easier to manage, install, and upgrade. The new Parcel distribution format works with Cloudera Manager 4.5 and later. When you perform installations and upgrades using Parcels, you get access to new Cloudera Manager features such as:
The following guest post, from Mike Pittaro of Dell’s Cloud Software Solutions team, describes his team’s use of the Dell Crowbar tool in conjunction with the Cloudera Manager API to automate cluster provisioning. Thanks, Mike!
Deploying, managing, and operating Apache Hadoop clusters can be complex at all levels of the stack, from the hardware on up. To hide this complexity and reduce deployment time, since 2011, Dell has been using Dell Crowbar in conjunction with Cloudera Manager to deploy the Dell | Cloudera Solution for Apache Hadoop for joint customers.
This is a great day for technical end-users – developers, admins, analysts, and data scientists alike. Starting now, Cloudera complements its traditional mailing lists with new, feature-rich community forums intended for users of Cloudera’s Platform for Big Data! (Log in using your existing credentials or click the link to register.)
Although mailing lists have long been a standard for user interaction, and will undoubtedly continue to be, they have flaws. For example, they lack structure or taxonomy, which makes consumption difficult. Search functionality is often less than stellar and users are unable to build reputations that span an appreciable period of time. For these reasons, although they’re easy to create and manage, mailing lists inherently limit access to knowledge and hence limit adoption.
In a prior blog post, Omar explained two important concepts introduced in Cloudera Manager 4.5: Role Groups and Host Templates. In this post, I’ll demonstrate how to use role groups and host templates to easily expand an existing CDH cluster onto heterogeneous hardware. If you haven’t already looked at Omar’s post, I’d recommend doing so before reading this one, as I’ll assume you are familiar with role groups and host templates.
Although these instructions/screenshots are premised on Cloudera Manager 4.5, they are valid for subsequent releases as well.
Initial State and Goal
At Cloudera, we believe that Cloudera Manager is the best way to install, configure, manage, and monitor your Apache Hadoop stack. Of course, most users prefer not to take our word for it — they want to know how Cloudera Manager works under the covers, first.
In this post, I’ll explain some of its inner workings.
The Vocabulary of Cloudera Manager
We’re very pleased to bring you this guest post from Verisign engineer Benoit Perroud, which is based on his personal experiences with the new “Parcel” binary distribution format in Cloudera Manager 4.5.
Among all the new features released with Cloudera Manager 4.5, Parcel is probably one of the most unnoticed – despite the fact it has the potential to become the administrator’s best friend.
Yesterday we announced the availability of Cloudera Manager 4.6. As part of this release, the Free Edition of Cloudera Manager (now a part of Cloudera Standard) has been enhanced significantly to include many features formerly only available with a subscription license:
The news this morning focused on the launch of Cloudera Search, an exciting new capability for our platform that was much anticipated by our customers and engineers. A new version of Cloudera Manager (4.6) was released at the same time.
Cloudera Manager 4.6 includes a number of enhancements as well as improvements in quality and usability. (A follow-on blog post will do a deep dive on the new features and functions.) Most notable in Cloudera Manager 4.6 is that the free version (included in Cloudera Standard) is greatly enhanced. Cloudera Standard now includes monitoring, health checks, events & alerts, log search, Kerberos automation, and multi-cluster support.
Today is a big day: Cloudera is not only urging our customers to “Unaccept the Status Quo” (the continued and accelerating spending on data warehousing, expensive data storage, and associated software licenses), but we also announced that Cloudera Search has entered public beta. Now anyone who knows how to do a Google search can query data stored in Cloudera’s Platform for Big Data.
In this post, however, I’d like to explain the new, simpler product naming/packaging structure that will make adopting and deploying Cloudera more straightforward.
Introducing Cloudera Standard
Helping users manage hundreds of configurations for the growing family of Apache Hadoop services has always been one of Cloudera Manager’s main goals. Prior to version 4.5, it was possible to set configurations at the service (e.g. hdfs), role type (e.g. all datanodes), or individual role level (e.g. the datanode on machine17). An individual role would inherit the configurations set at the service and role-type levels. Configurations made at the role level would override those from the role-type level. While this approach offered flexibility when configuring clusters, it was tedious to configure subsets of roles in the same way.
In Cloudera Manager 4.5, this issue is addressed with the introduction of role groups. For each role type, you can create role groups and assign configurations to them. The members of those groups then inherit those configurations. For example, in a cluster with heterogeneous hardware, a datanode role group can be created for each host type and the datanodes running on those hosts can be assigned to their corresponding role group. That makes it possible to tweak the configurations for all the datanodes running on the same hardware by modifying the configurations of one role group.
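The inheritance order described above can be sketched in a few lines. This is a toy model, not Cloudera Manager's internals or API: the role names, group names, paths, and the `effective_config` helper are all hypothetical, and it assumes that an individual role's own settings still take precedence over its role group.

```python
from collections import ChainMap

# Hypothetical settings at each level (values are illustrative only).
service_config = {"dfs.replication": "3"}
role_groups = {
    "datanode-large-disk": {"dfs.datanode.data.dir": "/data/1,/data/2,/data/3,/data/4"},
    "datanode-small-disk": {"dfs.datanode.data.dir": "/data/1"},
}
# Each role belongs to exactly one role group for its role type.
membership = {
    "datanode-machine17": "datanode-small-disk",
    "datanode-machine18": "datanode-large-disk",
}
# Optional per-role overrides (assumed to win over the group).
role_overrides = {"datanode-machine17": {"dfs.datanode.data.dir": "/mnt/special"}}


def effective_config(role: str) -> dict:
    """Resolve a role's settings: role override, then role group, then service."""
    group = role_groups[membership[role]]
    # ChainMap looks keys up left to right, so earlier mappings take precedence.
    return dict(ChainMap(role_overrides.get(role, {}), group, service_config))


print(effective_config("datanode-machine18"))
print(effective_config("datanode-machine17"))
```

In this model, changing one entry in `role_groups` changes the result for every member of that group at once, which is the convenience role groups are meant to provide on heterogeneous hardware.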
Have you ever wished you could upgrade to the latest CDH minor release with just a few mouse clicks, and even without taking any downtime on your cluster? Well, with Cloudera Manager 4.5 and its new “Parcel” feature, you can!
That release introduced many new features and capabilities related to parcels, and in this FAQ-oriented post, you will learn about most of them.
What are parcels?
One of the complexities of Apache Hadoop is the need to deploy clusters of servers, potentially on a regular basis. At Cloudera, which at any time maintains hundreds of test and development clusters in different configurations, this process presents a lot of operational headaches if not done in an automated fashion. In this post, I’ll describe an approach to cluster automation that works for us, as well as many of our customers and partners.
At Cloudera engineering, we have a big support matrix: We work on many versions of CDH (multiple release trains, plus things like rolling upgrade testing), and CDH works across a wide variety of OS distros (RHEL 5 & 6, Ubuntu Precise & Lucid, Debian Squeeze, and SLES 11), and complex configuration combinations — highly available HDFS or simple HDFS, Kerberized or non-secure, using YARN or MR1 as the execution framework, etc. Clearly, we need an easy way to spin-up a new cluster that has the desired setup, which we can subsequently use for integration, testing, customer support, demos, and so on.
As Cloudera’s keeper of customer stories, it’s dawned on me that others might benefit from the information I’ve spent the past year collecting: the many use cases and deployment patterns for Hadoop amongst our customer base.
This week I’d like to highlight Nokia, a global company that we’re all familiar with as a large mobile phone provider, and whose Senior Director of Analytics – Amy O’Connor – will be speaking at tomorrow’s Cloudera Sessions event in Boston.
Today Cloudera announced a new Cloudera Academic Partnership program, in which participating universities worldwide get access to curriculum, training, certification, and software.
As noted in the press release, the global demand for people with Apache Hadoop and data science skills is dwarfing all supply. We consider it an important mission to help accredited universities meet that demand, by equipping them with the content and training they need to educate students in the Hadoop arts.
Vagrant is a very nice tool for programmatically managing many virtual machines (VMs) on a single physical machine. It natively supports VirtualBox and also provides plugins for VMware Fusion and Amazon EC2, supporting the management of VMs in those environments as well.
Editor’s Note (added Feb. 25, 2015): For releases beyond 4.5, Cloudera recommends the use of Cloudera Director for deploying CDH in cloud environments.
Cloudera Manager includes a new express installation wizard for Amazon Web Services (AWS) EC2. Its goal is to enable Cloudera Manager users to provision CDH clusters and Cloudera Impala (the open source distributed query engine for Apache Hadoop) on EC2 as easily as possible (for testing and development purposes only; not supported for production workloads). It is thus currently the fastest way to provision a Cloudera Manager-managed cluster in EC2.
Last week Cloudera released version 4.5 of Cloudera Manager, the leading framework for end-to-end management of Apache Hadoop clusters. (Download Cloudera Manager here, and see install instructions here.) Among many other features, Cloudera Manager 4.5 adds support for Apache Hive. In this post, I’ll explain how to set up a Hive server for use with Cloudera Manager 4.5 (and later).
For details about other new features in this release, please see the full release notes:
It has been a while since I have blogged, primarily because we have been heads-down working toward the Cloudera Manager 4.5 release that we announced yesterday!
Cloudera Manager has seen rapid adoption among enterprise customers, and the more clusters are deployed into production environments, the more feature requests we get from them. We have heard our customers, and the Cloudera Manager 4.5 release aims to address many of these requests. Kudos to the engineering team for another feature-packed release.
Today is an exciting day for Cloudera customers and users. With an update to our 100% open source platform and a number of new add-on products, every software component we ship is getting either a minor or major update. There’s a lot to cover and this blog post is only a summary. In the coming weeks we’ll do follow-on blog posts that go deeper into each of these releases.
You may have seen the recent announcement from Skytap about the availability of pre-configured CDH4 templates in the Skytap Cloud public template library. So for anyone who wants to try out a Cloudera Hadoop cluster—from small to large—it can now be easily accomplished in Skytap Cloud. The how-to below from Skytap’s Matt Sousely explains how.
The goal of this how-to will be to spin up a 10-node Cloudera Hadoop cluster in Skytap Cloud. To begin, let’s talk about the two new Cloudera Hadoop cluster templates. The first is Cloudera CDH4 Hadoop cluster: a 2-node Hadoop cluster template. It includes 2 nodes and a management node/server. The second is the Cloudera CDH4 Hadoop host template. This second template is not intended to run by itself in a configuration—rather, it contains a host VM that is ready to become another Hadoop node in the Cloudera CDH4 Hadoop cluster template-based configuration.
I am pleased to announce the release of Cloudera Impala Beta (version 0.4) and Cloudera Manager 4.1.3. Key enhancements in each release are:
Cloudera Impala Beta (version 0.4)
Because raising the visibility of Apache Hadoop use cases is so important, in this post we bring you a re-posted story about how and why Rapleaf, a marketing data company based in San Francisco, uses Cloudera Enterprise (CDH and Cloudera Manager).
Founded in 2006, Rapleaf’s mission is to make it incredibly easy for marketers to access the data they need so they can personalize content for their customers. Rapleaf helps clients “fill in the blanks” about their customers by taking contact lists and, in real time, providing supplemental data points, statistics and aggregate charts and graphs that are guaranteed to have greater than 90% accuracy. Rapleaf is powered by Cloudera.
Business Challenges Before Cloudera
With the availability of this new demo VM containing Cloudera Manager Free Edition and CDH4.1.2 on CentOS 6.2, getting quick hands-on experience with a freeze-dried single-node Apache Hadoop cluster is just a few minutes away after the download process.
This new addition to our growing Demo VM menagerie is available, as usual, in VMware, VirtualBox, and KVM flavors. A 64-bit host OS is required.
I am pleased to announce the release of Cloudera Impala Beta (version 0.3) and Cloudera Manager 4.1.2. Key enhancements in each release are:
Cloudera Impala Beta (version 0.3)