In the technology business, building a thriving and progressive user ecosystem around a platform is about as Mom-and-apple-pie as you can get. We all intuitively acknowledge that it’s one of the metrics for success.
Perhaps the most under-appreciated aspect of any platform ecosystem is the recognition that it is fundamentally built by real people. Without enthusiastic users of a platform engaging as evangelists on its behalf, the growth of the ecosystem around it will eventually slow to a crawl.
Cloudera Manager 4.5 includes a new express installation wizard for Amazon Web Services (AWS) EC2. (This feature is also available in Cloudera Manager Free Edition.) Its goal is to let Cloudera Manager users provision CDH clusters and Cloudera Impala (the new open source distributed query engine for Apache Hadoop) on EC2 as easily as possible; it is currently the fastest way to provision a Cloudera Manager-managed cluster on EC2.
The distinguishing new feature is that Cloudera Manager can now launch and configure the EC2 instances for you, so you no longer have to launch instances, authorize SSH keys, or configure a firewall yourself. All of this can now be done from within Cloudera Manager!
Since Cloudera Manager and the nodes running CDH use internal hostnames to communicate, the Cloudera Manager server must run on EC2 as well. In fact, the Cloud Express Wizard only appears when installing Cloudera Manager on EC2.
Hue is an open-source web interface for Apache Hadoop packaged with CDH that focuses on improving the overall experience for the average user. The Apache Oozie application in Hue provides an easy-to-use interface to build workflows and coordinators. Basic management of workflows and coordinators is available through the dashboards with operations such as killing, suspending, or resuming a job.
Prior to Hue 2.2 (included in CDH 4.2), there was no way to manage workflows within Hue that were created outside of Hue. As of Hue 2.2, importing a pre-existing Oozie workflow by its XML definition is now possible.
How to import a workflow
Importing a workflow is pretty straightforward. All it requires is the workflow definition file and access to the Oozie application in Hue. Follow these steps to import a workflow:
- Go to Oozie Editor/Dashboard > Workflows and click the “Import” button.
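For reference, the workflow definition file mentioned above is simply Oozie’s XML workflow format. A minimal, hypothetical definition of the kind you might import could look like the following (the application name, the shell action, and the schema versions are illustrative only, not taken from the post):

```xml
<workflow-app name="example-import-wf" xmlns="uri:oozie:workflow:0.4">
    <start to="run-script"/>

    <!-- A single shell action that echoes a string; ${jobTracker} and
         ${nameNode} are supplied as job properties at submission time. -->
    <action name="run-script">
        <shell xmlns="uri:oozie:shell-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <exec>echo</exec>
            <argument>imported-from-xml</argument>
        </shell>
        <ok to="end"/>
        <error to="fail"/>
    </action>

    <kill name="fail">
        <message>Shell action failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
```

Once imported, the workflow can be managed from Hue like any workflow created in the editor.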
This guest post is provided by Rohit Menon, Product Support and Development Specialist at Subex.
I am a software developer in Denver and have been working with C#, Java, and Ruby on Rails for the past six years. Writing code is a big part of my life, so I constantly keep an eye out for new advances, developments, and opportunities in the field, particularly those that promise to have a significant impact on software engineering and the industries that rely on it.
In my current role working on revenue assurance products in the telecom space for Subex, I have regularly heard from customers that their data is growing at tremendous rates and becoming increasingly difficult to process, often forcing them to portion data out into smaller, more manageable subsets. The more I heard about this problem, the more I realized that the current approach is not a solution but an opportunity, since companies could clearly benefit from more affordable and flexible ways to store data. Better query capability over larger data sets at any given time also seemed key to deriving the rich, valuable information that helps drive business. Ultimately, I was hoping to find a platform on which my customers could process all their data whenever they needed to. As I delved into this Big Data problem of managing and analyzing data at mega-scale, it did not take long before I discovered Apache Hadoop.
Mission: Hands-On Hadoop
My initial reading about Hadoop on various blogs and forums had me convinced that it was easily one of the best tools out there for handling and processing large volumes of data. At first, I thought I’d be able to learn Hadoop on my own by reading Hadoop: The Definitive Guide and the Hadoop Tutorial from Yahoo! However, after only a few days of reading, it became clear that I would benefit greatly from direct interaction with Hadoop experts, supervised experimentation, and exposure to practical examples of Hadoop challenges from the field.
The current (4.2) release of CDH — Cloudera’s 100% open-source distribution of Apache Hadoop and related projects (including Apache HBase) — introduced a new HBase feature, recently landed in trunk, that allows an admin to take a snapshot of a specified table.
Prior to CDH 4.2, the only ways to back up or clone a table were to use Copy/Export Table or, after disabling the table, to copy all of its HFiles in HDFS. Copy/Export Table is a set of tools that uses MapReduce to scan and copy the table, but with a direct impact on Region Server performance. Disabling the table stops all reads and writes, which is almost always unacceptable.
In contrast, HBase snapshots allow an admin to clone a table without data copies and with minimal impact on Region Servers. Exporting the snapshot to another cluster does not directly affect any of the Region Servers; export is just a distcp with an extra bit of logic.
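To give a concrete sense of how the feature is exposed, here is a minimal sketch using the HBase client API; the table and snapshot names are hypothetical, and the same operations are also available as HBase shell commands such as snapshot and clone_snapshot:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class SnapshotSketch {
  public static void main(String[] args) throws Exception {
    // Reads hbase-site.xml from the classpath to locate the cluster.
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    try {
      // Take a snapshot of the (hypothetical) "usertable" while it stays online;
      // no table data is copied, so the impact on Region Servers is minimal.
      admin.snapshot("usertable-snapshot", "usertable");

      // Clone the snapshot into a new table; again, no HFiles are copied.
      admin.cloneSnapshot("usertable-snapshot", "usertable-clone");
    } finally {
      admin.close();
    }
  }
}
```

Exporting a snapshot to another cluster is handled by the separate ExportSnapshot MapReduce tool, which is essentially the distcp-with-extra-logic mentioned above.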
Hadoop network encryption is a feature introduced in Apache Hadoop 2.0.2-alpha and in CDH4.1.
In this blog post, we’ll first cover Hadoop’s pre-existing security capabilities, then explain why network encryption may be required and provide some details on how it has been implemented. At the end of the post, you’ll find step-by-step instructions to help you set up a Hadoop cluster with network encryption.
A Bit of History on Hadoop Security
Starting with Apache Hadoop 0.20.20x and available in Hadoop 1 and Hadoop 2 releases (as well as CDH3 and CDH4 releases), Hadoop supports Kerberos-based authentication. This is commonly referred to as Hadoop Security. When Hadoop Security is enabled, it requires users to authenticate (using Kerberos) in order to read and write data in HDFS or to submit and manage MapReduce jobs. In addition, all Hadoop services authenticate with each other using Kerberos.
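The step-by-step setup comes later in the post, but as a rough sketch, the authentication and wire-encryption knobs live in the standard Hadoop configuration files. The property names below are the standard Apache Hadoop ones; the example is deliberately incomplete, since Kerberos principals, keytabs, and per-service settings are omitted:

```xml
<!-- core-site.xml: enable Kerberos authentication/authorization and protect RPC traffic -->
<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value>
</property>
<property>
  <name>hadoop.security.authorization</name>
  <value>true</value>
</property>
<property>
  <name>hadoop.rpc.protection</name>
  <!-- authentication (default) | integrity | privacy (encrypts RPC payloads) -->
  <value>privacy</value>
</property>

<!-- hdfs-site.xml: encrypt HDFS block data in transit (new in Hadoop 2.0.2-alpha / CDH4.1) -->
<property>
  <name>dfs.encrypt.data.transfer</name>
  <value>true</value>
</property>
```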
Hue lets you interact with Hadoop services from within your browser without having to go to a command-line interface. It features a file browser for HDFS, an Apache Oozie Application for creating workflows of data processing jobs, a job designer/browser for MapReduce, Apache Hive and Cloudera Impala query editors, a Shell, and a collection of Hadoop APIs.
The goal of this release was to add a set of new features and improve the user experience. Read on for a list of the major changes (from 304 commits).
Today is an exciting day for Cloudera customers and users. With an update to our 100% open source platform and a number of new add-on products, every software component we ship is getting either a minor or major update. There’s a lot to cover and this blog post is only a summary. In the coming weeks we’ll do follow-on blog posts that go deeper into each of these releases.
We’re now supporting several hundred production Hadoop clusters. In doing so, we’ve had to make a lot of advances in the functionality, reliability, and manageability of the Hadoop platform. Even with these improvements, customers have traditionally been reluctant to run certain data and applications on the Apache Hadoop platform. The new products we are announcing today were designed to remove these obstacles to adoption.
In my previous post, you learned how to write a basic MapReduce job and run it on Apache Hadoop. In this post, we’ll delve deeper into MapReduce programming and cover some of the framework’s more advanced features. In particular, we’ll explore:
The following is a series of stories from people who recently worked as Engineering Interns at Cloudera. These experiences concretely illustrate how collaboration between commercial companies like Cloudera and academia, such as in the form of these internships, helps promote big data research at universities. (These experiences were previously published in the ACM student journal, XRDS.)
Yanpei Chen (Intern 2011)
I interned with Cloudera during my last summer of grad school. My dissertation was on “Workload Driven Design and Evaluation of Large-Scale Data-Centric Systems”, and I already had collaborations with Facebook and NetApp, two other big data companies. The goal of my work was to develop and demonstrate a set of empirical, workload-driven design and evaluation methods that complemented the traditional, subjective approach of designing by intuition and experience. It was very important that these methods generalized across many types of customer workloads. Hence, when Cloudera offered me an internship, I leapt at the unique opportunity to collect insights from customers in traditional industries who were still dealing with big data.