Cloudera Developer Blog · CDH Posts
As Cloudera’s keeper of customer stories, it’s dawned on me that others might benefit from the information I’ve spent the past year collecting: the many use cases and deployment patterns for Hadoop amongst our customer base.
This week I’d like to highlight Nokia, a global company that we’re all familiar with as a large mobile phone provider, and whose Senior Director of Analytics – Amy O’Connor – will be speaking at tomorrow’s Cloudera Sessions event in Boston.
Fun fact: Nokia has been in business for more than 150 years, starting with the production of paper in the 1800s. When I first met Amy O’Connor in early 2012, she explained to me that Nokia has always been in the business of transforming resources into useful products — from paper and rubber over a century ago, to the electronics and mobile devices we’re familiar with today.
Today Cloudera announced a new Cloudera Academic Partnership program, in which participating universities worldwide get access to curriculum, training, certification, and software.
As noted in the press release, the global demand for people with Apache Hadoop and data science skills is dwarfing all supply. We consider it an important mission to help accredited universities meet that demand, by equipping them with the content and training they need to educate students in the Hadoop arts.
Furthermore, we are cognizant of the fact that many academic research labs are in need of tools to help deploy, manage, and extend Hadoop clusters. For that reason, CAP members get free access to Cloudera Manager Enterprise Edition for 12 months to support data-intensive testing, development, and research.
A World-Class EDW Requires a World-Class Hadoop Team
Persado is the global leader in persuasion marketing technology, a new category in digital marketing. Our revolutionary technology maps the genome of marketing language and generates the messages that work best for any customer and any product at any time. To assure the highest quality experience for both our clients and end-users, our engineering team collaborates with Ph.D. statisticians and data analysts to develop new ways to segment audiences, discover content, and deliver the most relevant and effective marketing messages in real time.
Given the challenge of creating a market based on ongoing data collection and massive query ability, the data warehouse organization ultimately plays the most important role in the persuasion marketing value chain, assuring a steady and unobstructed multidirectional flow of information. My team continuously ensures Persado’s infrastructure is aligned to the needs of our data scientists, including regularly generating KPI reports, managing data from heterogeneous sources, preparing customized analyses, and even implementing specific statistical algorithms in Java based on reference implementations of R.
It’s only Rock and Roll, but I like it!
– Mick Jagger
Copyright is having a tough time in the digital age. New copies of music, movies and software can be created at near zero cost. Some wonder whether it still makes sense to ever charge for content. Over the past century large industries have developed that sell content. These industries resist change. We consumers love our content, but don’t love paying for it. But would all the content we love still exist without payment for copyright?
One solution might be to replace sales of copyrighted material with services that provide access to the content. We could buy tickets to concerts, a service provided by musicians. We could enjoy streaming music services via subscriptions or supported by advertising. Similarly, we could access software as a service in the cloud. Some companies, like Google, create proprietary software yet don’t make money off it by selling its copyright, but only through services. That’s a great model, but is it the only way forward?
This guest post comes to us from David Greco, CTO of Eligotech.
Vagrant is a very nice tool for programmatically managing many virtual machines (VMs) on a single physical machine. It natively supports VirtualBox and also provides plugins for VMware Fusion and Amazon EC2, supporting the management of VMs in those environments as well.
Vagrant provides a very easy-to-use, Ruby-based internal DSL that allows the user to define one or more virtual machines together with their configuration parameters. Furthermore, it offers different mechanisms for automatic provisioning: You can use Puppet, Chef, or shell scripts for automating software installation and configuration on the machines defined in the Vagrant configuration file.
In the technology business, building a thriving and progressive user ecosystem around a platform is about as Mom-and-apple-pie as you can get. We all intuitively acknowledge that it’s one of the metrics for success.
Perhaps the most under-appreciated aspect of any platform ecosystem is the recognition that it is fundamentally built by real people. Without enthusiastic users of a platform engaging as evangelists on its behalf, the growth of the ecosystem around it will eventually slow to a crawl.
Cloudera Manager includes a new express installation wizard for Amazon Web Services (AWS) EC2. Its goal is to enable Cloudera Manager users to provision CDH clusters and Cloudera Impala (the open source distributed query engine for Apache Hadoop) on EC2 as easily as possible (for testing and development purposes only, not supported for production workloads) - and thus is currently the fastest way to provision a Cloudera Manager-managed cluster in EC2.
The new distinguishing feature introduced in version 4.5 is that Cloudera Manager can now launch and configure the instances for you, so you don’t have to worry about launching the instances, authorizing SSH keys, and configuring a firewall. All this can now be done from within Cloudera Manager!
Since Cloudera Manager and the nodes running CDH use internal hostnames to communicate, the Cloudera Manager server must run on EC2 as well. In fact, the Cloud Express Wizard only appears when installing Cloudera Manager on EC2.
Hue is an open-source web interface for Apache Hadoop packaged with CDH that focuses on improving the overall experience for the average user. The Apache Oozie application in Hue provides an easy-to-use interface to build workflows and coordinators. Basic management of workflows and coordinators is available through the dashboards with operations such as killing, suspending, or resuming a job.
Prior to Hue 2.2 (included in CDH 4.2), there was no way to manage workflows within Hue that were created outside of Hue. As of Hue 2.2, importing a pre-existing Oozie workflow by its XML definition is now possible.
How to import a workflow
Importing a workflow is pretty straightforward. All it requires is the workflow definition file and access to the Oozie application in Hue. Follow these steps to import a workflow:
- Go to Oozie Editor/Dashboard > Workflows and click the “Import” button.
This guest post is provided by Rohit Menon, Product Support and Development Specialist at Subex.
I am a software developer in Denver and have been working with C#, Java, and Ruby on Rails for the past six years. Writing code is a big part of my life, so I constantly keep an eye out for new advances, developments, and opportunities in the field, particularly those that promise to have a significant impact on software engineering and the industries that rely on it.
In my current role working on revenue assurance products in the telecom space for Subex, I have regularly heard from customers that their data is growing at tremendous rates and becoming increasingly difficulty to process, often forcing them to portion out data into small, more manageable subsets. The more I heard about this problem, the more I realized that the current approach is not a solution, but an opportunity, since companies could clearly benefit from more affordable and flexible ways to store data. Better query capability on larger data sets at any given time also seemed key to derive the rich, valuable information that helps drive business. Ultimately, I was hoping to find a platform on which my customers could process all their data whenever they needed to. As I delved into this Big Data problem of managing and analyzing at mega-scale, it did not take long before I discovered Apache Hadoop.
Mission: Hands-On Hadoop
My initial reading about Hadoop on the various blogs and forums had me convinced that it is easily one of the best tools out there for handling and processing large volumes of data. At first, I thought I’d be able to learn Hadoop on my own by reading Hadoop: The Definitive Guide and the Hadoop Tutorial from Yahoo! However, after only a few days of reading, it became clear that I would benefit greatly from direct interaction with Hadoop experts, supervised experimentation, and interaction with practical examples of Hadoop challenges from the field.
The current (4.2) release of CDH — Cloudera’s 100% open-source distribution of Apache Hadoop and related projects (including Apache HBase) — introduced a new HBase feature, recently landed in trunk, that allows an admin to take a snapshot of a specified table.
Prior to CDH 4.2, the only way to back-up or clone a table was to use Copy/Export Table, or after disabling the table, copy all the hfiles in HDFS. Copy/Export Table is a set of tools that uses MapReduce to scan and copy the table but with a direct impact on Region Server performance. Disabling the table stops all reads and writes, which will almost always be unacceptable.
In contrast, HBase snapshots allow an admin to clone a table without data copies and with minimal impact on Region Servers. Exporting the snapshot to another cluster does not directly affect any of the Region Servers; export is just a distcp with an extra bit of logic.