Cloudera Developer Blog · Training Posts
We at Cloudera University have been busy lately, building and expanding our courses to help data professionals succeed. We’ve expanded the Hadoop Administrator course and created a new Data Analyst course. Now we’ve updated and relaunched our course on Apache HBase to help more organizations adopt Hadoop’s real-time Big Data store as a competitive advantage.
The course is designed to make sure developers and administrators with an HBase use case can start realizing value from day one. We doubled the length of the curriculum to four days, allowing a deep dive into HBase operations as well as development.
As the primary course author, I had the pleasure of interviewing some of the most notable members of the HBase community. People like Michael Stack, Lars George, and Amandeep Khurana have written the books, contributed code, and deployed and supported huge clusters in production. I also tried to capture many of the key insights that otherwise only exist in HBase’s tribal knowledge, some of which I discuss in my recent blog posts on the REST Interface and the Thrift Interface, as well as in the Simple User Access chapter of the Apache HBase Reference Guide.
Beyond the Tribe
I’ve always held a strong bias that education is most effective when the student learns by doing. As a developer of technical curricula, my goal is to have training participants engage with real and relevant problems as much as possible through hands-on exercises. The high rate at which Apache Hadoop is changing, both as a technology and as an ecosystem, makes developing Cloudera training courses not only demanding but also seriously fun and rewarding.
I recently undertook the challenge of upgrading the Cloudera Administrator Training for Apache Hadoop. I more than quadrupled the number of hands-on exercises from the previous version, adding a full day to the course. At four days, it’s now the most thorough training for Hadoop administrators and truly the best way to start building expertise.
While developing the course, I collaborated with some of the most knowledgeable Hadoop administrators I could find, including Eric Sammer, Amandeep Khurana, Kathleen Ting, Romain Rigaux, and many other smart folks at Cloudera. The upgrades to the curriculum and exercises are based on best practices used to resolve our customers’ biggest problems. These insights resonate throughout the course, including the determination that administrators should learn installation, configuration, maintenance, monitoring, and troubleshooting using the standard Hadoop tools. Although we certainly hope that Hadoop users take advantage of Cloudera Manager to simplify and streamline many of these tasks, we believe that every good administrator needs to first take a look under the hood and tinker with Hadoop’s guts. There’s no replacement for get-your-hands-dirty experience to achieve expertise.
Cluster in the Cloud
To paraphrase Nate Silver: “There is lots of data coming. Who will speak for all this data?”
Nearly every day, I read new articles about how Big Data is “changing everything.” Data scientists are unlocking new approaches that help researchers find the cure for cancer, banks fight fraud, the police fight drug-related crimes, and fantasy sports leaguers fight each other.
It seems like all I need is an analytics platform like Apache Hadoop and a big pile of data, and actionable insights will just leap out at me, right? Well… not quite. Hadoop makes the difficult easy and the impossible merely difficult. However, we still have to know what we’re looking for and, once we’ve found it, understand what the results mean.
In this installment of “Meet the Instructor,” we speak to St. Louis-based Nathan Neff, the Training Lead for Cloudera’s new Data Analyst course.
What is your role at Cloudera?
I’m an instructor teaching almost all of Cloudera’s curricula: Developer, Administrator, Data Analyst, HBase, and Hadoop Essentials. I’m currently gearing up to start delivering Cloudera’s Introduction to Data Science training, which, from an instructor’s perspective, is a pretty exciting challenge. Most of the classes I teach are live and in-person, but I’ve also recorded screencasts and helped design multimedia courseware for Cloudera’s customers, which was a lot of fun.
Cloudera’s new Parcels installation format has been released, and I’m excited to highlight just how useful (and mind-blowingly cool) it is to system administrators and anyone responsible for maintaining a CDH cluster.
If you haven’t read about or played with Parcels, they make components of the distribution significantly easier to install, manage, and upgrade. The new Parcel distribution format works with Cloudera Manager 4.5 and later. When you perform installations and upgrades using Parcels, you get access to new Cloudera Manager features such as:
For years, Cloudera has provided virtual machines that give you a working Apache Hadoop environment out-of-the-box. It’s the quickest way to learn and experiment with Hadoop right from your desktop.
We’re constantly updating and improving the QuickStart VM, and the latest release includes two new Cloudera products that give you easier, faster access to your data: Cloudera Search and Cloudera Impala. We’ve also added corresponding applications to Hue – an open source web-based interface for Hadoop, and the easiest way to interact with your data.
Cloudera Search integrates Apache Solr with the rest of the platform, letting you run full-text searches over the data stored in your cluster, just as you would with an online search engine. Cloudera Impala, on the other hand, lets you execute SQL queries against that same data, on the same platform, and get results back fast enough for interactive exploration and analysis. With both workloads available on one cluster, you no longer have the pain of moving large datasets around.
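To make the two access paths concrete, here is a hedged command-line sketch; it assumes a running CDH cluster with Search and Impala enabled, and the hostname, Solr collection, and table names (`logs_collection`, `web_logs`) are illustrative placeholders, not Cloudera defaults:

```shell
# Assumes a running CDH cluster with Cloudera Search and Impala enabled.
# The hostname, collection name, and table name are placeholders.

# Full-text search through Cloudera Search (Solr's standard HTTP select API):
curl "http://quickstart.cloudera:8983/solr/logs_collection/select?q=message:error&wt=json"

# SQL over data on the same cluster, through Impala's command-line shell:
impala-shell -q "SELECT status, COUNT(*) FROM web_logs GROUP BY status;"
```

Both commands operate on data that lives on the cluster itself, which is exactly the point: no bulk export to a separate search engine or SQL system is needed.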
Data analysts and business intelligence specialists have been at the heart of new trends driving business growth over the past decade, including log file and social media analytics. However, Big Data has heretofore been beyond the reach of analysts, because traditional tools like relational databases don’t scale, and scalable systems like Apache Hadoop have historically required Java expertise.
Today, the rise of new ecosystem tools is rapidly broadening the community using Hadoop and Big Data. Projects like Cloudera Impala, Apache Hive, and Apache Pig have for the first time made Big Data accessible to those with traditional analytics backgrounds. With the launch of Data Analyst Training, Cloudera is helping the world’s analysts prove there’s nothing traditional about data analytics and BI on Hadoop.
The Democratization of Big Data
Today Cloudera announced a new Cloudera Academic Partnership program, in which participating universities worldwide get access to curriculum, training, certification, and software.
As noted in the press release, the global demand for people with Apache Hadoop and data science skills far outstrips the supply. We consider it an important mission to help accredited universities meet that demand by equipping them with the content and training they need to educate students in the Hadoop arts.
Furthermore, we recognize that many academic research labs need tools to help deploy, manage, and extend Hadoop clusters. For that reason, CAP members get free access to Cloudera Manager Enterprise Edition for 12 months to support data-intensive testing, development, and research.
A World-Class EDW Requires a World-Class Hadoop Team
Persado is the global leader in persuasion marketing technology, a new category in digital marketing. Our revolutionary technology maps the genome of marketing language and generates the messages that work best for any customer and any product at any time. To assure the highest quality experience for both our clients and end-users, our engineering team collaborates with Ph.D. statisticians and data analysts to develop new ways to segment audiences, discover content, and deliver the most relevant and effective marketing messages in real time.
Given the challenge of creating a market based on ongoing data collection and massive query capability, the data warehouse organization ultimately plays the most important role in the persuasion marketing value chain, assuring a steady, unobstructed, multidirectional flow of information. My team continuously ensures Persado’s infrastructure is aligned with the needs of our data scientists: regularly generating KPI reports, managing data from heterogeneous sources, preparing customized analyses, and even implementing specific statistical algorithms in Java based on reference implementations in R.
Data scientists drive data as a platform to answer previously unimaginable questions. These multi-talented data professionals are in demand like never before because they identify or create some of the most exciting and potentially profitable business opportunities across industries. However, a scarcity of external talent will require companies of all sizes to find, develop, and train people with backgrounds in software engineering, statistics, or traditional business intelligence to become the next generation of data scientists.
Join us for the premiere of Training a New Generation of Data Scientists on Tuesday, March 26, at 2pm ET/11am PT. In this video, Cloudera’s Senior Director of Data Science, Josh Wills, will discuss what data scientists do, how they think about problems, the relationship between data science and Hadoop, and how Cloudera training can help you join this increasingly important profession. Following the video, Josh will answer your questions about data science, Hadoop, and Cloudera’s Introduction to Data Science: Building Recommender Systems course.