Cloudera Engineering Blog · Training Posts

New Cloudera Search Training: Learn Powerful Techniques for Full-Text Search on an EDH

Cloudera Search combines the speed of Apache Solr with the scalability of CDH. Our newest training course covers this exciting technology in depth, from indexing to user interfaces, and is ideal for developers, analysts, and engineers who want to learn how to effectively search both structured and unstructured data at scale.

Although it is not yet 10 years old, Apache Hadoop already has an interesting history. Some of you may know that it was inspired by the Google File System and MapReduce papers, which detailed how the search giant was able to store and process vast amounts of data. Search was the original Big Data application, and, in fact, Hadoop itself was a spinoff of a project designed to create a reliable, scalable system to index data using one of Doug Cutting’s other creations: Apache Lucene.

New Apache Spark Developer Training: Beyond the Basics

While the new Spark Developer training from Cloudera University is valuable for developers who are new to Big Data, it’s also a great call for MapReduce veterans.

When I set out to learn Apache Spark (which ships inside Cloudera’s open source platform) about six months ago, I started where many other people do: by following the various online tutorials available from UC Berkeley’s AMPLab, the creators of Spark. I quickly developed an appreciation for the elegant, easy-to-use API and super-fast results, and was eager to learn more.

Meet the Data Scientist: Alan Paulsen

Meet Alan Paulsen, among the first to earn the CCP: Data Scientist distinction.

Big Data success requires professionals who can prove their mastery with the tools and techniques of the Apache Hadoop stack. However, experts predict a major shortage of advanced analytics skills over the next few years. At Cloudera, we’re drawing on our industry leadership and early corpus of real-world experience to address the Big Data talent gap with the Cloudera Certified Professional (CCP) program.

New Training: Design and Build Big Data Applications

Cloudera’s new “Designing and Building Big Data Applications” course is a great springboard for writing apps for an enterprise data hub.

Cloudera’s vision of an enterprise data hub as a central, scalable repository for all your data is changing the notion of data warehousing. The best way to gain value from all of your data is by bringing more workloads to where the data lives. That place is Apache Hadoop.

Meet the Data Scientist: Stuart Horsman

Meet Stuart Horsman, among the first to earn the CCP: Data Scientist distinction.

Big Data success requires professionals who can prove their mastery with the tools and techniques of the Hadoop stack. However, experts predict a major shortage of advanced analytics skills over the next few years. At Cloudera, we’re drawing on our industry leadership and early corpus of real-world experience to address the Big Data talent gap with the Cloudera Certified Professional (CCP) program.

Meet the Data Scientist: David F. McCoy

Meet David F. McCoy, one of the first to have earned the title “CCP: Data Scientist” from Cloudera University.

Big Data success requires professionals who can prove their mastery with the tools and techniques of the Hadoop stack. However, experts predict a major shortage of advanced analytics skills over the next few years. At Cloudera, we’re drawing on our industry leadership and early corpus of real-world experience to address the Big Data talent gap with the Cloudera Certified Professional (CCP) program.

Meet the Instructor: Bruce Martin

In this installment of “Meet the Instructor”, our interview subject is Bruce Martin.

What is your role at Cloudera?

NYU, Analytics, and Cloudera’s QuickStart VM

The Cloudera QuickStart VM is an important platform for learning any Hadoop-related curriculum.

In the Fall 2013 semester, more than 30 NYU graduate students completed the Real-time and Big Data Analytics course at the NYU Courant Institute of Mathematical Sciences, for which I served as instructor.

HBase Training: Demystifying Real-Time Big Data Storage

We at Cloudera University have been busy lately, building and expanding our courses to help data professionals succeed. We’ve expanded the Hadoop Administrator course and created a new Data Analyst course. Now we’ve updated and relaunched our course on Apache HBase to help more organizations adopt Hadoop’s real-time Big Data store as a competitive advantage.

The course is designed to make sure developers and administrators with an HBase use case can start realizing value from day one. We doubled the length of the curriculum to four days, allowing a deep dive into HBase operations as well as development.

Hadoop Administrator Training Gets Hands-On

I’ve always held a strong bias that education is most effective when the student learns by doing. As a developer of technical curricula, my goal is to have training participants engage with real and relevant problems as much as possible through hands-on exercises. The high rate at which Apache Hadoop is changing, both as a technology and as an ecosystem, makes developing Cloudera training courses not only demanding but also seriously fun and rewarding.

I recently undertook the challenge of upgrading the Cloudera Administrator Training for Apache Hadoop. I more than quadrupled the amount of hands-on exercises from the previous version, adding a full day to the course. At four days, it’s now the most thorough training for Hadoop administrators and truly the best way to start building expertise.

Get Hired as a Certified Data Scientist

To paraphrase Nate Silver: “There is lots of data coming. Who will speak for all this data?”

Nearly every day, I read new articles about how Big Data is “changing everything.” Data scientists are unlocking new approaches that help researchers find the cure for cancer, banks fight fraud, the police fight drug-related crimes, and fantasy sports leaguers fight each other.

Meet the Instructor: Nathan Neff

In this installment of “Meet the Instructor,” we speak to St. Louis-based Nathan Neff, the Training Lead for Cloudera’s new Data Analyst course. 

What is your role at Cloudera?

New E-Learning for Parcels

Cloudera’s new Parcels installation format has been released, and I’m excited to highlight just how useful (and mind-blowingly cool) it is to system administrators and anyone responsible for maintaining a CDH cluster.

If you haven’t read about or played with Parcels, they make components of the distribution significantly easier to manage, install, and upgrade. The new Parcel distribution format works with Cloudera Manager 4.5 and later. When you perform installations and upgrades using Parcels, you get access to new Cloudera Manager features such as:

QuickStart VM: Now with Real-Time Big Data

For years, Cloudera has provided virtual machines that give you a working Apache Hadoop environment out-of-the-box. It’s the quickest way to learn and experiment with Hadoop right from your desktop.

We’re constantly updating and improving the QuickStart VM, and the latest release includes two of Cloudera’s new products that give you easier and faster access to your data: Cloudera Search and Cloudera Impala. We’ve also added corresponding applications to Hue, an open source web-based interface for Hadoop and the easiest way to interact with your data.

Make Hadoop Your Best Business Tool

Data analysts and business intelligence specialists have been at the heart of new trends driving business growth over the past decade, including log file and social media analytics. However, Big Data heretofore has been beyond the reach of analysts because traditional tools like relational databases don’t scale, and scalable systems like Apache Hadoop have historically required Java expertise. 

Cloudera Academic Partnership Program: Creating Hadoop Lovers in Universities Worldwide

Today Cloudera announced a new Cloudera Academic Partnership program, in which participating universities worldwide get access to curriculum, training, certification, and software. 

As noted in the press release, the global demand for people with Apache Hadoop and data science skills is dwarfing all supply. We consider it an important mission to help accredited universities meet that demand, by equipping them with the content and training they need to educate students in the Hadoop arts.

How Persado Supports Persuasion Marketing Technology with Data Analyst Training

This guest post comes from Alex Giamas, Senior Software Engineer on the data warehouse team at Persado, an ultra-hot persuasion marketing technology company with operations in Athens, Greece.

A World-Class EDW Requires a World-Class Hadoop Team

Persado is the global leader in persuasion marketing technology, a new category in digital marketing. Our revolutionary technology maps the genome of marketing language and generates the messages that work best for any customer and any product at any time. To assure the highest quality experience for both our clients and end-users, our engineering team collaborates with Ph.D. statisticians and data analysts to develop new ways to segment audiences, discover content, and deliver the most relevant and effective marketing messages in real time.

Video Premiere: Training a New Generation of Data Scientists

Data scientists drive data as a platform to answer previously unimaginable questions. These multi-talented data professionals are in demand like never before because they identify or create some of the most exciting and potentially profitable business opportunities across industries. However, a scarcity of existing external talent will require companies of all sizes to find, develop, and train their people with backgrounds in software engineering, statistics, or traditional business intelligence as the next generation of data scientists.

Join us for the premiere of Training a New Generation of Data Scientists on Tuesday, March 26, at 2pm ET/11am PT. In this video, Cloudera’s Senior Director of Data Science, Josh Wills, will discuss what data scientists do, how they think about problems, the relationship between data science and Hadoop, and how Cloudera training can help you join this increasingly important profession. Following the video, Josh will answer your questions about data science, Hadoop, and Cloudera’s Introduction to Data Science: Building Recommender Systems course.

Apache Hadoop Developer Training Helps Query Massive Telecom Data

This guest post is provided by Rohit Menon, Product Support and Development Specialist at Subex.

I am a software developer in Denver and have been working with C#, Java, and Ruby on Rails for the past six years. Writing code is a big part of my life, so I constantly keep an eye out for new advances, developments, and opportunities in the field, particularly those that promise to have a significant impact on software engineering and the industries that rely on it. 

In my current role working on revenue assurance products in the telecom space for Subex, I have regularly heard from customers that their data is growing at tremendous rates and becoming increasingly difficult to process, often forcing them to portion out data into smaller, more manageable subsets. The more I heard about this problem, the more I realized that the current approach is not a solution, but an opportunity, since companies could clearly benefit from more affordable and flexible ways to store data. Better query capability on larger data sets at any given time also seemed key to derive the rich, valuable information that helps drive business. Ultimately, I was hoping to find a platform on which my customers could process all their data whenever they needed to. As I delved into this Big Data problem of managing and analyzing at mega-scale, it did not take long before I discovered Apache Hadoop.

Mission: Hands-On Hadoop

Meet the Instructor: Glynn Durham

In this installment of “Meet the Instructor,” we speak to San Francisco-based Glynn Durham, one of the big brains behind Cloudera’s Introduction to Data Science training and certification. 

What is your role at Cloudera?
I am a Senior Instructor with Cloudera University, which means I am a road warrior: I will travel anywhere to teach anything to anyone. I teach all the courses Cloudera offers, including custom private training events that I run at customer sites. Right now, I’m especially enjoying teaching Cloudera’s new course, Introduction to Data Science: Building Recommender Systems. In tandem with the rollout of the course, we’re developing Cloudera Certified Professional: Data Scientist exams, which will include a challenging performance-based lab component in addition to the written test.

How Syncsort Leverages Training to Optimize Hadoop Scalability

This guest post is provided by Dave Nahmias, Pre-Sales and Partner Solutions Engineer at Syncsort, with an introduction by Patty Crowell, Director of Global Education Services at Syncsort.

Introduction: Training is Key

Apache Hadoop is extremely important to maximizing the value Syncsort’s technology delivers to our customers. That value promise starts with a solid foundation of knowledge and skills among key technical staff across the company.

Webinar: Introduction to Hadoop Developer Training (Jan. 31)

Are you new to Apache Hadoop and need to start processing data fast and effectively? Have you been playing with CDH and are ready to move on to development supporting a technical or business use case? Are you prepared to unlock the full potential of all your data by building and deploying powerful Hadoop-based applications?

Save 15% on Multi-Course Public Training Enrollments in January and February

Cloudera University is the world leader in Apache Hadoop training and certification. Our full suite of live courses and online materials is the best resource to get started with your Hadoop cluster in development or advance it towards production.  We offer deep industry insight into the skills and expertise required to establish yourself as a leading Developer or Administrator managing and processing Big Data in this fast-growing field.

But did you know Cloudera training can also help you plan for the advanced stages and progress of your Hadoop cluster? In addition to core training for Developers and Administrators, we also offer the best (and, in some cases, only) opportunity to get up to speed on lifecycle projects within the Hadoop ecosystem in a classroom setting. Cloudera University’s course offerings go beyond the basics to include Training for Apache HBase, Training for Apache Hive and Pig, and Introduction to Data Science: Building Recommender Systems. Depending on your Big Data agenda, Cloudera training can help you increase the accessibility and queryability of your data, push your data performance towards real-time, conduct business-critical analyses using familiar scripting languages, build new applications and customer-facing products, and conduct data experiments to improve your overall productivity and profitability.

Meet the Instructor: Jesse Anderson

The Hadoop Community is an invariably fascinating world. After all, as Clouderan ATM put it in a past blog post, the user group meetups are adorably called “HUGs.” Just as the Cloudera blog has introduced you to some of the engineers, projects, and applications that serve as the head, heart, and hands of the Hadoop Community, we’re proud to add the circulatory system (to extend the metaphor), made up of Cloudera’s expert trainers and curriculum developers who bring Hadoop to new practitioners around the world every week.

Welcome to the first installment of our “Meet the Instructor” series, in which we briefly introduce you to some of the individuals endeavoring to teach Hadoop far and wide. Today, we speak to Jesse Anderson (@jessetanderson)! 

Get a Free Hadoop Operations Ebook with Administrator Training

Start the year off with bigger questions by taking advantage of Cloudera University’s special offer for aspiring Hadoop administrators. All participants who complete a Cloudera Administrator Training for Apache Hadoop public course by the end of March 2013 will receive a free digital copy of Hadoop Operations by Eric Sammer. If you’ve been asked to maintain large and complex Hadoop clusters, this book is a must. In addition to providing practical guidance from an expert, Hadoop Operations is also a terrific companion reference to the full Cloudera Administrator course.

Cloudera’s three-day course provides administrators a comprehensive understanding of all the steps necessary to operate and manage Hadoop clusters. From installation and configuration through load balancing and tuning your cluster, Cloudera’s administration course has you covered. This course is appropriate for system administrators and others who will be setting up or maintaining a Hadoop cluster. Basic Linux experience is a prerequisite, but prior knowledge of Hadoop is not required.

Introducing Cloudera CDH4 Certification

We are very pleased to introduce new, CDH4.1-aligned versions of the Cloudera Certified Developer for Apache Hadoop and Cloudera Certified Administrator for Apache Hadoop exams.

To celebrate, we’re offering a steep 40% discount on the new exams until the end of the year! Just use the promotion code CDH4 when you register to take the CCD-410 or CCA-410 exam through Pearson VUE before Dec. 31, 2012.

This Month in Data Science

Data science has been a ubiquitous topic of conversation in the IT and business worlds across the month of November. In this brief post, I’ll bring you just a small cross-section of the data science meme on the Interwebs in the past 4 weeks:

Training a New Generation of Data Scientists

Last week at Strata + Hadoop World 2012, we announced a new data science training and certification program. I am very excited to have been part of the team that put the program together, and I would like to answer some of the most frequently asked questions about the course and the certification that we will be offering.

Why is Cloudera offering data science training?

The primary bottleneck on the success of Hadoop is the number of people who are capable of using it effectively to solve business problems. Addressing that bottleneck with training has always been a very large part of our mission here at Cloudera, and we are very fortunate to have one of the best training teams anywhere. So far, we have trained over 15,000 Hadoop developers and administrators, and our courses and certification exams are available all over the world.

Apache Hadoop on Your PC: Cloudera’s CDH4 Virtual Machine

Today ZDNet has very helpfully published a guide to downloading, configuring, and using Cloudera’s Demo VM for CDH4 (available in three flavors, but in this case the VMware version). As the author, Andrew Brust, explains, the VM contains a “pre-built, training-appropriate, 1-node Apache Hadoop cluster” (on top of CentOS). Perhaps most important for boot-strappers, it’s free.

You can download the VM here - and there is a Hadoop tutorial available here. The combo will go a long way toward jump-starting explorations. Thanks, ZDNet!

Hadoop World 2011: A Glimpse into Development

The Development track at Hadoop World is a technical deep dive dedicated to discussion about Apache Hadoop and application development for Apache Hadoop. You will hear committers, contributors and expert users from various Hadoop projects discuss the finer points of building applications with Hadoop and the related ecosystem. The sessions will touch on foundational topics such as HDFS, HBase, Pig, Hive, Flume and other related technologies. In addition, speakers will address key development areas including tools, performance, bringing the stack together and testing the stack. Sessions in this track are for developers of all levels who want to learn more about upcoming features and enhancements, new tools, advanced techniques and best practices.

Preview of Development Track Sessions

Cloudera Certification for Apache Hadoop at Hadoop Summit

Take advantage of the opportunity to become a Cloudera Certified Developer or Administrator for Apache Hadoop the day before Hadoop Summit, June 28th. This is the first time these certifications have been offered apart from their respective courses – so don’t miss the chance to validate your Hadoop expertise!

There are several exam times throughout the day for your convenience. The Developer exam lasts for 90 minutes, the Administrator exam for 60 minutes.

Become a Cloudera Certified Developer

Cloudera Training for Apache Hadoop Surrounding Hadoop Summit 2011

Cloudera is offering several training courses for Apache Hadoop over the dates surrounding Hadoop Summit. There are five different courses in all, spanning the dates of June 27th to July 1st. Three of these courses are specifically designed to provide the necessary knowledge for a robust overall understanding of Hadoop, and they tackle the “elephant” from several perspectives: developer, system administrator, and managerial. The other two training sessions focus on projects within the Hadoop ecosystem, namely Hive, Pig, and HBase.

Cloudera Developer Bootcamp for Apache Hadoop is a two-day course designed for developers who wish to learn the MapReduce framework and how to write programs against its API. The course covers similar material to our standard three-day Developer training, but has been condensed into two intensive days with extended course hours. At the end of the course, attendees have the opportunity to take an exam which, if passed, confers the Cloudera Certified Hadoop Developer credential.

Upcoming Apache Hadoop Training Sessions

As interest in Hadoop continues to grow, we continue to make public training sessions available to accommodate it. Cloudera training sessions are always evolving to stay current with Hadoop technology as the open source community continues to fine-tune and improve Hadoop and its surrounding ecosystem.

Cloudera provides training sessions tailored toward Developers, Administrators and Managers for Hadoop, HBase, Hive, Pig and Hue. The Hadoop Developer and Sysadmin training course includes the certification exam to become a Cloudera Certified Hadoop Developer.

Lessons Learned from Cloudera’s Hadoop Developer Training Course

This is a guest post from an attendee of our Hadoop Developer Training course, Attila Csordas, bioinformatician at the European Bioinformatics Institute, Hinxton, Cambridge, UK.

As a wet lab biologist turned bioinformatician, I have ~2 years of programming experience, mainly in Perl, and have been working with Java for the last 9 months. A bioinformatician is not a developer, so I write easy code in just a fraction of my work time: parsers, db connections, XML validators, little bug fixes, shell scripts. On the other hand, I now have 5 months of Hadoop experience – and a 6-month-old baby named Alice – and that experience is as immense as it gets. Ever since I read the classic Dean-Ghemawat paper, MapReduce: Simplified Data Processing on Large Clusters, I have been thinking about bioinformatics problems in terms of Map and Reduce functions (especially during my evening jog), then implementing these ideas in my free time–which consists of feeding the baby, writing code, changing the nappy, rewriting code.

New York Training Session for Managers Interested In Hadoop

Hadoop Essentials for Managers is a one-day course provided October 11th—the day prior to Hadoop World—that will provide decision-makers with the information they need about Apache Hadoop. In this session we will answer questions such as:

Register for Hadoop Training in New York and Get into Hadoop World for Free!

That’s right, sign up for any of the training courses surrounding Hadoop World 2010, and receive a complimentary pass to the conference! There are seven different courses on offer, so whether you are new to Hadoop or looking to deepen your skills, you’ll find something to fit your needs.

If you are a manager trying to decide whether Hadoop is an appropriate technology for your organization, Hadoop Essentials for Managers will answer your questions. We will show you when using Hadoop is appropriate, what Hadoop is being used for in a range of industries, how Hadoop fits into your existing environment and what you need to know in order to deploy it within your organization.

Hadoop Administrator Training Comes to London

Cloudera’s Apache Hadoop Training and Certification for System Administrators has made it across the Atlantic to London for the first time! This two-day course covers planning, deploying, maintaining, monitoring, and troubleshooting your Hadoop cluster. We’ll talk about HDFS, MapReduce, Apache Hive, Apache Pig, Apache HBase, Flume and more, from the System Administrator’s point of view. Take the certification exam at the end of your training and go home with a valuable validation of your Hadoop knowledge.

Enter the code “london_10pct” when registering and receive a 10% discount!

Hadoop World: NYC – Training

Our vision for Hadoop World is a conference where both newcomers and experienced Hadoop users can learn and be part of the growing Hadoop community.

We are also offering training sessions for newcomers and experienced Hadoop users alike. Whether you are looking for an Introduction to Hadoop, Hadoop Certification, or a closer look at related Hadoop projects, we have the training you are looking for.

Exciting new Hadoop Training Offerings from Cloudera

Around the globe, more and more companies are turning to Hadoop to tackle data processing problems that don’t lend themselves well to traditional systems. Users in the community consistently ask us to offer training in more places and expand our course offerings, and those who have obtained certification have reported great success connecting with companies investing in Hadoop. All of this keeps us pretty excited about the long-term prospects for Hadoop.

We recently announced our first international developer training sessions in Tokyo (sold out, waitlist available) and Taiwan, and we’re happy to follow up with sessions in the EU. We’ll be visiting London the first week of June, and Berlin the next. If you’ll be in Berlin that week, be sure to check out the Berlin Buzzwords conference, a two-day event focused on Hadoop, Lucene, and NoSQL.

Get Hadoop Training from Cloudera at the Hadoop Summit

We love getting together with other Hadoop fans and fanatics! We’ve put together new training offerings for this year’s upcoming Hadoop Summit in June, and we’ve worked out a special deal with Yahoo! to waive the conference registration fee for anyone who attends a Cloudera training session at the 2010 Hadoop Summit (you’ll get a discount code for training in your conference registration confirmation). In addition to our developer certification course, we’ll offer an extended version of our Systems Administration course, as well as a new, full-day course on HBase. One particularly exciting new offering is our full-day course on Hive, which opens Hadoop up to anyone who knows SQL.

All of these offerings are driven by direct customer feedback about what their organizations need to be even more successful with Hadoop, and we’re excited to help. We look forward to seeing you there.

Cloudera’s Apache Hadoop Training Programs Expand Internationally

It’s been over a year now since we started offering Hadoop training in the Bay Area, and since then, we’ve put many of our introductory materials online (for free), and offer in-person public classes in cities around the US (click here for a full list of sessions). The response has been incredible, but one thing is painfully obvious: we’re not doing enough to meet the needs of the growing world-wide Apache Hadoop community.

To that end, we’ve made investments in translating our materials into new languages and thinking about how to scale our training programs internationally.

Hadoop World: NYC 2009

To say we were surprised by the quality and quantity of submissions we received for Hadoop World: NYC 2009 would be an understatement. We were amazed at how many “normal” companies have come to use Hadoop for everything ranging from business intelligence to protein alignment. It’s truly exciting to see how a system originally designed to process and index the web has evolved to support the data-driven workloads of so many industries.

It’s with great joy that we invite you to come learn about what the following companies have done with Hadoop: About.com, Booz Allen Hamilton, China Mobile, ContextWeb, eBay, Facebook, IBM, Intel, JPMC, Microsoft, The New York Times, NexR, Rackspace, Vertica, Visa, Visible Measures, Yale, and Yahoo!

Running the Cloudera Training VM in VirtualBox

Update (May 1, 2013): The post below, which is based on an outdated VM, is deprecated. Instead, please see the Cloudera QuickStart VM, which runs on VirtualBox, VMware, and KVM.

Cloudera’s Training VM is one of the most popular resources on our website. It was created with VMware Workstation, and plays nicely with the VMware Player for Windows, Linux, and Mac. But VMware isn’t for everyone. Thomas Lockney has managed to get our VM image running on VirtualBox, and has written a step-by-step guide for the community. Thanks Thomas! – Christophe

Announcing Cloudera Certification for Apache Hadoop

As Apache Hadoop continues to turn heads at startups and big enterprises alike, Cloudera has received several requests to offer certification in addition to our popular training programs.

Certification is a critical component of any software ecosystem, and especially so for open source projects with quickly expanding user bases. Certification allows developers to ensure their skills are up to date, and allows employers and customers to confidently identify individuals who are up for the challenge of solving problems with Hadoop.

Apache Pig Training Now Available Online

Today I did a web search for “pig training” using my favorite search engine. I was wildly entertained by the results, and have embedded my favorite for your viewing pleasure.

Configuring Eclipse for Apache Hadoop Development (a screencast)

Update (added 5/15/2013): The information below is dated; see this post for current instructions about configuring Eclipse for Hadoop contributions.

One of the perks of using Java is the availability of functional, cross-platform IDEs.  I use vim for my daily editing needs, but when it comes to navigating, debugging, and coding large Java projects, I fire up Eclipse.