New Apache Spark Developer Training: Beyond the Basics

While the new Spark Developer training from Cloudera University is valuable for developers who are new to Big Data, it’s also a great call for MapReduce veterans.

When I set out to learn Apache Spark (which ships inside Cloudera’s open source platform) about six months ago, I started where many other people do: by following the various online tutorials available from UC Berkeley’s AMPLab, the creators of Spark. I quickly developed an appreciation for the elegant, easy-to-use API and super-fast results, and was eager to learn more.

Unfortunately, that proved harder than I expected. It was easy to pick up the basic syntax for working with Spark’s core abstraction, Resilient Distributed Datasets (RDDs). But to use Spark to solve real-world problems, I needed a deeper understanding, realistic examples, guidance on best practices, and, of course, lots and lots of practice.

There is a wealth of information on the internet (videos, tutorials, academic papers, and a terrific, active user community), but I had to spend a lot of time weeding through it to find what I needed. Because Spark is so new, and changing so fast, it was hard to know which information I could rely on.

So I set out to build a training course that goes beyond the basics and gives developers what they need to start using Spark to solve their Big Data problems: Cloudera Developer Training for Apache Spark.

Beyond WordCount

Understanding how Spark operates under the hood is the key to writing efficient code that best takes advantage of Spark’s built-in features. The course certainly covers the basics, like how to create and operate on RDDs, but it quickly goes beyond them. For instance, you’ll explore how Spark constructs Directed Acyclic Graphs, or DAGs, in order to execute tasks in parallel. You’ll learn how “narrow” operations like maps are pipelined together on a single node, whereas “wide” operations like grouping and reducing require shuffling results between cluster nodes.
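To make the narrow/wide distinction concrete, here is a minimal pure-Python sketch. It is not Spark code and not material from the course; it only imitates the idea. The names narrow_stage, shuffle, and owner are hypothetical, plain lists stand in for partitions, and a CRC32-based partitioner is assumed purely so the example is deterministic.

```python
import zlib
from collections import defaultdict


def narrow_stage(partition):
    """Chain narrow operations lazily over one partition (no data leaves it)."""
    words = (w for line in partition for w in line.split())  # like flatMap
    lowered = (w.lower() for w in words)                      # like map
    return ((w, 1) for w in lowered if w.isalpha())           # like filter + map


def owner(key, num_partitions):
    """Hypothetical deterministic partitioner: which partition owns this key."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def shuffle(partitions_of_pairs, num_partitions):
    """Wide step: every pair is routed to the partition that owns its key."""
    out = [defaultdict(list) for _ in range(num_partitions)]
    for pairs in partitions_of_pairs:
        for key, value in pairs:
            out[owner(key, num_partitions)][key].append(value)
    return out


# Two lists stand in for two partitions of a text dataset.
partitions = [
    ["Spark builds a DAG", "narrow steps pipeline"],
    ["wide steps shuffle", "Spark schedules stages"],
]

# Stage 1: generators only, so nothing is computed until the shuffle pulls data.
stage1 = [narrow_stage(p) for p in partitions]

# Stage boundary: the shuffle consumes each pipelined generator record by record.
shuffled = shuffle(stage1, num_partitions=2)

# Stage 2: each partition can now sum the values of the keys it owns.
counts = {key: sum(values) for part in shuffled for key, values in part.items()}
```

The three generator expressions in narrow_stage run as a single pass over each partition, which is the sense in which narrow operations are pipelined. Only the shuffle needs to see records from every partition, and that is where a stage boundary falls.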

Armed with this knowledge, you’ll learn techniques for minimizing expensive network shuffling, such as using shared variables and favoring reduce operations (which reduce the data locally before shuffling it across the network) over grouping operations (which shuffle all the data across the network).
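The difference between grouping and reducing can be illustrated with a small pure-Python simulation. This is a sketch under stated assumptions, not Spark itself: shuffle_group, shuffle_reduce, and final_counts are hypothetical helpers, lists stand in for partitions, and "records sent over the network" is modelled simply as the length of a list. In Spark’s RDD API the corresponding operations are groupByKey and reduceByKey.

```python
from collections import Counter


def shuffle_group(partitions):
    """Grouping: every (key, value) record crosses the network unchanged."""
    return [pair for part in partitions for pair in part]


def shuffle_reduce(partitions):
    """Reducing: each partition sums its own values per key first,
    then sends a single record per key."""
    shuffled = []
    for part in partitions:
        local = Counter()
        for key, value in part:
            local[key] += value
        shuffled.extend(local.items())
    return shuffled


def final_counts(records):
    """Merge the shuffled records into per-key totals on the receiving side."""
    totals = Counter()
    for key, value in records:
        totals[key] += value
    return dict(totals)


# Two lists stand in for two partitions of (log level, 1) pairs.
partitions = [
    [("error", 1), ("info", 1), ("error", 1), ("error", 1)],
    [("info", 1), ("info", 1), ("warn", 1), ("error", 1)],
]

grouped = shuffle_group(partitions)   # 8 records shuffled
reduced = shuffle_reduce(partitions)  # 5 records shuffled

print(len(grouped), len(reduced))
print(final_counts(grouped) == final_counts(reduced))
```

Both paths arrive at the same totals, but the reduce path shuffles one record per key per partition instead of one per input record. On real data with many repeated keys, that gap is what makes reducing cheaper than grouping.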

Focus on Performance and Best Practices

Spark’s ability to distribute, cache, and process data in memory offers huge advantages over MapReduce for important data processing jobs such as graph analysis and machine learning. This course focuses on best practices for taking advantage of Spark’s capabilities, such as how, when, and why to cache data in memory, on local disk, or in HDFS. You will also learn about common performance bottlenecks and how to diagnose performance issues using the Spark Application Web UI. You will get hands-on experience solving performance problems using shared variables, checkpointing, and repartitioning.

Practice, Practice, Practice

I know from nearly 20 years of educating software developers that the only way to learn a new technology is to practice applying it: not just by typing in commands as instructed, but by actually applying knowledge to real-world problems. And when I’m learning a new technology, I learn as much through my mistakes as through my successes.

I wrote the course exercises to start with simple step-by-step instructions and then move on to challenge participants to think about how to solve realistic data processing problems, such as text-mining log files, correlating data from different sources in a variety of formats, and implementing analysis algorithms.

No Experience Necessary

I designed this course to be equally useful whether you are brand new to Big Data processing or an old hand at MapReduce and related technologies. Code examples and exercise solutions are available in both Python and Scala, so the only course requirement is experience developing applications in one of those two languages. We cover key related technologies such as cluster management, distributed file storage, and functional programming, and provide pointers to additional material for further study.

Spark’s growth in popularity over the last year has been huge, and adoption is accelerating; soon Spark will overtake MapReduce as the dominant technology for Big Data processing and analysis. This course is a great way for developers to get up to speed quickly and start using Spark to build faster, more flexible, easier-to-use applications.

Learn More

If you’d like to learn more, Cloudera is hosting a free webinar introducing Cloudera Developer Training for Apache Spark on Wed., July 23, at 10am PT/1pm ET. It will cover more about the course’s objectives, outline, prerequisites, and technical and business benefits, including a portion of the full training, plus Q&A. Register now!

You can also enroll in the full Spark Developer course by visiting Cloudera University. Public classes start in August and are currently scheduled in Redwood City, Calif., and Austin, with more class dates coming soon. Private training for your team is also available at your location and according to your schedule, so contact us for more information.

Diana Carroll is a curriculum designer for Cloudera University.