Cloudera Engineering Blog · How-to Posts
This Spark Streaming use case is a great example of how near-real-time processing can be brought to Hadoop.
Spark Streaming is one of the most interesting components within the Apache Spark stack. With Spark Streaming, you can create data pipelines that process streamed data using the same API that you use for processing batch-loaded data. Furthermore, Spark Streaming’s “micro-batching” approach provides decent resiliency should a job fail for some reason.
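To make the micro-batching idea concrete, here is a minimal sketch in plain Python; it does not use Spark at all. It assumes a hypothetical in-memory list standing in for a live stream, and it cuts batches by record count, whereas Spark Streaming cuts them by time interval. The function names (`count_words`, `micro_batches`) are illustrative, not part of any Spark API. The point is only that the same batch function can be applied unchanged to each small batch.

```python
from collections import Counter
from itertools import islice
from typing import Iterable, Iterator, List


def count_words(lines: Iterable[str]) -> Counter:
    """Batch logic: count the words in any collection of lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def micro_batches(stream: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Cut a (possibly unbounded) stream into small fixed-size batches.

    Assumption: batches are cut by record count to keep the sketch
    deterministic; a real streaming engine would cut them by time.
    """
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Hypothetical input standing in for a live stream.
stream = ["spark streaming", "spark batch", "streaming data", "spark"]

running_total = Counter()
for batch in micro_batches(stream, batch_size=2):
    # The same function written for batch data runs on each micro-batch.
    running_total.update(count_words(batch))

print(running_total)
```

Because each micro-batch is a small, self-contained unit of work, a failed batch can in principle be recomputed on its own, which is roughly where the resiliency comes from.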
The combination of OpenShift and Kite SDK turns out to be an effective one for developing and testing Apache Hadoop applications.
At Cloudera, our engineers develop a variety of applications on top of Hadoop to solve our own data needs (here and here). More recently, we’ve started to look at streamlining our development process by using a PaaS (Platform-as-a-Service) for some of these applications. Having single-click deployment and updates to consistent development environments lets us onboard new developers more quickly, and helps ensure that code is written and tested along patterns that promote high quality.
Using this new tutorial alongside Cloudera Live is now the fastest, easiest, and most hands-on way to get started with Hadoop.
At Cloudera, developer enablement is one of our most important objectives. One only has to look at examples from history (Java or SQL, for example) to know that knowledge fuels the ecosystem. That objective is what drives initiatives such as our community forums, the Cloudera QuickStart VM, and this blog itself.
The key to getting the most out of Spark is to understand the differences between its RDD API and the original Mapper and Reducer API.
Venerable MapReduce has been Apache Hadoop’s workhorse computation paradigm since its inception. It is ideal for the kinds of work for which Hadoop was originally designed: large-scale log processing and batch-oriented ETL (extract-transform-load) operations.
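For readers who have not written a Mapper and Reducer before, the shape of the paradigm can be sketched in a few lines of plain Python. This is a single-process illustration of the map, shuffle, and reduce phases, not the Hadoop API; the function names and the sample log lines are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def mapper(line: str) -> Iterator[Tuple[str, int]]:
    """Map phase: emit a (token, 1) pair for every token in a line."""
    for token in line.split():
        yield token.lower(), 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle phase: group all emitted values by key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reducer(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce phase: collapse the values for one key into a total."""
    return key, sum(values)


def run_job(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in order, in a single process."""
    pairs = (pair for line in lines for pair in mapper(line))
    return dict(reducer(key, values) for key, values in shuffle(pairs).items())


# Hypothetical log lines for illustration only.
log_lines = ["GET /index", "GET /about", "POST /index"]
result = run_job(log_lines)
print(result)
```

In a real cluster the map and reduce functions run on many machines and the shuffle moves data over the network, but the contract is the same: mappers emit key-value pairs, and each reducer sees every value for a given key.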
The ability to quickly and accurately count complex events is a legitimate business advantage.
In our work as data scientists, we spend most of our time counting things. It is the foundational skill that is used in data cleansing, reporting, feature engineering, and simple-but-effective machine learning models like Naive Bayes classifiers. Hilary Mason has a quote about the benefits of counting that I love:
IPython Notebook and Spark’s Python API are a powerful combination for data science.
The developers of Apache Spark have given thoughtful consideration to Python as a language of choice for data analysis. They have developed the PySpark API for working with RDDs in Python, and further support using the powerful IPython shell instead of the built-in Python REPL.
With this new release, setting up a separate MIT KDC for cluster authentication services is no longer necessary.
Kerberos (initially developed by MIT in the 1980s) has been adopted by every major component of the Apache Hadoop ecosystem. Consequently, Kerberos has become an integral part of the security infrastructure for the enterprise data hub (EDH).
Learn how creating dataflow pipelines for time-series analysis is a lot easier with Apache Crunch.
In a previous blog post, I described a data-driven market study based on Wikipedia access data and content. I explained how useful it is to combine several public data sources, and how this approach sheds light on the hidden correlations across Wikipedia pages.
Prefer IntelliJ IDEA over Eclipse? We’ve got you covered: learn how to get ready to contribute to Apache Hadoop via an IntelliJ project.
It’s generally useful to have an IDE at your disposal when you’re developing and debugging code. When I first started working on HDFS, I used Eclipse, but I’ve recently switched to JetBrains’ IntelliJ IDEA (specifically, version 13.1 Community Edition).
It’s been a while since we provided a how-to for this purpose. Thanks, Daan Debie (@DaanDebie), for allowing us to re-publish the instructions below (for CDH 5)!
I recently started as a Big Data Engineer at The New Motion. While researching our best options for running an Apache Hadoop cluster, I wanted to try out some of the features available in the newest version of Cloudera’s Hadoop distribution: CDH 5. Of course I could’ve downloaded the QuickStart VM, but I wanted to run a virtual cluster instead, making use of the 16GB of RAM my shiny new 15″ Retina MacBook Pro has ;)