Cloudera Developer Blog

Big Data best practices, how-to's, and internals from Cloudera Engineering and the community


Making Apache Spark Easier to Use in Java with Java 8

Our thanks to Prashant Sharma and Matei Zaharia of Databricks for their permission to re-publish the post below about future Java 8 support in Apache Spark. Spark is now generally available inside CDH 5.

One of Apache Spark’s main goals is to make big data applications easier to write. Spark has always had concise APIs in Scala and Python, but its Java API was verbose due to the lack of function expressions. With the addition of lambda expressions in Java 8, we’ve updated Spark’s API to transparently support these expressions, while staying compatible with old versions of Java. This new support will be available in Spark 1.0.

A Few Examples
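
As a rough illustration of the verbosity gap described above, here is a minimal sketch in plain Java. It does not use Spark: the standard `java.util.function.Predicate` stands in for Spark's own function interfaces, and the names `LambdaSketch`, `count`, `countErrorsVerbose`, and `countErrorsConcise` are hypothetical, chosen only for this example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Illustrative only: Predicate is an assumed stand-in for Spark's function interfaces.
class LambdaSketch {

    // Counts the lines accepted by the given filter.
    static long count(List<String> lines, Predicate<String> filter) {
        long n = 0;
        for (String line : lines) {
            if (filter.test(line)) {
                n++;
            }
        }
        return n;
    }

    // Without lambdas: the filter is written as an anonymous inner class.
    static long countErrorsVerbose(List<String> lines) {
        return count(lines, new Predicate<String>() {
            @Override
            public boolean test(String s) {
                return s.contains("error");
            }
        });
    }

    // With Java 8: the same filter written as a lambda expression.
    static long countErrorsConcise(List<String> lines) {
        return count(lines, s -> s.contains("error"));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("ok", "error: disk", "warn", "error: net");
        System.out.println(countErrorsVerbose(lines)); // prints 2
        System.out.println(countErrorsConcise(lines)); // prints 2
    }
}
```

Both methods pass the same logic to `count`; the lambda version simply drops the class and method boilerplate, which is the kind of saving the post attributes to Java 8 support.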

Meet the Data Scientist: Stuart Horsman

Meet Stuart Horsman, among the first to earn the CCP: Data Scientist distinction.

Big Data success requires professionals who can prove their mastery with the tools and techniques of the Hadoop stack. However, experts predict a major shortage of advanced analytics skills over the next few years. At Cloudera, we’re drawing on our industry leadership and early corpus of real-world experience to address the Big Data talent gap with the Cloudera Certified Professional (CCP) program.

How-to: Run a Simple Apache Spark App in CDH 5

Getting started with Spark (now shipping inside CDH 5) is easy using this simple example.

Apache Spark is a general-purpose cluster computing framework that, like MapReduce in Apache Hadoop, offers powerful abstractions for processing large datasets. For various reasons pertaining to performance, functionality, and APIs, Spark is already becoming more popular than MapReduce for certain types of workloads. (For more background about Spark, read this post.)

How-to: Use cron-like Scheduling in Apache Oozie

Improved scheduling capabilities via Oozie in CDH 5 make for far fewer headaches.

One of the best new Apache Oozie features in CDH 5, Cloudera’s software distribution, is the ability to use cron-like syntax for coordinator frequencies. Previously, frequencies had to be fixed intervals (every hour or every two days, for example), which made anything more complicated (such as every hour from 9am to 5pm on weekdays, or the second-to-last day of every month) difficult to schedule.

Hello, Apache Hadoop 2.4.0

The community has voted to release Apache Hadoop 2.4.0.

Hadoop 2.4.0 includes myriad improvements to HDFS and MapReduce, including (but not limited to):

Sneak Preview: "Ecosystem" Track at HBaseCon 2014

The HBaseCon 2014 “Ecosystem” track offers a cross-section view of the most interesting projects emerging on top of, or alongside, HBase.

HBaseCon 2014 (May 5, 2014 in San Francisco) is not just a reflection of HBase itself; it’s also a celebration of the entire ecosystem. Thanks again, Program Committee!

Hue Flies High at Goibibo

Our thanks to Amar Parkash, a Software Developer at Goibibo, a leading travel portal in India, for the enthusiastic support of Hue you’ll read below.

At Goibibo, we use Hue in our production environment. I came across Hue while looking for a near real-time log search tool and got to know about Cloudera Search and the interface provided by Hue. I tried it on my machine and was really impressed by the UI it provides for Apache Hive, Apache Pig, HDFS, job browser, and basically everything in the Big Data domain. We immediately deployed Hue in production, and that has been one of the best decisions we have ever made for our data platform at Goibibo.

How-to: Process Data using Morphlines (in Kite SDK)

Our thanks to Janos Matyas, CTO and Founder of SequenceIQ, for the guest post below about his company’s use case for Morphlines (part of the Kite SDK).

SequenceIQ has an Apache Hadoop-based platform and API that consume and ingest various types of data from different sources to offer predictive analytics and actionable insights. Our datasets include structured and unstructured data, log files, and communication records, and they require constant refining, cleaning, and transformation.

This Month in the Ecosystem (March 2014)

Welcome to our seventh edition of “This Month in the Ecosystem,” a digest of highlights from March 2014 (never intended to be comprehensive; for completeness, see the excellent Hadoop Weekly).

More good news for the ecosystem!

Sneak Preview: "Features & Internals" Track at HBaseCon 2014

The HBaseCon 2014 “Features & Internals” track covers the newest developments in Apache HBase functionality.

The HBaseCon 2014 (May 5, 2014 in San Francisco) agenda has something for everyone, particularly developers building apps on HBase. Thanks again, Program Committee!
