Cloudera Engineering Blog

Big Data best practices, how-to's, and internals from Cloudera Engineering and the community


How-to: Manage Time-Dependent Multilayer Networks in Apache Hadoop

Using an appropriate network representation and the right tool set are the key factors in successfully merging structured and time-series data for analysis.

In Part 1 of this series, you took your first steps for using Apache Giraph, the highly scalable graph-processing system, alongside Apache Hadoop. In this installment, you’ll explore a general use case for analyzing time-dependent, Big Data graphs using data from multiple sources. You’ll learn how to generate random large graphs and small-world networks using Giraph – as well as play with several parameters to probe the limits of your cluster.

Congratulations to Parquet, Now an Apache Incubator Project

Yesterday, Parquet was accepted into the Apache Incubator. Congratulations to all the contributors to what will eventually become Apache Parquet!

In its relatively short lifetime (co-founded by Twitter and Cloudera in July 2013), Parquet has already become the de facto standard for columnar storage of Apache Hadoop data — with native support in Impala, Apache Hive, Apache Pig, Apache Spark, MapReduce, Apache Tajo, Apache Drill, Apache Crunch, and Cascading (and forthcoming in Presto and Shark). Parquet adoption is also broad-based, with employees of the following companies (partial list) actively contributing:

How-to: Configure JDBC Connections in Secure Apache Hadoop Environments

Learn how HiveServer, Apache Sentry, and Impala help make Hadoop play nicely with BI tools when Kerberos is involved.

In 2010, I wrote a simple pair of blog entries outlining the general considerations behind using Apache Hadoop with BI tools. The Cloudera partner ecosystem has positively exploded since then, and the technology has matured as well. Today, if JDBC is involved, all the pieces needed to expose Hadoop data through familiar BI tools are available:

How-to: Convert Existing Data into Parquet

Learn how to convert your data to the Parquet columnar format to get big performance gains.

Using a columnar storage format for your data offers significant performance advantages for a large subset of real-world queries. (Click here for a great introduction.)

Meet the Data Scientist: Alan Paulsen

Meet Alan Paulsen, among the first to earn the CCP: Data Scientist distinction.

Big Data success requires professionals who can prove their mastery with the tools and techniques of the Apache Hadoop stack. However, experts predict a major shortage of advanced analytics skills over the next few years. At Cloudera, we’re drawing on our industry leadership and early corpus of real-world experience to address the Big Data talent gap with the Cloudera Certified Professional (CCP) program.

New Training: Design and Build Big Data Applications

Cloudera’s new “Designing and Building Big Data Applications” is a great springboard for writing apps for an enterprise data hub.

Cloudera’s vision of an enterprise data hub as a central, scalable repository for all your data is changing the notion of data warehousing. The best way to gain value from all of your data is by bringing more workloads to where the data lives. That place is Apache Hadoop.

Using Impala at Scale at Allstate

Our thanks to Don Drake (@dondrake), an independent technology consultant who is currently working as a Principal Big Data Consultant at Allstate Insurance, for the guest post below about his experiences with Impala.

It started with a simple request from one of the managers in my group at Allstate to put together a demo of Tableau connecting to Cloudera Impala. I had previously worked on Impala with a large dataset about a year ago while it was still in beta, and was curious to see how Impala had improved since then in features and stability.

How-to: Process Time-Series Data Using Apache Crunch

Did you know that using the Crunch API is a powerful option for doing time-series analysis?

Apache Crunch is a Java library for building data pipelines on top of Apache Hadoop. (The Crunch project was originally founded by Cloudera data scientist Josh Wills.) Developers can spend more time focused on their use case by using the Crunch API to handle common tasks such as joining data sets and chaining jobs together in a pipeline. At Cloudera, we are so enthusiastic about Crunch that we have included it in CDH 5! (You can get started with Apache Crunch here and here.)

How-to: Use the ShareLib in Apache Oozie (CDH 5)

The internals of Oozie’s ShareLib have changed recently (reflected in CDH 5.0.0). Here’s what you need to know.

In a previous blog post about one year ago, I explained how to use the Apache Oozie ShareLib in CDH 4. Since that time, things have changed about the ShareLib in CDH 5 (particularly directory structure), so some of the previous information is now obsolete. (These changes went upstream under OOZIE-1619.) 

This Month in the Ecosystem (April 2014)

Welcome to our eighth edition of “This Month in the Ecosystem,” a digest of highlights from April 2014 (never intended to be comprehensive; for completeness, see the excellent Hadoop Weekly).

More good news!

Newer Posts Older Posts