Cloudera Developer Blog

Big Data best practices, how-to's, and internals from Cloudera Engineering and the community


This Month in the Ecosystem (June 2014)

Welcome to our 10th edition of “This Month in the Ecosystem,” a digest of highlights from June 2014 (never intended to be comprehensive; for that, see the excellent Hadoop Weekly).

Pretty busy for early Summer:

Jeff Dean’s Talk at Cloudera

Google’s Jeff Dean — among the original architects of MapReduce, Bigtable, and Spanner — revealed some fascinating facts about Google’s internal environment at Cloudera HQ recently.

Earlier this week, we were pleased to welcome Google Senior Fellow Jeff Dean to Cloudera’s Palo Alto HQ to give an overview of some of his group’s current research. Jeff has a peerless pedigree in distributed computing circles, having been deeply involved in the design and implementation of Google’s original advertising serving system, MapReduce, Bigtable, Spanner, and a host of other projects.

How-to: Build Advanced Time-Series Pipelines in Apache Crunch

Learn how creating dataflow pipelines for time-series analysis is a lot easier with Apache Crunch.

In a previous blog post, I described a data-driven market study based on Wikipedia access data and content. I explained how useful it is to combine several public data sources, and how this approach sheds light onto the hidden correlations across Wikipedia pages.

Apache Hive on Apache Spark: Motivations and Design Principles

Two of the most vibrant communities in the Apache Hadoop ecosystem are now working together to bring users a Hive-on-Spark option that combines the best elements of both.

Apache Hive is a popular SQL interface for batch processing and ETL using Apache Hadoop. Until recently, MapReduce was the only execution engine in the Hadoop ecosystem, and Hive queries could only run on MapReduce. But today, alternative execution engines to MapReduce are available — such as Apache Spark and Apache Tez (incubating).

Why Extended Attributes are Coming to HDFS

Extended attributes in HDFS will facilitate at-rest encryption for Project Rhino, but they have many other uses, too.

Many mainstream Linux filesystems implement extended attributes, which let you associate metadata with a file or directory beyond common “fixed” attributes like filesize, permissions, modification dates, and so on. Extended attributes are key/value pairs in which the values are optional; generally, the key and value sizes are limited to some implementation-specific limit. A filesystem that implements extended attributes also provides system calls and shell commands to get, list, set, and remove attributes (and values) to/from a file or directory.

Where to Find Cloudera Tech Talks (Through September 2014)

Find Cloudera tech talks in Texas, Oregon, Washington DC, Illinois, Georgia, Japan, and across the SF Bay Area during the next calendar quarter.

Below please find our regularly scheduled quarterly update about where to find tech talks by Cloudera employees – this time, for the third calendar quarter of 2014 (July through September; traditionally, the least active quarter of the year). Note that this list will be continually curated during the period; complete logistical information may not be available yet. And remember, many of these talks are in “free” venues (no cost of entry).

How-to: Create an IntelliJ IDEA Project for Apache Hadoop

Prefer IntelliJ IDEA over Eclipse? We’ve got you covered: learn how to get ready to contribute to Apache Hadoop via an IntelliJ project.

It’s generally useful to have an IDE at your disposal when you’re developing and debugging code. When I first started working on HDFS, I used Eclipse, but I’ve recently switched to JetBrains’ IntelliJ IDEA (specifically, version 13.1 Community Edition).

How-to: Install a Virtual Apache Hadoop Cluster with Vagrant and Cloudera Manager

It’s been a while since we provided a how-to for this purpose. Thanks, Daan Debie (@DaanDebie), for allowing us to re-publish the instructions below (for CDH 5)!

I recently started as a Big Data Engineer at The New Motion. While researching our best options for running an Apache Hadoop cluster, I wanted to try out some of the features available in the newest version of Cloudera’s Hadoop distribution: CDH 5. Of course I could’ve downloaded the QuickStart VM, but I rather wanted to run a virtual cluster, making use of the 16GB of RAM my shiny new 15″ Retina Macbook Pro has ;)

Meet the Data Scientist: Sandy Ryza

Meet Sandy Ryza (@SandySifting), the newest member of Cloudera’s data science team. See Sandy present at Spark Summit 2014 (June 30-July 1 in San Francisco; register here for a 20% discount).

What is your definition of a “data scientist”?

Project Rhino Goal: At-Rest Encryption for Apache Hadoop

An update on community efforts to bring at-rest encryption to HDFS — a major theme of Project Rhino.

Encryption is a key requirement for many privacy and security-sensitive industries, including healthcare (HIPAA regulations), card payments (PCI DSS regulations), and the US government (FISMA regulations).

Newer Posts Older Posts