Cloudera Engineering Blog

Big Data best practices, how-to's, and internals from Cloudera Engineering and the community


Apache Hadoop 2.5.0 is Released

The Apache Hadoop community has voted to release Apache Hadoop 2.5.0.

Apache Hadoop 2.5.0 is a minor release in the 2.x release line and includes some major features and improvements, including:

How-to: Use IPython Notebook with Apache Spark

IPython Notebook and Spark’s Python API are a powerful combination for data science.

The developers of Apache Spark have given thoughtful consideration to Python as a language of choice for data analysis. They have developed the PySpark API for working with RDDs in Python, and further support using the powerful IPythonshell instead of the builtin Python REPL.

New in CDH 5.1: HDFS Read Caching

Applications using HDFS, such as Impala, will be able to read data up to 59x faster thanks to this new feature.

Server memory capacity and bandwidth have increased dramatically over the last few years. Beefier servers make in-memory computation quite attractive, since a lot of interesting data sets can fit into cluster memory, and memory is orders of magnitude faster than disk.

This Month in the Ecosystem (July 2014)

Welcome to our 11th edition of “This Month in the Ecosystem,” a digest of highlights from July 2014 (never intended to be comprehensive; for that, see the excellent Hadoop Weekly).

Progress Report: Cloudera Community Forums After One Year

Cloudera Community forums are proving their value as an important contributor to a rich user experience.

It’s been almost exactly one year since the debut of the Cloudera Community forums. In addition to doing the birthday shout-out, I thought it would be interesting to bring you up to date about adoption and usage patterns.

Meet the Engineer: Sravya Tirukkovalur

Meet Sravya Tirukkovalur (@sravsatuluri), a Software Engineer working on Apache Hadoop security at Cloudera.

What do you do at Cloudera, and in which Apache projects are you involved?

New in CDH 5.1: Hue’s Improved Search App

An improved Search app in Hue 3.6 makes the Hadoop user experience even better.

Hue 3.6 (now packaged in CDH 5.1) has brought the second version of the Search App up to even higher standards. The user experience has been greatly improved, as the app now provides a very easy way to build custom dashboards and visualizations.

What’s New in Kite SDK 0.15.0?

Kite SDK’s new release contains new improvements that make working with data easier.

Recently, Kite SDK, the open source toolset that helps developers build systems on the Apache Hadoop ecosystem, became a 0.15.0. In this post, you’ll get an overview of several new features and bug fixes.

Working with Datasets by URI

New in CDH 5.1: Apache Spark 1.0

Spark 1.0 reflects a lot of hard work from a very diverse community.

Cloudera’s latest platform release, CDH 5.1, includes Apache Spark 1.0, a milestone release for the Spark project that locks down APIs for Spark’s core functionality. The release reflects the work of hundreds of contributors (including our own Diana Carroll, Mark Grover, Ted Malaska, Colin McCabe, Sean Owen, Hari Shreedharan, Marcelo Vanzin, and me).

New in Cloudera Manager 5.1: Direct Active Directory Integration for Kerberos Authentication

With this new release, setting up a separate MIT KDC for cluster authentication services is no longer necessary.

Kerberos (initially developed by MIT in the 1980s) has been adopted by every major component of the Apache Hadoop ecosystem. Consequently, Kerberos has become an integral part of the security infrastructure for the enterprise data hub (EDH).

New in CDH 5.1: Document-level Security for Cloudera Search

Cloudera Search now supports fine-grain access control via document-level security provided by Apache Sentry.

In my previous blog post, you learned about index-level security in Apache Sentry (incubating) and Cloudera Search. Although index-level security is effective when the access control requirements for documents in a collection are homogenous, often administrators want to restrict access to certain subsets of documents in a collection.

New Apache Spark Developer Training: Beyond the Basics

While the new Spark Developer training from Cloudera University is valuable for developers who are new to Big Data, it’s also a great call for MapReduce veterans.

When I set out to learn Apache Spark (which ships inside Cloudera’s open source platform) about six months ago, I started where many other people do: by following the various online tutorials available from UC Berkeley’s AMPLab, the creators of Spark. I quickly developed an appreciation for the elegant, easy-to-use API and super-fast results, and was eager to learn more.

Cloudera Enterprise 5.1 is Now Available

Cloudera Enterprise’s newest release contains important new security and performance features, and offers support for the latest innovations in the open source platform.

We’re pleased to announce the release of Cloudera Enterprise 5.1 (comprising CDH 5.1, Cloudera Manager 5.1, and Cloudera Navigator 2.0).

Jay Kreps, Apache Kafka Architect, Visits Cloudera

It was good to see Jay Kreps (@jaykreps), the LinkedIn engineer who is the tech lead for that company’s online data infrastructure, visit Cloudera Engineering yesterday to spread the good word about Apache Kafka.

Kafka, of course, was originally developed inside LinkedIn and entered the Apache Incubator in 2011. Today, it is being widely adopted as a pub/sub framework that works at massive scale (and which is commonly used to write to Apache Hadoop clusters, and even data warehouses).

The New Hadoop Application Architectures Book is Here!

There’s an important new addition coming to the Apache Hadoop book ecosystem. It’s now in early release!

We are very happy to announce that the new Apache Hadoop book we have been writing for O’Reilly Media, Hadoop Application Architectures, is now available as an early release! It contains the first two chapters and can be found in O’Reilly’s Catalog and via Safari.        

Estimating Financial Risk with Apache Spark

Learn how Spark facilitates the calculation of computationally-intensive statistics such as VaR via the Monte Carlo method.

Under reasonable circumstances, how much money can you expect to lose? The financial statistic value at risk (VaR) seeks to answer this question. Since its development on Wall Street soon after the stock market crash of 1987, VaR has been widely adopted across the financial services industry. Some organizations report the statistic to satisfy regulations, some use it to better understand the risk characteristics of large portfolios, and others compute it before executing trades to help make informed and immediate decisions.

This Month in the Ecosystem (June 2014)

Welcome to our 10th edition of “This Month in the Ecosystem,” a digest of highlights from June 2014 (never intended to be comprehensive; for that, see the excellent Hadoop Weekly).

Pretty busy for early Summer:

Jeff Dean’s Talk at Cloudera

Google’s Jeff Dean — among the original architects of MapReduce, Bigtable, and Spanner — revealed some fascinating facts about Google’s internal environment at Cloudera HQ recently.

Earlier this week, we were pleased to welcome Google Senior Fellow Jeff Dean to Cloudera’s Palo Alto HQ to give an overview of some of his group’s current research. Jeff has a peerless pedigree in distributed computing circles, having been deeply involved in the design and implementation of Google’s original advertising serving system, MapReduce, Bigtable, Spanner, and a host of other projects.

How-to: Build Advanced Time-Series Pipelines in Apache Crunch

Learn how creating dataflow pipelines for time-series analysis is a lot easier with Apache Crunch.

In a previous blog post, I described a data-driven market study based on Wikipedia access data and content. I explained how useful it is to combine several public data sources, and how this approach sheds light onto the hidden correlations across Wikipedia pages.

Apache Hive on Apache Spark: Motivations and Design Principles

Two of the most vibrant communities in the Apache Hadoop ecosystem are now working together to bring users a Hive-on-Spark option that combines the best elements of both.

(Editor’s note [Feb. 25, 2015]: A Hive-on-Spark beta release is now available for download. Learn more here.)

Why Extended Attributes are Coming to HDFS

Extended attributes in HDFS will facilitate at-rest encryption for Project Rhino, but they have many other uses, too.

Many mainstream Linux filesystems implement extended attributes, which let you associate metadata with a file or directory beyond common “fixed” attributes like filesize, permissions, modification dates, and so on. Extended attributes are key/value pairs in which the values are optional; generally, the key and value sizes are limited to some implementation-specific limit. A filesystem that implements extended attributes also provides system calls and shell commands to get, list, set, and remove attributes (and values) to/from a file or directory.

Where to Find Cloudera Tech Talks (Through September 2014)

Find Cloudera tech talks in Texas, Oregon, Washington DC, Illinois, Georgia, Japan, and across the SF Bay Area during the next calendar quarter.

Below please find our regularly scheduled quarterly update about where to find tech talks by Cloudera employees – this time, for the third calendar quarter of 2014 (July through September; traditionally, the least active quarter of the year). Note that this list will be continually curated during the period; complete logistical information may not be available yet. And remember, many of these talks are in “free” venues (no cost of entry).

How-to: Create an IntelliJ IDEA Project for Apache Hadoop

Prefer IntelliJ IDEA over Eclipse? We’ve got you covered: learn how to get ready to contribute to Apache Hadoop via an IntelliJ project.

It’s generally useful to have an IDE at your disposal when you’re developing and debugging code. When I first started working on HDFS, I used Eclipse, but I’ve recently switched to JetBrains’ IntelliJ IDEA (specifically, version 13.1 Community Edition).

How-to: Install a Virtual Apache Hadoop Cluster with Vagrant and Cloudera Manager

It’s been a while since we provided a how-to for this purpose. Thanks, Daan Debie (@DaanDebie), for allowing us to re-publish the instructions below (for CDH 5)!

I recently started as a Big Data Engineer at The New Motion. While researching our best options for running an Apache Hadoop cluster, I wanted to try out some of the features available in the newest version of Cloudera’s Hadoop distribution: CDH 5. Of course I could’ve downloaded the QuickStart VM, but I rather wanted to run a virtual cluster, making use of the 16GB of RAM my shiny new 15″ Retina Macbook Pro has ;)

Meet the Data Scientist: Sandy Ryza

Meet Sandy Ryza (@SandySifting), the newest member of Cloudera’s data science team. See Sandy present at Spark Summit 2014 (June 30-July 1 in San Francisco; register here for a 20% discount).

What is your definition of a “data scientist”?

Project Rhino Goal: At-Rest Encryption for Apache Hadoop

An update on community efforts to bring at-rest encryption to HDFS — a major theme of Project Rhino.

Encryption is a key requirement for many privacy and security-sensitive industries, including healthcare (HIPAA regulations), card payments (PCI DSS regulations), and the US government (FISMA regulations).

How-to: Easily Do Rolling Upgrades with Cloudera Manager

Unique across all options, Cloudera Manager makes it easy to do what would otherwise be a disruptive operation for operators and users.

For the increasing number of customers that rely on enterprise data hubs (EDHs) for business-critical applications, it is imperative to minimize or eliminate downtime — thus, Cloudera has focused intently on making software upgrades a routine, non-disruptive operation for EDH administrators and users.

This Month in the Ecosystem (May 2014)

Welcome to our ninth edition of “This Month in the Ecosystem,” a digest of highlights from May/early June 2014 (never intended to be comprehensive; for that, see the excellent Hadoop Weekly).

More good news!

Capacity Planning with Big Data and Cloudera Manager

Thanks to Bill Podell, VP Big Data and BI Practice, MBI Solutions, for the guest post below.

Capacity planning has long been a critical component of successful implementations for production systems. Today, Big Data calls for a particularly deep understanding of capacity management – because resource utilization explodes as business users, analysts, and data scientists jump onboard to analyze and use newly found data. The resource impact can escalate very quickly, causing poor loading and or response times. The result is throwing more hardware at the issue without any understanding of what impact the new hardware will have on the current issue. Better yet, be proactive and know about the problem before the problem even occurs!

How-to: Use Kite SDK to Easily Store and Configure Data in Apache Hadoop

Organizing your data inside Hadoop doesn’t have to be hard — Kite SDK helps you try out new data configurations quickly in either HDFS or HBase.

Kite SDK is a Cloudera-sponsored open source project that makes it easier for you to build applications on top of Apache Hadoop. Its premise is that you shouldn’t need to know how Hadoop works to build your application on it, even though that’s an unfortunately common requirement today (because the Hadoop APIs are low-level; all you get is a filesystem and whatever else you can dream up — well, code up).

Apache Spark 1.0 is Released

Spark 1.0 is its biggest release yet, with a list of new features for enterprise customers.

Congratulations to the Apache Spark community for today’s release of Spark 1.0, which includes contributions from more than 100 people (including Cloudera’s own Diana Carroll, Mark Grover, Ted Malaska, Sean Owen, Sandy Ryza, and Marcelo Vanzin). We think this release is an important milestone in the continuing rapid uptake of Spark by enterprises — which is supported by Cloudera via Cloudera Enterprise 5 — as a modern, general-purpose processing engine for Apache Hadoop.

Apache Spark Resource Management and YARN App Models

A concise look at the differences between how Spark and MapReduce manage cluster resources under YARN

The most popular Apache YARN application after MapReduce itself is Apache Spark. At Cloudera, we have worked hard to stabilize Spark-on-YARN (SPARK-1101), and CDH 5.0.0 added support for Spark on YARN clusters.

New SQL Choices in the Apache Hadoop Ecosystem: Why Impala Continues to Lead

Impala continues to demonstrate performance leadership compared to alternatives (by 950% or more), while providing greater query throughput and with a far smaller CPU footprint.

In our previous post from January 2014, we reported that Impala had achieved query performance over Apache Hadoop equivalent to that of an analytic DBMS over its own proprietary storage system. We believed this was an important milestone because Impala’s objective has been to support a high-quality BI experience on Hadoop data, not to produce a “faster Apache Hive.” An enterprise-quality BI experience requires low latency and high concurrency (among other things), so surpassing a well-known proprietary MPP DBMS in these areas was important evidence of progress.
 
In the past nine months, we’ve also all seen additional public validation that the original technical design for Hive, while effective for batch processing, was a dead-end for BI workloads. Recent examples have included the launch of Facebook’s Presto engine (Facebook was the inventor and world’s largest user of Hive), the emergence of Shark (Hive running on the Apache Spark DAG), and the “Stinger” initiative (Hive running on the Apache Tez [incubating] DAG).
 
Given the introduction of a number of new SQL-on-Hadoop implementations it seemed like a good time to do a roundup of the latest versions of each engine to see how they differ. We find that Impala maintains a significant performance advantage over the various other open source alternatives — ranging from 5x to 23x depending on the workload and the implementations that are compared. This advantage is due to some inherent design differences among the various systems, which we’ll explain below. Impala’s advantage is strongest for multi-user workloads, which arguably is the most relevant measure for users evaluating their options for BI use cases.

Configuration

Cluster

How-to: Manage Time-Dependent Multilayer Networks in Apache Hadoop

Using an appropriate network representation and the right tool set are the key factors in successfully merging structured and time-series data for analysis.

In Part 1 of this series, you took your first steps for using Apache Giraph, the highly scalable graph-processing system, alongside Apache Hadoop. In this installment, you’ll explore a general use case for analyzing time-dependent, Big Data graphs using data from multiple sources. You’ll learn how to generate random large graphs and small-world networks using Giraph – as well as play with several parameters to probe the limits of your cluster.

Congratulations to Parquet, Now an Apache Incubator Project

Yesterday, Parquet was accepted into the Apache Incubator. Congratulations to all the contributors to what will eventually become Apache Parquet!

In its relatively short lifetime (co-founded by Twitter and Cloudera in July 2013), Parquet has already become the de facto standard for columnar storage of Apache Hadoop data — with native support in Impala, Apache Hive, Apache Pig, Apache Spark, MapReduce, Apache Tajo, Apache Drill, Apache Crunch, and Cascading (and forthcoming in Presto and Shark). Parquet adoption is also broad-based, with employees of the following companies (partial list) actively contributing:

How-to: Configure JDBC Connections in Secure Apache Hadoop Environments

Learn how HiveServer, Apache Sentry, and Impala help make Hadoop play nicely with BI tools when Kerberos is involved.

In 2010, I wrote a simple pair of blog entries outlining the general considerations behind using Apache Hadoop with BI tools. The Cloudera partner ecosystem has positively exploded since then, and the technology has matured as well. Today, if JDBC is involved, all the pieces needed to expose Hadoop data through familiar BI tools are available:

How-to: Convert Existing Data into Parquet

Learn how to convert your data to the Parquet columnar format to get big performance gains.

Using a columnar storage format for your data offers significant performance advantages for a large subset of real-world queries. (Click here for a great introduction.)

Meet the Data Scientist: Alan Paulsen

Meet Alan Paulsen, among the first to earn the CCP: Data Scientist distinction.

Big Data success requires professionals who can prove their mastery with the tools and techniques of the Apache Hadoop stack. However, experts predict a major shortage of advanced analytics skills over the next few years. At Cloudera, we’re drawing on our industry leadership and early corpus of real-world experience to address the Big Data talent gap with the Cloudera Certified Professional (CCP) program.

New Training: Design and Build Big Data Applications

Cloudera’s new “Designing and Building Big Data Applications” is a great springboard for writing apps for an enterprise data hub.

Cloudera’s vision of an enterprise data hub as a central, scalable repository for all your data is changing the notion of data warehousing. The best way to gain value from all of your data is by bringing more workloads to where the data lives. That place is Apache Hadoop.

Using Impala at Scale at Allstate

Our thanks to Don Drake (@dondrake), an independent technology consultant who is currently working as a Principal Big Data Consultant at Allstate Insurance, for the guest post below about his experiences with Impala.

It started with a simple request from one of the managers in my group at Allstate to put together a demo of Tableau connecting to Cloudera Impala. I had previously worked on Impala with a large dataset about a year ago while it was still in beta, and was curious to see how Impala had improved since then in features and stability.

How-to: Process Time-Series Data Using Apache Crunch

Did you know that using the Crunch API is a powerful option for doing time-series analysis?

Apache Crunch is a Java library for building data pipelines on top of Apache Hadoop. (The Crunch project was originally founded by Cloudera data scientist Josh Wills.) Developers can spend more time focused on their use case by using the Crunch API to handle common tasks such as joining data sets and chaining jobs together in a pipeline. At Cloudera, we are so enthusiastic about Crunch that we have included it in CDH 5! (You can get started with Apache Crunch here and here.)

How-to: Use the ShareLib in Apache Oozie (CDH 5)

The internals of Oozie’s ShareLib have changed recently (reflected in CDH 5.0.0). Here’s what you need to know.

In a previous blog post about one year ago, I explained how to use the Apache Oozie ShareLib in CDH 4. Since that time, things have changed about the ShareLib in CDH 5 (particularly directory structure), so some of the previous information is now obsolete. (These changes went upstream under OOZIE-1619.) 

This Month in the Ecosystem (April 2014)

Welcome to our eighth edition of “This Month in the Ecosystem,” a digest of highlights from April 2014 (never intended to be comprehensive; for completeness, see the excellent Hadoop Weekly).

More good news!

How Apache Hadoop YARN HA Works

Thanks to recent work upstream, YARN is now a highly available service. This post explains its architecture and configuration details.

YARN, the next-generation compute and resource management framework in Apache Hadoop, until recently had a single point of failure: the ResourceManager, which coordinates work in a YARN cluster. With planned (upgrades) or unplanned (node crashes) events, this central service, and YARN itself, could become unavailable.

HBaseCon 2014 is a Wrap!

HBaseCon 2014 is in the books. Thanks to all attendees, speakers, and sponsors!

HBaseCon 2014, much like a butterfly, lived for a short number of hours on Monday — but it certainly was beautiful while it lasted! (See photos here.)

A New Python Client for Impala

The new Python client for Impala will bring smiles to Pythonistas!

As a data scientist, I love using the Python data stack. I also love using Impala to work with very large data sets. But things that take me out of my Python workflow are generally considered hassles; so it’s annoying that my main options for working with Impala are to write shell scripts, use the Impala shell, and/or transfer query results by reading/writing local files to disk.

How-to: Extend Cloudera Manager with Custom Service Descriptors

Thanks to Jonathan Natkins of WibiData for the post below about how his company extended Cloudera Manager to manage Kiji. Learn more about Kiji and the organizations using it to build real-time HBase applications at Kiji Sessions, happening on May 6, 2014, the day after HBaseCon.

As a partner of Cloudera, WibiData sees Cloudera Manager’s new extensibility framework as one of the most exciting parts of Cloudera Enterprise 5. Cloudera Manager 5.0.0 provides the single-pane view that Apache Hadoop administrators and operators want to effectively manage a cluster of machines. Additionally, Cloudera Manager now offers tight integration for partners to plug into the CDH ecosystem, which benefits Cloudera as well as WibiData.

Bringing the Best of Apache Hive 0.13 to CDH Users

More than 300 bug fixes and stable features in Apache Hive 0.13 have already been backported into CDH 5.0.0.

Last week, the Hive community voted to release Hive 0.13. We’re excited about the continued efforts and progress in the project and the latest release — congratulations to all contributors involved!

Using Apache Hadoop and Impala with MySQL for Data Analysis

Thanks to Alexander Rubin of Percona for allowing us to re-publish the post below!

Apache Hadoop is commonly used for data analysis. It is fast for data loads and scalable. In a previous post I showed how to integrate MySQL with Hadoop. In this post I will show how to export a table from  MySQL to Hadoop, load the data to Cloudera Impala (columnar format), and run reporting on top of that. For the examples below, I will use the “ontime flight performance” data from my previous post.

Meet the Engineer: Andrei Savu

In this installment of “Meet the Engineer”, our subject is Andrei Savu!

What do you do at Cloudera?

Newer Posts Older Posts