Category Archives: How-to

How-to: Use Apache ZooKeeper to Build Distributed Apps (and Why)

Categories: How-to ZooKeeper

It’s widely accepted that you should never design or implement your own cryptographic algorithms but rather use well-tested, peer-reviewed libraries instead. The same can be said of distributed systems: Making up your own protocols for coordinating a cluster will almost certainly result in frustration and failure.

Architecting a distributed system is not a trivial problem; it is very prone to race conditions, deadlocks, and inconsistency. Making cluster coordination fast and scalable is just as hard as making it reliable.

Read more

From Zero to Impala in Minutes

Categories: Cloud Guest How-to Impala

This post was originally published by U.C. Berkeley AMPLab developer (and former Clouderan) Matt Massie on his personal blog. Matt has graciously permitted us to republish it here for your convenience.

Note: The post below is valid for Impala version 0.6 only and is not being maintained for subsequent releases. To deploy Impala 0.7 and later using a much easier (and also free) method, use this how-to.

Read more

How-to: Do Apache Flume Performance Tuning (Part 1)

Categories: CDH Flume General How-to

The post below was originally published via blogs.apache.org and is republished here for your reading pleasure.

This is Part 1 in a series of articles about tuning the performance of Apache Flume, a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of event data.

To kick off this series, I’d like to discuss two important Flume concepts that come into play when tuning your Flume flows for maximum performance: the channel and the transaction batch size.

Read more

How-to: Use a SerDe in Apache Hive

Categories: Hive How-to

Apache Hive is a fantastic tool for performing SQL-style queries across data that is often not appropriate for a relational database. For example, semistructured and unstructured data can be queried gracefully via Hive, thanks to two core features: The first is Hive’s support for complex data types, such as structs, arrays, and unions, in addition to many of the common data types found in most relational databases. The second feature is the SerDe.

Read more

How-to: Use the ShareLib in Apache Oozie

Categories: How-to MapReduce Oozie

Ed. Note: The post below pertains to CDH 4.x only. Read this post for updates concerning CDH 5.x.

As Apache Oozie, the workflow engine for Apache Hadoop, continues to receive wider adoption from our customers and the community, we’re seeing patterns with respect to the biggest challenges for users. One such point of difficulty is setting up and using Oozie’s ShareLib for allowing JARs to be shared by different workflows.

Read more