Author Archives: Brock Noland

Native Parquet Support Comes to Apache Hive

Categories: Hive Impala Parquet

Bringing Parquet support to Hive was a community effort that deserves congratulations!

Previously, this blog introduced Parquet, an efficient ecosystem-wide columnar storage format for Apache Hadoop. As discussed in that blog post, Parquet encodes data extremely efficiently, in the manner described in Google’s original Dremel paper. (For more technical details on the Parquet format, read “Dremel made simple with Parquet”, or go directly to the open and community-driven Parquet Format specification.)

Before discussing the Parquet Hive integration,


About Apache Flume FileChannel

Categories: Data Ingestion Flume General

The post below was originally published via blogs.apache.org and is republished here for your reading pleasure.

This blog post is about Apache Flume’s File Channel. Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms.


Crunch for Dummies

Categories: General

This guide is intended to be an introduction to Crunch.

Introduction

Crunch is used for processing data. Crunch builds on top of Apache Hadoop to provide a simpler interface for Java programmers to process data. In Crunch you create pipelines, not unlike Unix pipelines, such as the command below:

Crunch pipelines consist of a series of functions you apply to the input data. Let’s say you have raw Apache HTTPD server logs and you want to know the total amount of data downloaded per IP address.
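To make the log example concrete, here is a minimal sketch of the same idea as a chain of small functions. It is written in plain Python and does not use the Crunch API; the function names, the regular expression, and the sample log lines are all assumptions made for illustration, and the lines are assumed to be in Apache's Common Log Format.

```python
import re
from collections import defaultdict

# Assumed input: Apache Common Log Format, e.g.
# 10.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /a.gif HTTP/1.0" 200 2326
LOG_PATTERN = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} (\d+|-)')


def extract_ip_and_bytes(lines):
    """Stage 1: turn each raw log line into an (ip, bytes) pair."""
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue  # skip lines that do not look like log entries
        ip, size = match.groups()
        # A size of "-" means no response body was sent.
        yield ip, 0 if size == "-" else int(size)


def total_bytes_by_ip(pairs):
    """Stage 2: group the pairs by IP address and sum the byte counts."""
    totals = defaultdict(int)
    for ip, size in pairs:
        totals[ip] += size
    return dict(totals)


# Hypothetical sample data, including one malformed line.
SAMPLE_LOG = [
    '10.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /a.gif HTTP/1.0" 200 2326',
    '10.0.0.2 - - [10/Oct/2000:13:55:37 -0700] "GET /b.gif HTTP/1.0" 200 100',
    '10.0.0.1 - - [10/Oct/2000:13:55:38 -0700] "GET /c.gif HTTP/1.0" 304 -',
    '10.0.0.1 - - [10/Oct/2000:13:55:39 -0700] "GET /d.gif HTTP/1.0" 200 674',
    'not a log line',
]

# Chain the stages, much like commands in a Unix pipeline.
totals = total_bytes_by_ip(extract_ip_and_bytes(SAMPLE_LOG))
print(totals)
```

Each stage takes the output of the previous one, which mirrors the pipeline structure described above. A real Crunch pipeline would express these steps with Crunch's own Java types and run them on Hadoop.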
