Category Archives: Avro

Apache Flume Development Status Update

Categories: Avro Data Ingestion Flume General Hadoop HBase

Apache Flume is a scalable, reliable, fault-tolerant, distributed system designed to collect, transfer, and store massive amounts of event data into HDFS. Apache Flume recently graduated from the Apache Incubator to become a Top Level Project at Apache. Flume is designed to send data over multiple hops from the initial source(s) to the final destination(s). Click here for details of the basic architecture of Flume. In this article, we will discuss in detail some new components in Flume 1.x (also known as Flume NG)…
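The source-to-destination flow described above is wired together in a Flume agent's properties file. The following is a minimal sketch, not taken from the post: the agent and component names (a1, r1, c1, k1) and the HDFS path are illustrative assumptions.

```properties
# One agent (a1) with a netcat source feeding an HDFS sink through a memory channel
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: listen for newline-delimited events on a local port
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444
a1.sources.r1.channels = c1

# Channel: buffer events in memory between source and sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

# Sink: write events into HDFS
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode/flume/events
a1.sinks.k1.channel = c1
```

Multi-hop topologies are built by pointing one agent's sink (for example, an Avro sink) at the Avro source of the next agent.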

Read More

Apache Avro at RichRelevance

Categories: Avro Community Guest

This is a guest post from RichRelevance Principal Architect and Apache Avro PMC Chair Scott Carey.

In early 2010 at RichRelevance, we were searching for a new way to store our long-lived data that was compact, efficient, and maintainable over time. We had been using Hadoop for about a year, and had started with the basics: text formats and SequenceFiles. Neither of these was sufficient. Text formats are not compact enough…
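The "compact, efficient, and maintainable over time" goals map to Avro's binary encoding and schema evolution rules. A hypothetical schema (the record and field names below are invented for illustration, not from RichRelevance) might look like this; the defaulted `referrer` field is the kind of addition that old readers and writers can tolerate:

```json
{
  "type": "record",
  "name": "PageView",
  "namespace": "com.example",
  "fields": [
    {"name": "user_id", "type": "long"},
    {"name": "url", "type": "string"},
    {"name": "referrer", "type": ["null", "string"], "default": null}
  ]
}
```

Because every Avro data file embeds the schema it was written with, readers can resolve old data against a newer schema using defaults like the one above.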

Read More

Apache Flume – Architecture of Flume NG

Categories: Avro Community Data Ingestion Flume General Hadoop

This blog was originally posted on the Apache Blog:

Apache Flume is a distributed, reliable, and available system for efficiently collecting, aggregating, and moving large amounts of log data from many different sources to a centralized data store. Flume is currently undergoing incubation at The Apache Software Foundation; more information on the project can be found on its Apache Incubator page. Flume NG is the work on a new major revision of Flume and is the subject of this post.

Read More

Hadoop World 2011: A Glimpse into Development

Categories: Avro Careers CDH Community Flume General Hadoop HBase HDFS Hive MapReduce Oozie Pig Sqoop Training Use Case ZooKeeper

The Development track at Hadoop World is a technical deep dive dedicated to discussion of Apache Hadoop and application development for Apache Hadoop. You will hear committers, contributors, and expert users from various Hadoop projects discuss the finer points of building applications with Hadoop and the related ecosystem. The sessions will touch on foundational topics such as HDFS, HBase, Pig, Hive, Flume, and other related technologies. In addition, speakers will address key development areas including tools…

Read More

Introducing Crunch: Easy MapReduce Pipelines for Apache Hadoop

Categories: Avro General Hadoop MapReduce

As a data scientist at Cloudera, I work with customers across a wide range of industries that use Apache Hadoop to solve their business problems. Many of the solutions we create involve multi-stage pipelines of MapReduce jobs that join, clean, aggregate, and analyze enormous amounts of data. When working with log files or relational database tables, we use high-level tools like Apache Pig and Apache Hive for their convenient and powerful support for creating pipelines over structured and semi-structured records.
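The clean/aggregate stages of such a pipeline can be sketched in plain Python to show the shape of the computation; the function names and sample records below are invented for illustration, and Crunch itself expresses these stages as operations on Java PCollections rather than as local functions like these.

```python
from collections import defaultdict

def clean(records):
    """Stage 1: drop malformed records (missing required fields)."""
    return [r for r in records if "user" in r and "bytes" in r]

def aggregate(records):
    """Stage 2: sum bytes per user (the shuffle + reduce of a MapReduce job)."""
    totals = defaultdict(int)
    for r in records:
        totals[r["user"]] += r["bytes"]
    return dict(totals)

logs = [
    {"user": "alice", "bytes": 200},
    {"user": "bob", "bytes": 150},
    {"user": "alice", "bytes": 50},
    {"bytes": 10},  # malformed: no user field, dropped by clean()
]
print(aggregate(clean(logs)))  # {'alice': 250, 'bob': 150}
```

Chaining such stages is exactly the pattern that becomes tedious as raw MapReduce jobs, which is the pain point a pipeline library like Crunch targets.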

Read More