Cloudera Blog · Flume Posts

How-to: Analyze Twitter Data with Hue

Hue 2.2 , the open source web-based interface that makes Apache Hadoop easier to use, lets you interact with Hadoop services from within your browser without having to go to a command-line interface. It features different applications like an Apache Hive editor and Apache Oozie dashboard and workflow builder.

This post is based on our “Analyzing Twitter Data with Hadoop” sample app and details how the same results can be achieved through Hue in a simpler way. Moreover, all the code and examples of the previous series have been updated to the recent CDH4.2 release.

Collecting Data

The first step is to create the “flume” user and his home on the HDFS where the data will be stored. This can be done via the User Admin application.

Meet the Engineer: Kathleen Ting

In this installment of “Meet the Engineer”, get to know Customer Operations Engineering Manager/Apache Sqoop committer Kathleen Ting (@kate_ting).

What do you do at Cloudera, and in what open-source projects are you involved?
I’m a support manager at Cloudera, and an Apache Sqoop committer and PMC member. I also contribute to the Apache Flume and Apache ZooKeeper mailing lists and organize and present at meetups, as well as speak at conferences, about those projects.

My role is a hybrid “player/coach” model: in addition to doing managerial things like leading a team and addressing customer escalations, I also answer customer support cases directly, which is a fairly unique combination. This is an effective approach: giving me direct insights into customer concerns that I otherwise wouldn’t get, helping me stay grounded, and ensuring I appreciate the work the team is doing, first-hand.

How-to: Do Apache Flume Performance Tuning (Part 1)

The post below was originally published via blogs.apache.org and is republished below for your reading pleasure.

This is Part 1 in a series of articles about tuning the performance of Apache Flume, a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of event data.

To kick off this series, I’d like to start off discussing some important Flume concepts that come into play when tuning your Flume flows for maximum performance: the channel and the transaction batch size.

Setting Up a Data Flow

Apache Hadoop in 2013: The State of the Platform

For several good reasons, 2013 is a Happy New Year for Apache Hadoop enthusiasts.

In 2012, we saw continued progress on developing the next generation of the MapReduce processing framework (MRv2), work that will bear fruit this year. HDFS experienced major progress toward becoming a lights-out, fully enterprise-ready distributed filesystem with the addition of high availability features and increased performance. And a hint of the future of the Hadoop platform was provided with the Beta release of Cloudera Impala, a real-time query engine for analytics across HDFS and Apache HBase data.

Let’s look at the highlights of the 2012 developments around projects supported by Cloudera.

Apache Hadoop Releases

Streaming Data into Apache HBase using Apache Flume

The following post was originally published via blog.apache.org; we are re-publishing it here.

Apache Flume was conceived as a fault-tolerant ingest system for the Apache Hadoop ecosystem. Flume comes packaged with an HDFS Sink which can be used to write events into HDFS, and two different implementations of HBase sinks to write events into Apache HBase. You can read about the basic architecture of Apache Flume 1.x in this blog post. You can also read about how Flume’s File Channel persists events and still provides extremely high performance in an earlier blog post. In this article, we will explore how to configure Flume to write events into HBase, and write custom serializers to write events into HBase in a format of the user’s choice.

Data is stored in HBase as tables. Each table has one or more column families, and each column family has one or more columns. HBase stores all columns in a column family in physical proximity. Each row is identified by a key known as the row key. To insert data into HBase, the table name, column family, column name and row key have to be specified. More details on the HBase data model can be found in the HBase documentation.

Analyzing Twitter Data with Apache Hadoop, Part 2: Gathering Data with Flume

This is the second article in a series about analyzing Twitter data using some of the components of the Hadoop ecosystem available in CDH, Cloudera’s open-source distribution of Apache Hadoop and related projects. In the first article, you learned how to pull CDH components together into a single cohesive application, but to really appreciate the flexibility of each of these components, we need to dive deeper.

Every story has a beginning, and every data pipeline has a source. So, to build Hadoop applications, we need to get data from a source into HDFS.

Apache Flume is one way to bring data into HDFS using CDH. The Apache Flume website describes Flume as “a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.” At the most basic level, Flume enables applications to collect data from its origin and send it to a resting location, such as HDFS. At a slightly more detailed level, Flume achieves this goal by defining dataflows consisting of three primary structures: sources, channels and sinks. The pieces of data that flow through Flume are called events, and the processes that run the dataflow are called agents.

About Apache Flume FileChannel

The post below was originally published via blogs.apache.org and is republished below for your reading pleasure.

This blog post is about Apache Flume’s File Channel. Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic application.

FileChannel is a persistent Flume channel that supports writing to multiple disks in parallel and encryption.

Overview

Schedule This! Strata + Hadoop World Speakers from Cloudera

We’re getting really close to Strata Conference + Hadoop World 2012 (just over a month away), schedule planning-wise. So you may want to consider adding the tutorials, sessions, and keynotes below to your calendar! (Start times are always subject to change of course.)

The ones listed below are led or co-led by Clouderans, but there is certainly a wide range of attractive choices beyond what you see here. We just want to ensure that you put these particular ones high on your consideration list.

If you’re interested in community meetups as well, refer to my post from last week on that subject – several are planned. 

Session/Tutorial/Keynote Name Presenter(s) Date Time
An Introduction to Hadoop Mark Fei Tues., Oct. 23 9am
Using HBase Amandeep Khurana, Matteo Bertozzi Tues., Oct. 23 9am
Testing Hadoop Applications Tom Wheeler Tues., Oct. 23 9am
Building a Large-scale Data Collection System Using Flume NG Hari Shreedharan, Will McQueen, Arvind Prabhakar, Prasad Mujumdar, Mike Percy Tues., Oct. 23 1:30pm
Given Enough Monkeys – Some Thoughts on Randomness Jesse Anderson Tues., Oct. 23 3:20pm
Keynote: Big Answers Mike Olson Weds., Oct. 24 8:55am
Large Scale ETL with Hadoop Eric Sammer Weds., Oct. 24 11:40am
HDFS – What is New and Future Todd Lipcon (co-presenter) Weds., Oct. 24 4:10pm
High Availability for the HDFS NameNode: Phase 2 Aaron Myers, Todd Lipcon Weds., Oct. 24 5pm
Plenary Session: Beyond Batch Doug Cutting Thurs., Oct. 25 9:20am
Upcoming Enterprise Features in Apache HBase 0.96 Jon Hsieh Thurs., Oct. 25 11:40am
Data Science on Hadoop: What’s There and What’s Missing Justin Erickson Thurs., Oct. 25 1:40pm
Taming the Elephant – Learn How Monsanto Manages Their Hadoop Cluster to Enable Genome/Sequence Processing Bala Venkatrao, Aparna Ramani (with others) Thurs., Oct. 25 4:10pm
Knitting Boar Josh Patterson, Michael Katzenellenbogen Thurs., Oct. 25 4:10pm

Analyzing Twitter Data with Apache Hadoop

Social media has gained immense popularity with marketing teams, and Twitter is an effective tool for a company to get people excited about its products. Twitter makes it easy to engage users and communicate directly with them, and in turn, users can provide word-of-mouth marketing for companies by discussing the products. Given limited resources, and knowing we may not be able to talk to everyone we want to target directly, marketing departments can be more efficient by being selective about whom we reach out to.

In this post, we’ll learn how we can use Apache Flume, Apache HDFS, Apache Oozie, and Apache Hive to design an end-to-end data pipeline that will enable us to analyze Twitter data. This will be the first post in a series. The posts to follow to will describe, in more depth, how each component is involved and how the custom code operates. All the code and instructions necessary to reproduce this pipeline is available on the Cloudera Github.

Who is Influential?

To understand whom we should target, let’s take a step back and try to understand the mechanics of Twitter. A user – let’s call him Joe – follows a set of people, and has a set of followers. When Joe sends an update out, that update is seen by all of his followers. Joe can also retweet other users’ updates. A retweet is a repost of an update, much like you might forward an email. If Joe sees a tweet from Sue, and retweets it, all of Joe’s followers see Sue’s tweet, even if they don’t follow Sue. Through retweets, messages can get passed much further than just the followers of the person who sent the original tweet. Knowing that, we can try to engage users whose updates tend to generate lots of retweets. Since Twitter tracks retweet counts for all tweets, we can find the users we’re looking for by analyzing Twitter data.

CDH3 update 5 is now available

We are happy to announce the general availability of CDH3 update 5. This update is a maintenance release of CDH3 platform and provides a considerable amount of bug-fixes and stability enhancements. Alongside these fixes, we have also included a few new features, most notable of which are the following:

Older Posts