Cloudera Blog · Flume Posts
How-to: Analyze Twitter Data with Hue
Hue 2.2, the open source web-based interface that makes Apache Hadoop easier to use, lets you interact with Hadoop services from within your browser without having to go to a command-line interface. It features different applications like an Apache Hive editor and Apache Oozie dashboard and workflow builder.
This post is based on our “Analyzing Twitter Data with Hadoop” sample app and details how the same results can be achieved through Hue in a simpler way. Moreover, all the code and examples of the previous series have been updated to the recent CDH4.2 release.
Collecting Data
The first step is to create the “flume” user and its home directory on HDFS, where the data will be stored. This can be done via the User Admin application.
Meet the Engineer: Kathleen Ting
In this installment of “Meet the Engineer”, get to know Customer Operations Engineering Manager/Apache Sqoop committer Kathleen Ting (@kate_ting).
What do you do at Cloudera, and in what open-source projects are you involved?
I’m a support manager at Cloudera, and an Apache Sqoop committer and PMC member. I also contribute to the Apache Flume and Apache ZooKeeper mailing lists and organize and present at meetups, as well as speak at conferences, about those projects.
My role follows a hybrid “player/coach” model: in addition to managerial work such as leading a team and addressing customer escalations, I also answer customer support cases directly, which is a fairly unusual combination. The approach is effective because it gives me direct insight into customer concerns that I otherwise wouldn’t get, helps me stay grounded, and ensures I appreciate first-hand the work the team is doing.
How-to: Do Apache Flume Performance Tuning (Part 1)
The post below was originally published via blogs.apache.org and is republished below for your reading pleasure.
This is Part 1 in a series of articles about tuning the performance of Apache Flume, a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of event data.
To kick off this series, I’d like to discuss some important Flume concepts that come into play when tuning your Flume flows for maximum performance: the channel and the transaction batch size.
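To make the relationship between the channel and the transaction batch size concrete, here is a minimal Python sketch. It is a toy model only, not Flume's API: the `ToyChannel` class, the `move_events` helper, and the idea of counting one "commit" per batch put or take are all assumptions made for illustration.

```python
from collections import deque


class ToyChannel:
    """Bounded in-memory buffer with all-or-nothing batch puts.

    Hypothetical stand-in for a Flume channel; not the real API.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.events = deque()
        self.commits = 0  # number of committed transactions (puts and takes)

    def put_batch(self, batch):
        # A batch is accepted only if the whole batch fits in the channel.
        if len(self.events) + len(batch) > self.capacity:
            return False
        self.events.extend(batch)
        self.commits += 1
        return True

    def take_batch(self, batch_size):
        # Remove up to batch_size events in a single transaction.
        n = min(batch_size, len(self.events))
        batch = [self.events.popleft() for _ in range(n)]
        if batch:
            self.commits += 1
        return batch


def move_events(events, batch_size, capacity=1000):
    """Push events through the channel and report how many commits it took."""
    channel = ToyChannel(capacity)
    delivered = []
    for start in range(0, len(events), batch_size):
        batch = events[start:start + batch_size]
        if not channel.put_batch(batch):
            raise RuntimeError("channel full")
        delivered.extend(channel.take_batch(batch_size))
    return delivered, channel.commits
```

In this model, moving 100 events with a batch size of 1 takes 200 commits, while a batch size of 10 takes 20. If each commit carries a fixed cost, larger batches spread that cost over more events, which is the intuition behind treating batch size as a tuning knob.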
Setting Up a Data Flow
Apache Hadoop in 2013: The State of the Platform
For several good reasons, 2013 is a Happy New Year for Apache Hadoop enthusiasts.
In 2012, we saw continued progress on developing the next generation of the MapReduce processing framework (MRv2), work that will bear fruit this year. HDFS experienced major progress toward becoming a lights-out, fully enterprise-ready distributed filesystem with the addition of high availability features and increased performance. And a hint of the future of the Hadoop platform was provided with the Beta release of Cloudera Impala, a real-time query engine for analytics across HDFS and Apache HBase data.
Let’s look at the highlights of the 2012 developments around projects supported by Cloudera.
Apache Hadoop Releases
Streaming Data into Apache HBase using Apache Flume
The following post was originally published via blog.apache.org; we are re-publishing it here.
Apache Flume was conceived as a fault-tolerant ingest system for the Apache Hadoop ecosystem. Flume comes packaged with an HDFS Sink which can be used to write events into HDFS, and two different implementations of HBase sinks to write events into Apache HBase. You can read about the basic architecture of Apache Flume 1.x in this blog post. You can also read about how Flume’s File Channel persists events and still provides extremely high performance in an earlier blog post. In this article, we will explore how to configure Flume to write events into HBase, and write custom serializers to write events into HBase in a format of the user’s choice.
Data is stored in HBase as tables. Each table has one or more column families, and each column family has one or more columns. HBase stores all columns in a column family in physical proximity. Each row is identified by a key known as the row key. To insert data into HBase, the table name, column family, column name and row key have to be specified. More details on the HBase data model can be found in the HBase documentation.
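The data model described above can be sketched with nested dictionaries. This is an illustration of the table, column family, column, and row key relationship only; `ToyTable` and its `put` and `get` methods are hypothetical names and do not correspond to the HBase client API.

```python
from collections import defaultdict


class ToyTable:
    """Toy model of an HBase-style table (hypothetical, for illustration)."""

    def __init__(self, name, column_families):
        self.name = name
        # Column families are fixed when the table is defined.
        self.column_families = set(column_families)
        # row key -> column family -> column name -> value
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, family, column, value):
        # An insert needs the row key, column family, and column name.
        if family not in self.column_families:
            raise KeyError(f"unknown column family: {family}")
        self.rows[row_key][family][column] = value

    def get(self, row_key, family, column):
        # Returns None when the cell does not exist.
        return self.rows.get(row_key, {}).get(family, {}).get(column)


table = ToyTable("tweets", ["user", "text"])
table.put("row-001", "user", "screen_name", "joe")
table.put("row-001", "text", "body", "hello")
```

Every write names all four coordinates (table, row key, column family, column), which mirrors the requirement stated in the paragraph above.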
Analyzing Twitter Data with Apache Hadoop, Part 2: Gathering Data with Flume
This is the second article in a series about analyzing Twitter data using some of the components of the Hadoop ecosystem available in CDH, Cloudera’s open-source distribution of Apache Hadoop and related projects. In the first article, you learned how to pull CDH components together into a single cohesive application, but to really appreciate the flexibility of each of these components, we need to dive deeper.
Every story has a beginning, and every data pipeline has a source. So, to build Hadoop applications, we need to get data from a source into HDFS.
Apache Flume is one way to bring data into HDFS using CDH. The Apache Flume website describes Flume as “a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.” At the most basic level, Flume enables applications to collect data from its origin and send it to a resting location, such as HDFS. At a slightly more detailed level, Flume achieves this goal by defining dataflows consisting of three primary structures: sources, channels and sinks. The pieces of data that flow through Flume are called events, and the processes that run the dataflow are called agents.
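As a rough mental model of the source, channel, and sink structure, the following Python sketch wires the three together in a single loop. It is not Flume code: `run_agent`, `drain`, and the use of plain strings as events are assumptions made for illustration.

```python
from collections import deque


def drain(channel, sink):
    """Hand every buffered event to the sink, oldest first."""
    while channel:
        sink(channel.popleft())


def run_agent(source, sink, channel_capacity=100):
    """Toy agent: read events from a source, buffer them, deliver to a sink.

    source: any iterable of events (here, plain strings)
    sink: a callable that receives one event at a time
    """
    channel = deque()  # the buffer between source and sink
    for event in source:
        channel.append(event)
        # Flush to the sink whenever the buffer reaches capacity.
        if len(channel) >= channel_capacity:
            drain(channel, sink)
    drain(channel, sink)  # deliver whatever is left


delivered = []
run_agent(iter(["tweet-1", "tweet-2", "tweet-3"]), delivered.append,
          channel_capacity=2)
```

The point of the channel in this sketch is decoupling: the source only ever appends to the buffer, and the sink only ever reads from it, so either side can be swapped without touching the other.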
About Apache Flume FileChannel
The post below was originally published via blogs.apache.org and is republished below for your reading pleasure.
This blog post is about Apache Flume’s File Channel. Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic application.
FileChannel is a persistent Flume channel that supports encryption and writing to multiple disks in parallel.
Overview
Schedule This! Strata + Hadoop World Speakers from Cloudera
Strata Conference + Hadoop World 2012 is just over a month away, so it’s time to start planning your schedule. You may want to consider adding the tutorials, sessions, and keynotes below to your calendar! (Start times are always subject to change, of course.)
The ones listed below are led or co-led by Clouderans, but there is certainly a wide range of attractive choices beyond what you see here. We just want to ensure that you put these particular ones high on your consideration list.
If you’re interested in community meetups as well, refer to my post from last week on that subject – several are planned.
| Session/Tutorial/Keynote Name | Presenter(s) | Date | Time |
| --- | --- | --- | --- |
| An Introduction to Hadoop | Mark Fei | Tues., Oct. 23 | 9am |
| Using HBase | Amandeep Khurana, Matteo Bertozzi | Tues., Oct. 23 | 9am |
| Testing Hadoop Applications | Tom Wheeler | Tues., Oct. 23 | 9am |
| Building a Large-scale Data Collection System Using Flume NG | Hari Shreedharan, Will McQueen, Arvind Prabhakar, Prasad Mujumdar, Mike Percy | Tues., Oct. 23 | 1:30pm |
| Given Enough Monkeys – Some Thoughts on Randomness | Jesse Anderson | Tues., Oct. 23 | 3:20pm |
| Keynote: Big Answers | Mike Olson | Weds., Oct. 24 | 8:55am |
| Large Scale ETL with Hadoop | Eric Sammer | Weds., Oct. 24 | 11:40am |
| HDFS – What is New and Future | Todd Lipcon (co-presenter) | Weds., Oct. 24 | 4:10pm |
| High Availability for the HDFS NameNode: Phase 2 | Aaron Myers, Todd Lipcon | Weds., Oct. 24 | 5pm |
| Plenary Session: Beyond Batch | Doug Cutting | Thurs., Oct. 25 | 9:20am |
| Upcoming Enterprise Features in Apache HBase 0.96 | Jon Hsieh | Thurs., Oct. 25 | 11:40am |
| Data Science on Hadoop: What’s There and What’s Missing | Justin Erickson | Thurs., Oct. 25 | 1:40pm |
| Taming the Elephant – Learn How Monsanto Manages Their Hadoop Cluster to Enable Genome/Sequence Processing | Bala Venkatrao, Aparna Ramani (with others) | Thurs., Oct. 25 | 4:10pm |
| Knitting Boar | Josh Patterson, Michael Katzenellenbogen | Thurs., Oct. 25 | 4:10pm |
Analyzing Twitter Data with Apache Hadoop
Social media has gained immense popularity with marketing teams, and Twitter is an effective tool for a company to get people excited about its products. Twitter makes it easy to engage users and communicate directly with them, and in turn, users can provide word-of-mouth marketing for companies by discussing the products. Given limited resources, and knowing we may not be able to talk to everyone we want to target directly, marketing departments can be more efficient by being selective about whom we reach out to.
In this post, we’ll learn how we can use Apache Flume, Apache HDFS, Apache Oozie, and Apache Hive to design an end-to-end data pipeline that will enable us to analyze Twitter data. This will be the first post in a series. The posts to follow will describe, in more depth, how each component is involved and how the custom code operates. All the code and instructions necessary to reproduce this pipeline are available on the Cloudera GitHub.
Who is Influential?
To understand whom we should target, let’s take a step back and try to understand the mechanics of Twitter. A user – let’s call him Joe – follows a set of people, and has a set of followers. When Joe sends an update out, that update is seen by all of his followers. Joe can also retweet other users’ updates. A retweet is a repost of an update, much like you might forward an email. If Joe sees a tweet from Sue, and retweets it, all of Joe’s followers see Sue’s tweet, even if they don’t follow Sue. Through retweets, messages can get passed much further than just the followers of the person who sent the original tweet. Knowing that, we can try to engage users whose updates tend to generate lots of retweets. Since Twitter tracks retweet counts for all tweets, we can find the users we’re looking for by analyzing Twitter data.
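The idea of finding users whose updates generate many retweets can be sketched in a few lines of Python. This is a simplified illustration, not the pipeline's actual query: the `rank_by_retweets` helper and the `user` and `retweet_count` field names are assumptions, and real Twitter data has a richer structure.

```python
from collections import Counter


def rank_by_retweets(tweets, top_n=3):
    """Sum retweet counts per user and return the top_n users.

    tweets: iterable of dicts with hypothetical 'user' and
    'retweet_count' fields.
    """
    totals = Counter()
    for tweet in tweets:
        totals[tweet["user"]] += tweet["retweet_count"]
    # most_common sorts users by their total, highest first.
    return totals.most_common(top_n)


sample = [
    {"user": "sue", "retweet_count": 5},
    {"user": "joe", "retweet_count": 2},
    {"user": "sue", "retweet_count": 1},
]
ranking = rank_by_retweets(sample)
```

With the sample data, `ranking` is `[("sue", 6), ("joe", 2)]`: Sue's two tweets were retweeted six times in total, so she would be the more attractive user to engage.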
CDH3 update 5 is now available
We are happy to announce the general availability of CDH3 update 5. This update is a maintenance release of the CDH3 platform and provides a considerable number of bug fixes and stability enhancements. Alongside these fixes, we have also included a few new features, most notable of which are the following: