How-to: Analyze Twitter Data with Hue
Hue 2.2 , the open source web-based interface that makes Apache Hadoop easier to use, lets you interact with Hadoop services from within your browser without having to go to a command-line interface. It features different applications like an Apache Hive editor and Apache Oozie dashboard and workflow builder.
This post is based on our “Analyzing Twitter Data with Hadoop” sample app and details how the same results can be achieved through Hue in a simpler way. Moreover, all the code and examples of the previous series have been updated to the recent CDH4.2 release.
The first step is to create the “flume” user and his home on the HDFS where the data will be stored. This can be done via the User Admin application.
The second step consists of collecting some tweet data from the live Twitter stream.
Apache Flume is an elegant solution for taking care of this. The configuration of Flume is detailed in the readme and previous blog post. However, if you want to skip this step, some data is available on GitHub. Just upload it as a zip file in the home directory of the flume user and the “tweets” directory will show up after a few seconds.
If you are not taking this shortcut, create the tweets directory in the File Browser with the New Folder action.
Then, when the Flume agent is started, the data will start appearing:
Clicking on a file will display its content in the built-in viewer:
It is time to prepare the analysis of the tweet data. We’ll use Apache Hive, which can query the data with SQL-like syntax in a scalable way. The detailed description of the Hive setup is detailed in the readme.
When Hive is ready, the tweet table can be created in the query editor of Beeswax. Notice that the Hive SerDe (to download or compile here) must be included as a jar in the query. You can read more about Hive SerDe in this previous post.
To do this, just click on “Add” > “File Resources”, click on the path chooser button, click on the “Home” button, and upload hive-serdes-1.0-SNAPSHOT.jar.
Then just enter the
CREATE TABLE statement and execute it:
CREATE EXTERNAL TABLE tweets ( id BIGINT, created_at STRING, source STRING, favorited BOOLEAN, retweet_count INT, retweeted_status STRUCT< text:STRING, user:STRUCT<screen_name:STRING,name:STRING>>, entities STRUCT< urls:ARRAY<STRUCT<expanded_url:STRING>>, user_mentions:ARRAY<STRUCT<screen_name:STRING,name:STRING>>, hashtags:ARRAY<STRUCT<text:STRING>>>, text STRING, user STRUCT< screen_name:STRING, name:STRING, friends_count:INT, followers_count:INT, statuses_count:INT, verified:BOOLEAN, utc_offset:INT, time_zone:STRING>, in_reply_to_screen_name STRING ) PARTITIONED BY (datehour INT) ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe' LOCATION '/user/flume/tweets'
Now that the table is created, let’s insert some data in the table. First, select the table in the “Table” tab and click “Import data”. Enter the path “/user/flume/tweets/2013/02/25/17″ and “201302251″ as the key:
Depending on the partition picked, a query similar to this will be generated:
LOAD DATA INPATH '/user/flume/tweets/2013/02/25/16' INTO TABLE `default.tweets` PARTITION (datehour='2013022516')
After the query executes, the table ‘tweets’ will be available.
Beeswax can access the Hive metastore and its list of tables. A description of their schema and partitions with some example of data contained in each table are helpful while designing your queries. Moreover, a wizard can guide you step-by-step to create new tables.
Analysis with Beeswax
SELECT t.retweeted_screen_name, sum(retweets) AS total_retweets, count(*) AS tweet_count FROM (SELECT retweeted_status.user.screen_name as retweeted_screen_name, retweeted_status.text, max(retweet_count) as retweets FROM tweets GROUP BY retweeted_status.user.screen_name, retweeted_status.text) t GROUP BY t.retweeted_screen_name ORDER BY total_retweets DESC LIMIT 10;
Beeswax possesses multiple features for providing a better user experience than the command line shell. For example you can save queries and share them with other users. The result of a query can be exported into a new table or an HDFS file or downloaded to your desktop. Some other good examples are:
- Ajax refresh of the logs
- Quick column navigation on the result page
- MapReduce jobs listing with a direct access to their logs
- ‘Email me on completion’ setting
- Multi-database support
Example of the screen while running query:
Seeing the result of the query:
Note: if your queries are failing and you are seeing an error like below, it means that you forgot to add the ‘/user/flume/hive-serdes-1.0-SNAPSHOT.jar’ to the query:
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask
In this post we focused on how the Beeswax application can make it easy to execute Hive queries. New features such as multi-query (HUE-159), autocomplete, and syntax highlighting (HUE-1063) are going to improve the usability even more.
The next article in this series will elaborate on this topic and describe how Hue’s Apache Oozie application can be used for scheduling Hive queries in a few clicks.
Romain Rigaux is a Software Engineer on the Platform team.