How-to: Analyze Twitter Data with Hue


Hue 2.2, the open source web-based interface that makes Apache Hadoop easier to use, lets you interact with Hadoop services from within your browser without having to go to a command-line interface. It features applications such as an Apache Hive editor and an Apache Oozie dashboard and workflow builder.

This post is based on our “Analyzing Twitter Data with Hadoop” sample app and details how the same results can be achieved through Hue in a simpler way. Moreover, all the code and examples from the previous series have been updated for the recent CDH 4.2 release.

Collecting Data

The first step is to create the “flume” user and its home directory on HDFS, where the data will be stored. This can be done via the User Admin application.

The second step consists of collecting some tweet data from the live Twitter stream.

Apache Flume is an elegant solution for taking care of this. The configuration of Flume is detailed in the readme and in the previous blog post. However, if you want to skip this step, some data is available on GitHub. Just upload it as a zip file to the home directory of the flume user, and the “tweets” directory will show up after a few seconds.

If you are not taking this shortcut, create the tweets directory in the File Browser with the New Folder action.

Then, when the Flume agent is started, the data will start appearing:

Clicking on a file will display its content in the built-in viewer:

Preparing Hive

It is time to prepare the analysis of the tweet data. We’ll use Apache Hive, which can query the data with SQL-like syntax in a scalable way. The Hive setup is described in detail in the readme.

When Hive is ready, the tweets table can be created in the query editor of Beeswax. Notice that the Hive SerDe (to download or compile here) must be included as a jar in the query. You can read more about Hive SerDes in this previous post.

To do this, just click “Add” > “File Resources”, click the path chooser button, click the “Home” button, and upload hive-serdes-1.0-SNAPSHOT.jar.

Then just enter the CREATE TABLE statement and execute it:
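The statement itself appeared here as a screenshot; as a reference, here is a sketch of it, based on the schema used in the original “Analyzing Twitter Data with Hadoop” series and the com.cloudera.hive.serde.JSONSerDe class shipped in the jar added above (adjust the schema if your data differs):

  CREATE EXTERNAL TABLE tweets (
    id BIGINT,
    created_at STRING,
    source STRING,
    favorited BOOLEAN,
    retweeted_status STRUCT<
      text:STRING,
      user:STRUCT<screen_name:STRING,name:STRING>,
      retweet_count:INT>,
    entities STRUCT<
      urls:ARRAY<STRUCT<expanded_url:STRING>>,
      user_mentions:ARRAY<STRUCT<screen_name:STRING,name:STRING>>,
      hashtags:ARRAY<STRUCT<text:STRING>>>,
    text STRING,
    user STRUCT<
      screen_name:STRING,
      name:STRING,
      friends_count:INT,
      followers_count:INT,
      statuses_count:INT,
      verified:BOOLEAN,
      utc_offset:INT,
      time_zone:STRING>,
    in_reply_to_screen_name STRING
  )
  PARTITIONED BY (datehour INT)
  ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
  LOCATION '/user/flume/tweets';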


Now that the table is created, let’s load some data into it. First, select the table in the “Table” tab and click “Import data”. Enter the path “/user/flume/tweets/2013/02/25/17” and “2013022517” as the key:

Depending on the partition picked, a query similar to this will be generated:
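For the path and key entered above, the generated statement should look roughly like this (a sketch; the datehour partition column comes from the table definition above):

  LOAD DATA INPATH '/user/flume/tweets/2013/02/25/17'
  INTO TABLE tweets
  PARTITION (datehour = 2013022517);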


After the query executes, the table ‘tweets’ will be available.
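A quick sanity check from the editor (a sketch, assuming the table defined above) confirms that the partition is registered and that the SerDe can deserialize the tweets:

  SHOW PARTITIONS tweets;
  SELECT id, text FROM tweets LIMIT 5;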

Beeswax can access the Hive metastore and its list of tables. Descriptions of their schemas and partitions, with some sample data from each table, are helpful while designing your queries. Moreover, a wizard can guide you step by step through creating new tables.


Analysis with Beeswax

It is now possible to perform SELECT queries on the data. An example is shown below, but most of the interesting ones are described in Parts 1 and 3 of the “Analyzing Twitter Data with Hadoop” series.
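For instance, the query below, adapted from Part 1 of the series (it assumes the retweeted_status structure of the table sketched earlier), finds the users whose tweets are retweeted the most:

  SELECT t.retweeted_screen_name, SUM(retweets) AS total_retweets, COUNT(*) AS tweet_count
  FROM (SELECT retweeted_status.user.screen_name AS retweeted_screen_name,
               retweeted_status.text,
               MAX(retweeted_status.retweet_count) AS retweets
        FROM tweets
        GROUP BY retweeted_status.user.screen_name, retweeted_status.text) t
  GROUP BY t.retweeted_screen_name
  ORDER BY total_retweets DESC
  LIMIT 10;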


Beeswax has multiple features that provide a better user experience than the command-line shell. For example, you can save queries and share them with other users. The result of a query can be exported into a new table or an HDFS file, or downloaded to your desktop. Some other good examples are:

  • Ajax refresh of the logs
  • Quick column navigation on the result page
  • MapReduce jobs listing with a direct access to their logs
  • ‘Email me on completion’ setting
  • Multi-database support

Example of the screen while running a query:

Seeing the result of the query:

Note: if your queries are failing and you see an error like the one below, it means that you forgot to add '/user/flume/hive-serdes-1.0-SNAPSHOT.jar' to the query:



In this post we focused on how the Beeswax application makes it easy to execute Hive queries. New features such as multi-query (HUE-159), autocomplete, and syntax highlighting (HUE-1063) will improve the usability even more.

The next article in this series will elaborate on this topic and describe how Hue’s Apache Oozie application can be used for scheduling Hive queries in a few clicks.

Thank you for reading and feel free to post comments here or on the hue-user list. We also hope to see you at the first Hue meetup (this Wednesday, March 27)!

Romain Rigaux is a Software Engineer on the Platform team.


17 responses on “How-to: Analyze Twitter Data with Hue”

  1. Hendra Budiawan

    Hello, great article, but I have a problem with the query. Every time I execute it, I get this error:
    Your query has the following error(s):

    OK converting to local hdfs:// Added /tmp/hue/hive_resources/flume-sources-1.0-SNAPSHOT.jar to class path Added resource: /tmp/hue/hive_resources/flume-sources-1.0-SNAPSHOT.jar FAILED: ParseException line 9:25 mismatched input '>' expecting < near 'STRUCT' in struct type

    I have followed your steps, but still no luck. Can you give some suggestions to resolve this issue?

  2. Romain Rigaux Post author

    The query was eaten up during the publication of the post. Sorry, we just fixed it!

  3. Hendra Budiawan

    Great, thank you for the revision. I can go through to the next step now. Thank you so much.

  4. Dee

    I am having problems getting past the first step (creating the flume user). The error message I get is “Cannot make home directory for user XXXX”. This happens when I try to create a user and check the ‘create home directory’ check box as the directions say. Please help me figure out the problem.

  5. ignorant

    Quick question: if you have more data, would this table get updated automatically?
    If so, is it safe to change
    LOAD DATA INPATH '/user/flume/tweets/2013/02/25/16'
    to
    LOAD DATA INPATH '/user/flume/tweets'

  6. Romain Rigaux Post author

    This will update it, but it won’t be partitioned, so you will lose some performance and the convenience of the partitions.
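    For example, to keep the hourly partitions while leaving the data in place, each hour’s directory can be added as a partition instead of being loaded (a sketch, assuming the tweets table and datehour partition column from the post):

    ALTER TABLE tweets ADD IF NOT EXISTS
      PARTITION (datehour = 2013022516)
      LOCATION '/user/flume/tweets/2013/02/25/16';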

  7. ignorant

    Thanks, I will try the partition approach.

    FYI, using '/user/flume/tweets' did not work because when upload is pressed, it moves the data from this location to somewhere else. So it does not get updated automatically.

    I will play with Oozie as suggested in the blog.

  8. billou2k

    Hive seems to have issues with tweets that have "\n" in their text. Line breaks were enabled in Twitter back in March, and every time there is one, Hive creates a new row with NULL values in it.
    Is there a SerDe or table option to stop the "\n" being interpreted as a line break in the JSON data?

  9. khathu

    Hi, is there an updated SerDe jar file to deal with the above problem?

    If the tweet data is already in HDFS, how do you remove "\n" from the text so that I do not have a line-break problem when reading the data using Hive?

  10. khathu

    I’m failing to understand where exactly "\\\\n" is in the parseField() function and how I replace it.


    private Object parseField(Object field, TypeInfo fieldTypeInfo) {
      switch (fieldTypeInfo.getCategory()) {
      case PRIMITIVE:
        // Jackson will return the right thing in this case, so just return
        // the object
        return field;
      case LIST:
        return parseList(field, (ListTypeInfo) fieldTypeInfo);
      case MAP:
        return parseMap(field, (MapTypeInfo) fieldTypeInfo);
      case STRUCT:
        return parseStruct(field, (StructTypeInfo) fieldTypeInfo);
      case UNION:
        // Unsupported by JSON
        return null;
      default:
        return null;
      }
    }
  11. Prince


    I am having a problem running a query: SELECT COUNT(*) FROM tablename1;

    I get the error: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask

    NOTE: I did include the file hive-serdes-1.0-SNAPSHOT.jar

    But when I run SELECT * FROM tablename1; it works very well.

    The problem is: SELECT COUNT(*) FROM tablename1

  12. Romain Rigaux Post author

    There is something misconfigured when running MapReduce jobs (COUNT(*), LIMIT, etc. will trigger MapReduce jobs, whereas a simple SELECT * will just read the data).

    On the left, you should have a link to a MapReduce job. Can you click on it and look for errors in the logs via the Job Browser app?

    A common problem is missing the HDFS home directory of your user.

  13. PRINCE

    Hi ROMAIN,

    The HDFS home directory of my user is not missing, but I have found the following errors in the logs. Can you help me understand them?

    Caused by: org.codehaus.jackson.JsonParseException: Unexpected end-of-input within/between OBJECT entries

    Diagnostic Log:

    java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing writable {"filter_level":"medium","retweeted_status":{"contributors":null,"text":"BON MTN CHUI BALON DORE OU PA ?!","geo":null,"retweeted":false,"in_reply_to_screen_name":null,"truncated":false,"lang":"fr","entities":{"symbols":[],"urls":[],"hashtags":[],"user_mentions":[]},"in_reply_to_status_id_str":null,"id":390192476503302144,"source":"web","in_reply_to_user_id_str":null,"favorited":false,"in_reply_to_status_id":null,"retweet_count":293,"created_at":"Tue Oct 15 19:08:30 +0000 2013","in_reply_to_user_id":null,"favorite_count":26,"id_str":"390192476503302144","place":null,"user":{"location":"Chez Zahia","default_profile":false,"statuses_count":12610,"profile_background_tile":true,"lang":"fr","profile_link_color":"0004FF","profile_banner_url":"","id":1292691356,"following":null,"favourites_count":288,"protected":false,"profile_text_color"