How-to: Analyze Twitter Data with Hue

Hue 2.2 , the open source web-based interface that makes Apache Hadoop easier to use, lets you interact with Hadoop services from within your browser without having to go to a command-line interface. It features different applications like an Apache Hive editor and Apache Oozie dashboard and workflow builder.

This post is based on our “Analyzing Twitter Data with Hadoop” sample app and details how the same results can be achieved through Hue in a simpler way. Moreover, all the code and examples of the previous series have been updated to the recent CDH4.2 release.

Collecting Data

The first step is to create the “flume” user and his home on the HDFS where the data will be stored. This can be done via the User Admin application.

The second step consists of collecting some tweet data from the live Twitter stream.

Apache Flume is an elegant solution for taking care of this. The configuration of Flume is detailed in the readme and previous blog post. However, if you want to skip this step, some data is available on GitHub. Just upload it as a zip file in the home directory of the flume user and the “tweets” directory will show up after a few seconds.

If you are not taking this shortcut, create the tweets directory in the File Browser with the New Folder action.

Then, when the Flume agent is started, the data will start appearing:

Clicking on a file will display its content in the built-in viewer:

Preparing Hive

It is time to prepare the analysis of the tweet data. We’ll use Apache Hive, which can query the data with SQL-like syntax in a scalable way. The detailed description of the Hive setup is detailed in the readme.

When Hive is ready, the tweet table can be created in the query editor of Beeswax. Notice that the Hive SerDe (to download or compile here) must be included as a jar in the query. You can read more about Hive SerDe in this previous post.

To do this, just click on “Add” > “File Resources”, click on the path chooser button, click on the “Home” button, and upload hive-serdes-1.0-SNAPSHOT.jar.

Then just enter the CREATE TABLE statement and execute it:

CREATE EXTERNAL TABLE tweets (
  id BIGINT,
  created_at STRING,
  source STRING,
  favorited BOOLEAN,
  retweet_count INT,
  retweeted_status STRUCT<
    text:STRING,
    user:STRUCT<screen_name:STRING,name:STRING>>,
  entities STRUCT<
    urls:ARRAY<STRUCT<expanded_url:STRING>>,
    user_mentions:ARRAY<STRUCT<screen_name:STRING,name:STRING>>,
    hashtags:ARRAY<STRUCT<text:STRING>>>,

  text STRING,
  user STRUCT<
    screen_name:STRING,
    name:STRING,
    friends_count:INT,
    followers_count:INT,
    statuses_count:INT,
    verified:BOOLEAN,
    utc_offset:INT,
    time_zone:STRING>,
  in_reply_to_screen_name STRING
)
PARTITIONED BY (datehour INT)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION '/user/flume/tweets'

 

Now that the table is created, let’s insert some data in the table. First, select the table in the “Table” tab and click “Import data”. Enter the path “/user/flume/tweets/2013/02/25/17″ and “201302251″ as the key:

Depending on the partition picked, a query similar to this will be generated:

LOAD DATA INPATH '/user/flume/tweets/2013/02/25/16'
INTO TABLE `default.tweets`
PARTITION (datehour='2013022516')

 

After the query executes, the table ‘tweets’ will be available.

Beeswax can access the Hive metastore and its list of tables. A description of their schema and partitions with some example of data contained in each table are helpful while designing your queries. Moreover, a wizard can guide you step-by-step to create new tables.

 

Analysis with Beeswax

It becomes now possible to perform some SELECT queries on the data. Here is an example below but most of interesting ones are described in Parts 1 and 3 of the “Analyzing Twitter with Hadoop” series.

SELECT
    t.retweeted_screen_name,
    sum(retweets) AS total_retweets,
    count(*) AS tweet_count
  FROM (SELECT
          retweeted_status.user.screen_name as retweeted_screen_name,
              retweeted_status.text,
              max(retweet_count) as retweets
        FROM tweets
        GROUP BY retweeted_status.user.screen_name,
                 retweeted_status.text) t
  GROUP BY t.retweeted_screen_name
  ORDER BY total_retweets DESC
  LIMIT 10;

 

Beeswax possesses multiple features for providing a better user experience than the command line shell. For example you can save queries and share them with other users. The result of a query can be exported into a new table or an HDFS file or downloaded to your desktop. Some other good examples are:

  • Ajax refresh of the logs
  • Quick column navigation on the result page
  • MapReduce jobs listing with a direct access to their logs
  • ‘Email me on completion’ setting
  • Multi-database support

Example of the screen while running query:

Seeing the result of the query:

Note: if your queries are failing and you are seeing an error like below, it means that you forgot to add the ‘/user/flume/hive-serdes-1.0-SNAPSHOT.jar’ to the query:

FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask

 

Conclusion

In this post we focused on how the Beeswax application can make it easy to execute Hive queries. New features such as multi-query (HUE-159), autocomplete, and syntax highlighting (HUE-1063) are going to improve the usability even more.

The next article in this series will elaborate on this topic and describe how Hue’s Apache Oozie application can be used for scheduling Hive queries in a few clicks.

Thank you for reading and feel free to post comments here or on the hue-user list. We also hope to see you at the first Hue meetup (this Wednesday, March 27)!

Romain Rigaux is a Software Engineer on the Platform team.

17 Responses
  • Hendra Budiawan / March 26, 2013 / 4:03 AM

    Hello, great article but i have problem with the query. Everytime i execute it, i have this error
    ====================================
    Your query has the following error(s):

    OK converting to local hdfs://single.example.com:8020/user/flume/flume-sources-1.0-SNAPSHOT.jar Added /tmp/hue/hive_resources/flume-sources-1.0-SNAPSHOT.jar to class path Added resource: /tmp/hue/hive_resources/flume-sources-1.0-SNAPSHOT.jar FAILED: ParseException line 9:25 mismatched input ‘>’ expecting < near 'STRUCT' in struct type
    ====================================

    I have following your steps, but still no lucks. Can you give some suggestions to resolved this issue?

  • Romain Rigaux / March 26, 2013 / 10:19 AM

    The query was eaten up during the publication of the post. Sorry, we just fixed it!

  • Hendra Budiawan / March 26, 2013 / 7:35 PM

    Great, thank you for revition i can go through to the next step now. Thank you so much.

  • Dee / May 08, 2013 / 9:29 AM

    I am having problems getting past the first step(creating flume user). The error message I get is “Cannot make home directory for user XXXX”. This happens when I try to create a user and I check the ‘create home directory’ check box like the direction says. Please help me figure out the problem

  • Romain / May 08, 2013 / 9:36 AM

    This probably means that WebHdfs is not currently configured. Is the File Browser application working? Can you create a file from there? You should have more logs on http://{HUE_SERVER}/logs

    More information about how to configure HDFS for Hue:
    http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/4.2.0/CDH4-Installation-Guide/cdh4ig_topic_15_4.html

  • ignorant / June 24, 2013 / 11:32 PM

    Quick question: if you have more data, would this table get updated automatically?
    If so, is it safe to change -
    LOAD DATA INPATH ‘/user/flume/tweets/2013/02/25/16′
    to -
    LOAD DATA INPATH ‘/user/flume/tweets’

  • Romain Rigaux / June 25, 2013 / 2:02 PM

    This will update it but it won’t be partitioned so you will be losing some performance and the conveniency of the partitions.

  • ignorant / June 25, 2013 / 7:32 PM

    Thanks, I will try the partition approach.

    FYI, using ‘/user/flume/tweets’ did not work because when upload is pressed, it moves data from this location to something else. So it does not get updated automatically.

    I will play with oozie as suggested in the blog.

  • billou2k / August 08, 2013 / 3:52 AM

    Hive seems to have issues with tweets that have “\n” in their text. Line breaks have been enabled in Twitter back in March, and every time there is one, Hive creates a new row with NULL values in it.
    Is there an serde or table option to disable the \n being interpreted as a line break in the json data?

  • Romain Rigaux / August 08, 2013 / 8:42 AM

    The “\n” would probably need to be escaped in the Flume agent which is collecting them: https://github.com/romainr/cdh-twitter-example/blob/master/flume-sources/src/main/java/com/cloudera/flume/source/TwitterSource.java#L113

    Will see for updating the code shortly, thanks for the report!

    Another alternative would be to change the new line separator set by Flume.

  • khathu / October 24, 2013 / 5:17 AM

    Hive seems to have issues with tweets that have “\n” in their text. Line breaks have been enabled in Twitter back in March, and every time there is one, Hive creates a new row with NULL values in it.
    Is there an serde or table option to disable the \n being interpreted as a line break in the json data?
    ————————
    hi, is there any updated SerDe jar file to deal with the above problem.

    if tweets data is already in the HDFS how do you remove “\n” in their text such that i do not have a Line break problem when im reading data using Hive.

  • Romain Rigaux / October 25, 2013 / 6:17 AM

    Some people modified the parseField() function and found that replacing “\\\\n” by ” ” fixes the linebreak issue.”

    https://github.com/romainr/cdh-twitter-example/blob/master/hive-serdes/src/main/java/com/cloudera/hive/serde/JSONSerDe.java#L135

  • khathu / October 29, 2013 / 6:19 AM

    im failing to understand were exactly is “\\\\n” in the parseField() function and how to do i replace it.

    Help

    private Object parseField(Object field, TypeInfo fieldTypeInfo) {
    switch (fieldTypeInfo.getCategory()) {
    case PRIMITIVE:
    // Jackson will return the right thing in this case, so just return
    // the object
    return field;
    case LIST:
    return parseList(field, (ListTypeInfo) fieldTypeInfo);
    case MAP:
    return parseMap(field, (MapTypeInfo) fieldTypeInfo);
    case STRUCT:
    return parseStruct(field, (StructTypeInfo) fieldTypeInfo);
    case UNION:
    // Unsupported by JSON
    default:
    return null;
    }
    }

  • Prince / October 31, 2013 / 7:04 AM

    Hello,

    I am having a problem running a query: SELECT count(*) FROM tablename1;

    i get the error: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask

    NOTE: i did include file:: hive-serdes-1.0-SNAPSHOT.jar

    But when i run: SELECT * FROM tablename1; it works very well.

    The problem is: SELECT COUNT(*) FROM tablename1

  • Romain Rigaux / October 31, 2013 / 10:06 AM

    There is something misconfigured when doing some MapReduce jobs (COIUNT(*), LIMIT… will trigger MapReduce jobs whereas a simple SELECT * will just read the data).

    On the left, you should have a link to a MapReduce job. Can you click on it and look for errors in the logs via the Job Browser app?

    A common problem is missing the HDFS home directory of your user.

  • PRINCE / November 04, 2013 / 12:40 AM

    Hi ROMAIN,

    The HDFS home directory of my user is not missing, but i have found the following errors on logs. May you help me understand.

    Caused by: org.codehaus.jackson.JsonParseException: Unexpected end-of-input within/between OBJECT entries

    ————————————————————————-
    Diagnostic Log:
    ———————–

    java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing writable {“filter_level”:”medium”,”retweeted_status”:{“contributors”:null,”text”:”BON MTN CHUI BALON DORE OU PA ?!”,”geo”:null,”retweeted”:false,”in_reply_to_screen_name”:null,”truncated”:false,”lang”:”fr”,”entities”:{“symbols”:[],”urls”:[],”hashtags”:[],”user_mentions”:[]},”in_reply_to_status_id_str”:null,”id”:390192476503302144,”source”:”web”,”in_reply_to_user_id_str”:null,”favorited”:false,”in_reply_to_status_id”:null,”retweet_count”:293,”created_at”:”Tue Oct 15 19:08:30 +0000 2013″,”in_reply_to_user_id”:null,”favorite_count”:26,”id_str”:”390192476503302144″,”place”:null,”user”:{“location”:”Chez Zahia”,”default_profile”:false,”statuses_count”:12610,”profile_background_tile”:true,”lang”:”fr”,”profile_link_color”:”0004FF”,”profile_banner_url”:”https://pbs.twimg.com/profile_banners/1292691356/1374518585″,”id”:1292691356,”following”:null,”favourites_count”:288,”protected”:false,”profile_text_color”

    ———————————————————————-

Leave a comment


× seven = 28