Analyzing Twitter Data with Apache Hadoop, Part 3: Querying Semi-structured Data with Apache Hive

Categories: CDH Hadoop Hive How-to Use Case

This is the third article in a series about analyzing Twitter data using some of the components of the Apache Hadoop ecosystem that are available in CDH (Cloudera’s open-source distribution of Apache Hadoop and related projects). If you’re looking for an introduction to the application and a high-level view, check out the first article in the series.

In the previous article in this series, we saw how Flume can be used to ingest data into Hadoop. However, that data is of little use without some way to analyze it. Personally, I come from the relational world, and SQL is a language I speak fluently. Apache Hive provides an interface that allows users to easily access data in Hadoop via SQL. Hive compiles SQL statements into MapReduce jobs, which it then executes across a Hadoop cluster.

In this article, we’ll learn more about Hive, its strengths and weaknesses, and why Hive is the right choice for analyzing tweets in this application.

Characterizing Data

One of the first questions to ask when deciding on the right tool for the job is: “what does my data look like?” If your data has a very strict schema, and it doesn’t deviate from that schema, maybe you should just be using a relational database. MySQL is just as free as Hive, and very effective for dealing with well-structured data. However, as you start to try to analyze data with less structure or with extremely high volume, systems like MySQL become less useful, and it may become necessary to move out of the relational world.

Unstructured, semi-structured, and poly-structured are all terms for data that doesn’t fit well into the relational model. This is data like JSON, XML, RDF, or other sorts of data with a schema that may vary from record to record. What do we do with this data? Here’s where Hive shines. Hive is extremely effective for dealing with data that doesn’t quite fit into the relational bucket, because it can process complex, nested types natively. Hive avoids the need for complicated transformations that might be otherwise necessary to handle this sort of data in a traditional relational system. Hive can also gracefully handle records that don’t strictly conform to a table’s schema. For example, if some columns are missing from a particular record, Hive can deal with the record by treating missing columns as NULLs.

In the Twitter analysis example, we loaded raw tweets into HDFS. Using the Twitter Streaming API, tweets are represented as JSON blobs.
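A trimmed, illustrative example of such a blob (the field values here are invented for illustration; real tweets carry many more fields) might look like this:

```json
{
   "retweeted_status": {
      "text": "#Crowdsourcing - drivers already generate traffic data for your smartphone. #bigdata",
      "user": {
         "screen_name": "bigdata_example",
         "name": "Example User"
      }
   },
   "entities": {
      "urls": [],
      "user_mentions": [
         {
            "screen_name": "bigdata_example",
            "name": "Example User"
         }
      ],
      "hashtags": [
         {"text": "Crowdsourcing"},
         {"text": "bigdata"}
      ]
   },
   "text": "RT @bigdata_example: #Crowdsourcing - drivers already generate traffic data for your smartphone. #bigdata",
   "user": {
      "screen_name": "another_user",
      "name": "Another User",
      "followers_count": 1024,
      "time_zone": "Eastern Time (US & Canada)"
   }
}
```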

It’s fairly easy to see that there is a good bit of complexity to this data structure. Since JSON can contain nested data structures, it becomes very hard to force JSON data into a standard relational schema. Processing JSON data in a relational database would likely require significant transformation, making the job much more cumbersome.

Looking at this particular bit of JSON, there are some very interesting fields: At the very top, there is a retweeted_status object, whose existence indicates that this tweet was retweeted by another user. If the tweet was not a retweet, you would not have a retweeted_status object at all. Tweets also contain an entities element, which is a nested structure. It contains three arrays, the elements of which are all nested structures in their own right, as can be seen in the hashtags array, which has two entries. How do you deal with a record like this in Hive?

Complex Data Structures

Hive has native support for a set of data structures that normally would either not exist in a relational database, or would require definition of custom types. There are all the usual players: integers, strings, floats, and the like, but the interesting ones are the more exotic maps, arrays, and structs. Maps and arrays work in a fairly intuitive way, similar to how they work in many scripting languages:
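For example, given a hypothetical table with an ARRAY column and a MAP column, individual elements are addressed with bracket syntax:

```sql
-- Hypothetical columns, for illustration only:
--   friends ARRAY<STRING>
--   properties MAP<STRING, INT>
SELECT
  friends[0],          -- first element of the array (arrays are zero-indexed)
  properties['age']    -- value associated with the key 'age'
FROM some_table;
```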

Structs are a little more complicated, since they are arbitrary structures, and a struct field can be queried much like an instance variable in a Java class:
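For example, given a hypothetical column declared as address STRUCT&lt;city:STRING, zip:STRING&gt;, the individual fields are accessed with dot notation:

```sql
-- Hypothetical column: address STRUCT<city:STRING, zip:STRING>
SELECT address.city, address.zip
FROM some_table;
```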

To store the data for a tweet, arrays and structs will be crucial.

A Table for Tweets

Here is the table that was designed to store tweets, with some columns omitted:
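A sketch of such a table definition, reconstructed from the fields discussed in this post (the exact column list, partition column, and storage location are assumptions):

```sql
CREATE EXTERNAL TABLE tweets (
  id BIGINT,
  created_at STRING,
  text STRING,
  retweeted_status STRUCT<
    text:STRING,
    user:STRUCT<screen_name:STRING, name:STRING>>,
  entities STRUCT<
    urls:ARRAY<STRUCT<expanded_url:STRING>>,
    user_mentions:ARRAY<STRUCT<screen_name:STRING, name:STRING>>,
    hashtags:ARRAY<STRUCT<text:STRING>>>,
  user STRUCT<
    screen_name:STRING,
    name:STRING,
    followers_count:INT,
    time_zone:STRING>
)
PARTITIONED BY (datehour INT)               -- assumed partition column
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION '/user/flume/tweets';              -- assumed HDFS path written by Flume
```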

By comparing the JSON from a tweet with the columns in the table, we can see how the JSON objects map to Hive columns. Looking at the entities column, we can see what a particularly complex column might look like:
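Based on the nested structures described above, the entities column could be declared along these lines (a sketch, not necessarily the exact original definition):

```sql
entities STRUCT<
  urls:ARRAY<STRUCT<expanded_url:STRING>>,
  user_mentions:ARRAY<STRUCT<screen_name:STRING, name:STRING>>,
  hashtags:ARRAY<STRUCT<text:STRING>>>
```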

entities is a struct which contains three arrays, and each individual array stores elements which are also structs. If we wanted to query the screen names of the first mentioned user from each tweet, we could write a query like this:
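Assuming a tweets table whose entities struct contains a user_mentions array of structs, such a query would look like:

```sql
-- First mentioned user's screen name from each tweet
SELECT entities.user_mentions[0].screen_name FROM tweets;
```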

If the user_mentions array is empty, Hive will just return NULL for that record.

The PARTITIONED BY clause utilizes a feature of Hive called partitioning, which allows tables to be split up in different directories. By building queries that involve the partitioning column, Hive can determine that certain directories cannot possibly contain results for a query. Partitioning allows Hive to skip the processing of entire directories at query time, which can improve query performance dramatically.
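For example, if the tweets table is partitioned by a column such as datehour (an assumed partition column name), a query that filters on it lets Hive prune every other partition's directory:

```sql
-- Only the directory for datehour=2013011601 is scanned;
-- all other partition directories are skipped entirely.
SELECT COUNT(*) FROM tweets WHERE datehour = 2013011601;
```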

The LOCATION clause is a requirement when using EXTERNAL tables. By default, data for tables is stored in a directory located under /user/hive/warehouse/. However, EXTERNAL tables can specify an alternate location where the table data resides, which works nicely if Flume is being used to place data in a predetermined location. EXTERNAL tables also differ from regular Hive tables in that the table data will not be removed if the EXTERNAL table is dropped.

The ROW FORMAT clause is the most important one for this table. In simple datasets, the format will likely be DELIMITED, and we can specify the characters that terminate fields and records, if the defaults are not appropriate. However, for the tweets table, we’ve specified a SERDE.
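For contrast, a simple delimited table might be declared like this (a generic illustration, not part of the tweets schema):

```sql
CREATE TABLE simple_table (
  id INT,
  name STRING
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  LINES TERMINATED BY '\n';
```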

Serializers and Deserializers

In Hive, SerDe is an abbreviation for Serializer and Deserializer, and is an interface used by Hive to determine how to process a record. Serializers and Deserializers operate in opposite ways. The Deserializer interface takes a string or binary representation of a record, and translates it into a Java object that Hive can manipulate. The Serializer, on the other hand, will take a Java object that Hive has been working with, and turn it into something that Hive can write to HDFS. Commonly, Deserializers are used at query time to execute SELECT statements, and Serializers are used when writing data, such as through an INSERT-SELECT statement. In the Twitter analysis example, we wrote a JSONSerDe, which can be used to transform a JSON record into something that Hive can process.
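Before a table that uses a custom SerDe can be queried, the jar containing the SerDe class must be on Hive's classpath; for example (the jar name here is an assumption — use whatever jar the JSONSerDe was built into):

```sql
ADD JAR hive-serdes-1.0.jar;
```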

Putting It All Together

By utilizing the SerDe interface, we can instruct Hive to interpret data according to its inherent structure (or lack thereof). Since a SerDe is just a property of a Hive table, rather than of the data itself, we can also swap out SerDes as our data evolves. That flexibility allows us to choose the right tools for the job at hand, and to interpret data in different ways. It makes Hive a spectacular choice for getting quick access to data of all types.
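Swapping a table's SerDe is a metadata-only operation; for example:

```sql
-- Point the existing table at a different (hypothetical) SerDe implementation;
-- the underlying data in HDFS is untouched.
ALTER TABLE tweets SET SERDE 'com.example.hive.serde.AnotherJSONSerDe';
```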

In the first post in this series, we saw how we could use Hive to find influential users. Let’s look at some other queries we might want to write.

Geographic distributions of users can be interesting to look at. Unfortunately, the data I got from Twitter does not contain much of the geographic information necessary to plot really precise locations for users, but we can use time zones to get a sense of where in the world the users are. We can ask a question like, “Which time zones are the most active per day?”:
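A sketch of such a query, assuming the tweets table stores the raw created_at string and a user struct with a time_zone field:

```sql
SELECT
  user.time_zone,
  SUBSTR(created_at, 0, 3) AS tweet_day,  -- day-of-week prefix, e.g. 'Mon'
  COUNT(*) AS total_count
FROM tweets
WHERE user.time_zone IS NOT NULL
GROUP BY user.time_zone, SUBSTR(created_at, 0, 3)
ORDER BY total_count DESC
LIMIT 15;
```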

Interestingly, more users are tweeting about the selected terms on the east coast than on the west coast. Europe also seems to be pretty interested in big data.

We can also formulate more complex queries to ask questions like “Which were the most common hashtags?”:
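Such a query needs to flatten the nested hashtags array into one output row per hashtag before grouping; a sketch, assuming the entities column described earlier:

```sql
SELECT
  LOWER(hashtags.text) AS hashtag,
  COUNT(*) AS total_count
FROM tweets
LATERAL VIEW EXPLODE(entities.hashtags) t1 AS hashtags
GROUP BY LOWER(hashtags.text)
ORDER BY total_count DESC
LIMIT 15;
```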

Not surprisingly, several of the terms that I searched for when I was collecting data show up here. The first term that shows up, which I didn’t search for, is job, followed by jobs. Cloudera’s hiring, by the way. You may also notice the use of some non-standard SQL constructs, like LATERAL VIEW and EXPLODE. Lateral views are used when using functions like EXPLODE, which may generate more than one row of output for each row of input.

One Thing to Watch Out For…

If it looks like a duck and sounds like a duck, it must be a duck, right? Users who are new to Hive should not mistake it for a relational database. Hive looks a lot like a database, and you can interact with it much like one, but it should not be treated as such. Any query run in Hive is actually executed as a sequence of MapReduce jobs, which brings with it all of the performance implications of having to start up multiple JVMs. This means every query pays a fixed setup cost, which matters little for long-running batch workloads but results in poor performance for lightweight queries. Like MapReduce, Hive shines brightest when working with massive data sets. However, it is important to realize that Hive queries may not have a response time that can be considered interactive, and Hive will likely not serve as a replacement for a traditional analytical database.


In this article we’ve discussed some of the benefits and trade-offs of using Hive, and seen how to build a SerDe to process JSON data, without any preparation of the data. By using the powerful SerDe interface with Hive, we can process data that has a looser structure than would be possible in a relational database. This enables us to query and analyze traditional structured data, as well as semi- and even unstructured data.

Jon Natkins (@nattybnatkins) is a Software Engineer at Cloudera, where he has worked on Cloudera Manager and Hue, and has contributed to a variety of projects in the Apache Hadoop ecosystem. Prior to Cloudera, Jon wrangled databases at Vertica. He holds an Sc.B in Computer Science from Brown University.


4 responses on “Analyzing Twitter Data with Apache Hadoop, Part 3: Querying Semi-structured Data with Apache Hive”

  1. Douglas Moore

    I think the definition of a hashtag structure was left out the table definition above. Here is our table, and hopefully, the angle brackets won’t be munged by the blogging software:

    entities STRUCT<
    ROW FORMAT SERDE ‘com.cloudera.hive.serde.JSONSerDe’
    LOCATION ‘….’;

  2. SunitaKoppar

    I could get the table loaded using the SerDe. However, the parser is not working as expected, or maybe my JSON needs more formatting. Please suggest the best approach.

    I have a 36 KB data file with at least 20 records. But when I do this:
    hive>select, position.title, descriptionSnippet from jobs;

    I get just this:
    Total MapReduce CPU Time Spent: 920 msec
    Management Science Associates Data Architect null
    Time taken: 10.016 seconds