How-to: Use a SerDe in Apache Hive

Apache Hive is a fantastic tool for performing SQL-style queries across data that is often not appropriate for a relational database. For example, semistructured and unstructured data can be queried gracefully via Hive, due to two core features: The first is Hive’s support of complex data types, such as structs, arrays, and unions, in addition to many of the common data types found in most relational databases. The second feature is the SerDe.

What is a SerDe?

The SerDe interface allows you to instruct Hive as to how a record should be processed. A SerDe is a combination of a Serializer and a Deserializer (hence, Ser-De). The Deserializer interface takes a string or binary representation of a record, and translates it into a Java object that Hive can manipulate. The Serializer, however, will take a Java object that Hive has been working with, and turn it into something that Hive can write to HDFS or another supported system. Commonly, Deserializers are used at query time to execute SELECT statements, and Serializers are used when writing data, such as through an INSERT-SELECT statement.

In this article, we will examine a SerDe for processing JSON data, which can be used to transform a JSON record into something that Hive can process.

Developing a SerDe

To start, we can write a basic template for a SerDe, which utilizes the Hive serde2 API (org.apache.hadoop.hive.serde2). This API should be used in favor of the older serde API, which has been deprecated:

Breaking this down a bit, the initialize() method is called only once and gathers some commonly-used pieces of information from the table properties, such as the column names and types. Using the type info of the row, you can instantiate an ObjectInspector for the row (ObjectInspectors are Hive objects that are used to describe and examine complex type hierarchies.) The two other important methods are serialize() and deserialize(), which do the namesake work of the SerDe.

In a SerDe, the serialize() method takes a Java object representing a row of data, and converts that object into a serialized representation of the row. The serialized class is determined by the return type of getSerializedClass(). In the JSONSerDe, the serialize() method converts the object into a JSON string represented by a Text object. To do the serialization from Java into JSON, I’ve opted to use the Jackson JSON library, which allows me to convert a Java object to a JSON string with just a small amount of code:

Jackson understands how to convert basic Java objects like Maps, Lists, and primitives into JSON strings. However, the Java object that is passed into the serialize() method is an internal Hive representation of a row, which Jackson can’t work with. The goal here is to use the ObjectInspector to interpret the Hive object, and convert it into a more basic Java representation.

In the JSONSerDe code, this process is broken up into a number of methods. The control flow is fairly simple, so let’s just examine some of the interesting pieces:

The deparseObject()method is nothing more than a fork in the road. ObjectInspectors have a category, which will identify the underlying subtype of the inspector.

In the deparseList() method, your goal is to translate a Hive list field into a Java array. In order to do this properly, you need to also deparse each of the list elements. Fortunately, you can obtain an ObjectInspector specifically for the list elements from a ListObjectInspector. You can follow this same pattern with all the other Hive data types to fully translate the object, and then let Jackson do the work of writing out a JSON object.

The opposite of serialize() is deserialize(). The deserialize() method takes a JSON string, and converts it into a Java object that Hive can process. Again, you can use Jackson to do most of the heavy lifting. Jackson will convert a JSON record into a Java Map with just a couple lines of code:

When deserializing, you need information from Hive about what type of data each field contains. You can use the TypeInfo API similarly to how we used ObjectInspectors while serializing. Looking again at handling a list type:

Also like ObjectInspectors, TypeInfos have subtypes. For a Hive list field, the TypeInfo is actually a ListTypeInfo, which we can use to also determine the type of the list elements. You can parse each list element one-by-one, and return the necessary array.

Using the SerDe

Tables can be configured to process data using a SerDe by specifying the SerDe to use at table creation time, or through the use of an ALTER TABLE statement. For example:

The bolded section of the above CREATE TABLE statement shows how a table is configured to use a SerDe. If the SerDe is not on the Hive classpath, it must be added at runtime using the ADD JARcommand. It should be noted that one limitation of the JSONSerDe is that the field names must match the JSON field names. JSON fields that are not present in the table will be ignored, and records that don’t have certain fields will return NULLs for any missing fields. As an example, the raw data that the above fields refer to looks like this:

Once the table is set up, querying complex data is as simple as SQL:

If you’re interested in getting the code to use the JSONSerDe in a real environment, see the CDH Twitter Example on the Cloudera Github.

Conclusion

The SerDe interface is extremely powerful for dealing with data with a complex schema. By utilizing SerDes, any dataset can be made queryable through Hive. For a full use case involving the JSONSerDe discussed in this article, see Analyzing Twitter with Apache Hadoop.

Jon Natkins is a Software Engineer at Cloudera, working on the Enterprise team.

Filed under:

1 Response
  • BaronTsai / December 04, 2013 / 3:39 AM

    i’m appreciate it that your blog is of great help. And i have a question for u: how can i get the value of “t0002″ when “t0000″=”ios”. Im looking forward to your earlist reply. Thanks.
    (Array<Map>)
    {
    “items”: [
    {
    "t0000": "ios",
    "t0001": "1.2.2",
    "t0002": "us",
    "t0003": true
    },
    {
    "t0000": "android",
    "t0001": "2.2.1",
    "t0002": "uk",
    "t0003": true
    },
    {
    "t0000": "symbian",
    "t0001": "3.3.3",
    "t0002": "ca",
    "t0003": false
    }
    ]
    }

Leave a comment


3 × = fifteen