How-to: Read FIX Messages Using Apache Hive and Impala

Categories: Hadoop Hive How-to Impala

Learn how to read FIX message files directly with Hive, create a view to simplify user queries, and use a flattened Apache Parquet table to enable fast user queries with Impala.

The Financial Information eXchange (FIX) protocol is used widely by the financial services industry to communicate various trading-related activities. Each FIX message is a record that represents an action by a financial party, such as a new order or an execution report. As the raw point of truth for much of a financial firm's trading activity, FIX messages are a natural data source for analytics and reporting in Apache Hadoop.

Apache Hive, the versatile SQL engine that runs on Hadoop, can read FIX messages without any code or pre-processing. Thus users can easily load FIX message files into Hadoop and immediately start executing queries. Furthermore, with a simple data set conversion, users can also run their queries against Impala, the open source analytic database for Hadoop. Cloudera’s open platform lets you do both types of queries inside the Hue GUI so that a command-line session is not required. This blog post will show you how.

If you do not already have a Hadoop cluster but would like to run these steps, the Cloudera QuickStart VM allows you to run a pre-made single-node Hadoop cluster on your own computer. Or, use Cloudera Live to access a full Cloudera Enterprise cluster for a free two-week period.

Loading Data and Creating Tables

First, let’s review the structure of a FIX message.

FIX messages are formatted as a set of key-value pairs on a single line. Each key-value pair contains an attribute (known as a tag) of the message, such as the message type or order identifier. The key-value pairs are separated by the ASCII 001 “start of heading” character (commonly visualized as “^A”), and the keys and values are separated by the equal-sign character (=).

For example, a FIX 4.2 new order message might look like this (the tag values, including the body length and checksum, are made up for illustration):
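
    8=FIX.4.2^A9=145^A35=D^A34=4^A49=BUYSIDE01^A52=20150406-12:17:27^A56=SELLSIDE01^A11=ORD10001^A21=2^A38=100^A40=2^A44=125.50^A54=1^A55=AAPL^A60=20150406-12:17:27^A10=127^A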

Armed with one or more files containing FIX messages, one per line, you can push them into Hadoop by uploading them into an HDFS directory using the Hue File Browser.
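
If you prefer the command line to the File Browser, the hdfs CLI does the same job; the local file name and HDFS path here are only examples:

    hdfs dfs -mkdir -p /user/jeremy/fix
    hdfs dfs -put fix_messages.txt /user/jeremy/fix/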

You can create a Hive table over the top of the FIX message files using a Hive CREATE TABLE statement.

Hive is flexible enough to express each record as a collection of key-value pairs using the MAP data type. Thus, you can simply reference the key for a tag of the message and Hive will return the value. The key in the FIX message format is always an integer (INT), but the values can have varying data types, so you use strings (STRING) to capture any value; the single column is therefore typed MAP<INT,STRING>.

Delimiters are specified using the TERMINATED BY clauses. You express the collection item delimiter as ASCII 001 using '1', and the map key delimiter as the equals sign using '='. (Hive interprets a delimiter string that parses as a number as the character with that code, so '1' means ASCII 001 rather than the digit 1.) The default field delimiter is also ASCII 001, so you change it to the placeholder ASCII 002, written as '2', to avoid ambiguity. You are only defining one field in each FIX record, the single MAP collection of key-value pairs, so the field delimiter will never actually appear in the files.

An external Hive table is used, with the LOCATION clause mapping the table to the HDFS directory of the uploaded FIX files. Dropping an external table does not delete the underlying data.
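
Putting the pieces together, the statement looks like this; the table name and HDFS location are examples:

    CREATE EXTERNAL TABLE fix_map
    (tag MAP<INT,STRING>)
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '2'
    COLLECTION ITEMS TERMINATED BY '1'
    MAP KEYS TERMINATED BY '='
    LOCATION '/user/jeremy/fix/';

Note that Hive expects the three delimiter clauses in this order (FIELDS, then COLLECTION ITEMS, then MAP KEYS); listing them in a different order can produce a parse error.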

This statement can be run in the Hue Hive Editor after replacing the location with the HDFS path of the uploaded FIX files.

Queries

With the data loaded and the table created, you can start to query the FIX records. Use the single ‘tag’ column to reference the various tags of the messages.
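
For instance, a query along these lines pulls out the message type (tag 35), symbol (tag 55), and order quantity (tag 38) of each record; the tag numbers are from the standard FIX dictionary, and the table name matches the one created above:

    SELECT tag[35], tag[55], tag[38]
    FROM fix_map
    LIMIT 10;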

This feature is powerful because you can reference any tag identifier without the need to pre-define it. The downside, however, is that you need to recall tag numbers to query the data.

You can make querying easier by first creating a Hive view over the table that renames the tag references, and then reading from the view instead of the table. This approach may also be a good opportunity to cast specific tags to specific types so that users don’t have to do that themselves.

(Find a pre-made view of all known tags here. Use them all, or edit the list to include only the desired tags. Tags referenced but not found in the data will return as NULL.)
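
A minimal sketch of such a view over the fix_map table; the column names and casts here are arbitrary choices, and the pre-made view mentioned above covers far more tags:

    CREATE VIEW fix_view AS
    SELECT
      tag[35] AS msg_type,
      tag[11] AS order_id,
      tag[55] AS symbol,
      CAST(tag[38] AS BIGINT) AS order_qty,
      CAST(tag[44] AS DOUBLE) AS price
    FROM fix_map;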

Optimizing with Impala

At this point, you’re in a good spot: You can query the raw FIX message files and reference tags using sensible field names. You could stop there and functionally have all that you need, but if you want to go a bit further, you can improve query performance by converting the data into a more read-optimized format.

This process involves two transformations:

  • Converting the data from records of collections of key-value pairs to records of plain columns. This allows you to use Impala instead of Hive for user queries; Impala is much faster, but it does not support the MAP data type. The conversion also speeds up Hive queries by deserializing the MAP key-value pairs in advance.
  • Converting the data from plain text files into Apache Parquet files. Parquet is a column-oriented binary format that considerably improves query performance over delimited text files.

You can use Hive to do both of these tasks in a single CREATE TABLE AS SELECT query, and most of the work has already been done by the view created earlier. As a sketch, assuming the view is named fix_view as above, and naming the new table fix to match the Impala step that follows:
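
    CREATE TABLE fix
    STORED AS PARQUET
    AS SELECT * FROM fix_view;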

Switching to the Hue Impala Editor, let Impala know about the new table by running INVALIDATE METADATA fix; and then query it as you did with the view. However, as “symbol” is a reserved word in Impala, you need to wrap it in back-ticks. For example, using the column names from the view sketch above:
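
    INVALIDATE METADATA fix;

    SELECT `symbol`, COUNT(*) AS message_count
    FROM fix
    GROUP BY `symbol`;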

Conclusion

In this post, you’ve learned how easy it is to use Hadoop to query FIX message datasets with SQL, and that Hive is a good choice for preparing and optimizing data while Impala is a good choice for running analytic queries. This same pattern can be applied to a wide variety of other datasets, which can all be integrated in Hadoop to discover new insights about your enterprise. Happy querying!

Jeremy Beard is a Senior Solutions Architect at Cloudera.


10 responses on “How-to: Read FIX Messages Using Apache Hive and Impala”

  1. Kiril

    Hi,

Can you point out a sample FIX dataset to try this demo in our test env?

    Thanks.

  2. Jeremy Beard

    Hi Jlroo,

Thanks for pointing out a better FIX data source.

I believe you may be using the Hue Metastore Manager to try to create the table? If so, I don’t believe that will work for the FIX format because it is not a flat CSV-like structure. Instead you can use the Hue Hive Editor to submit the CREATE TABLE statement. The Hue Hive Editor is in Hue under Query Editors and then Hive. If that doesn’t work, we are happy to help further on community.cloudera.com!

    Jeremy

  3. Trevor

    Hi
    Is there a good way to create the external HIVE table where the FIX message has repeating groups?

    Example: With something like the following [simplified] FIX message, tags 311 and 307 are within a repeating group
    8=FIX.4.2^A9=145^A35=d^A34=4^A49=ABC_DEFG01^A52=20090323-15:40:29^A56=CCG^A146=3^A311=111111^A307=Sec Def 1^A311=222222^A307=Sec Def 2^A311=333333^A307=Sec Def 3^A10=139^A

    So the difficulty here is that creating the hive table as suggested and querying tags 311 and 307 will only result in the first in the group being returned rather than all values in the group

    Thanks for your help

  4. pooja

    Hi Gurus,
I tried the above example in my local lab. When I run the below create statement I get an error:

    hive>
    > CREATE EXTERNAL TABLE fix_map
    > (tag MAP<INT,STRING>)
    > ROW FORMAT DELIMITED
    > COLLECTION ITEMS TERMINATED BY '1'
    > FIELDS TERMINATED BY '2'
    > MAP KEYS TERMINATED BY '='
    > LOCATION '/user/pooja/fix/';
    NoViableAltException(105@[1704:103: ( tableRowFormatMapKeysIdentifier )?])
    at org.antlr.runtime.DFA.noViableAlt(DFA.java:158)
    at org.antlr.runtime.DFA.predict(DFA.java:116)
    at org.apache.hadoop.hive.ql.parse.HiveParser.rowFormatDelimited(HiveParser.java:30427)
    at org.apache.hadoop.hive.ql.parse.HiveParser.tableRowFormat(HiveParser.java:30662)
    at org.apache.hadoop.hive.ql.parse.HiveParser.createTableStatement(HiveParser.java:4683)
    at org.apache.hadoop.hive.ql.parse.HiveParser.ddlStatement(HiveParser.java:2144)
    at org.apache.hadoop.hive.ql.parse.HiveParser.execStatement(HiveParser.java:1398)
    at org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:1036)
    at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:199)
    at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:166)
    at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:404)
    at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:322)
    at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:975)
    at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1040)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:911)
    at org.apache.hadoop.hive.ql.Driver.run(Driver.java:901)
    at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:268)
    at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:220)
    at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:423)
    at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:792)
    at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:686)
    at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:625)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
    FAILED: ParseException line 5:0 cannot recognize input near 'FIELDS' 'TERMINATED' 'BY' in serde properties specification
    hive>

    Also I tried the below statement; the table is created but I cannot see any data, it comes back just NULL:

    > CREATE EXTERNAL TABLE fix_map
    > (tag MAP<INT,STRING>)
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY '2'
    > COLLECTION ITEMS TERMINATED BY '1'
    > MAP KEYS TERMINATED BY '='
    > LOCATION '/pooja/fixmsg/FIX_Message/';
    OK
    Time taken: 0.098 seconds
    hive>

    Could you please help resolving the issue.

    Regards,
    Pooja

    1. pooja

      In my above comment where I was able to create the table successfully, I cannot see data; it shows NULL values only:

      hive> select TAG[38] from fix_map1;
      Total jobs = 1
      Launching Job 1 out of 1
      Number of reduce tasks is set to 0 since there’s no reduce operator
      Starting Job = job_1480456609737_0007, Tracking URL = http://localhost:8088/proxy/application_1480456609737_0007/
      Kill Command = /usr/lib/hadoop-2.2.0/bin/hadoop job -kill job_1480456609737_0007
      Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
      2016-11-30 04:57:10,549 Stage-1 map = 0%, reduce = 0%
      2016-11-30 04:57:19,753 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.87 sec
      MapReduce Total cumulative CPU time: 1 seconds 870 msec
      Ended Job = job_1480456609737_0007
      MapReduce Jobs Launched:
      Job 0: Map: 1 Cumulative CPU: 1.87 sec HDFS Read: 7009 HDFS Write: 78 SUCCESS
      Total MapReduce CPU Time Spent: 1 seconds 870 msec
      OK
      NULL
      NULL
      NULL
      NULL
      NULL
      NULL
      NULL
      NULL
      NULL
      NULL
      NULL
      NULL
      NULL
      NULL
      NULL
      NULL
      NULL
      NULL
      NULL
      NULL
      NULL
      NULL
      NULL
      NULL
      NULL
      NULL
      Time taken: 22.181 seconds, Fetched: 26 row(s)
      hive>

      Could you please help me out here?
      Regards,
      Pooja

  5. Trevor

    Ah – my comment didn’t show the ‘left angle bracket’ INT, STRING ‘right angle bracket’, so perhaps it was stripped in your snippet as well.