This post was authored by Dmitry Chechik, a software engineer at TellApart, the leading Customer Data platform for large online retailers.
Apache Hadoop is widely used for log processing at scale. The ability to ingest, process, and analyze terabytes of log data has led to myriad applications and insights. As applications grow in sophistication, so does the amount and variety of the log data being produced. At TellApart, we track tens of millions of user events per day, and have built a flexible system atop HBase for storing and analyzing these types of logs offline.
A TellApart user planning a bird-watching trip may start her day searching for binoculars on Binoculars.com, continue to comparison-shop for new hiking pants on one of our other partner merchants, and be shown relevant ads to these interests throughout her experience. Her browsing activity produces a flurry of different log data: page views, transactions, ad impressions, ad clicks, real-time ad auction bid request, and many more. Dissecting this data is a common scenario – and a real challenge – faced by many log analysis applications.
Many of these scenarios share a common set of requirements:
- Data must be ingested into the system incrementally – one day or so worth of data at a time.
- Data is processed at a variety of time scales. Daily reporting often cares only about one day’s worth of data, while machine learning applications may require digesting several months worth of data to build models.
- Some events are naturally associated with others. An ad click is logged separately from an ad impression, but the two need to be processed together. Some data extraction applications need to process these associated events together, but others only care about the individual events.
- Random-access lookups to track all the actions of a user across time are often helpful. This is a powerful debugging tool to understand how the user interacts with the web at large and with the TellApart system in particular.
We’ve found that Apache HBase is uniquely well suited to many of these tasks. By organizing user events across rows and columns, and using the various built-in HBase server-side filters, we can slice the data across the different required dimensions. The first decision to make when setting up an HBase system is the schema to use. We broke down the data as follows:
|Row Key Prefix||Row Key Suffix||Ad impression||Associated Ad Click||Associated real-time auction bid|
|User ID||<timestamp>_<event_id>||Thrift Buffer||Thrift Buffer||Thrift Buffer|
This organization of data, coupled with HBase’s abilities, allows for powerful methods of accessing the data ? both for reading and writing data. Consider importing data, one day at a time, into HBase. This can be done with a simple Map-Reduce that writes data into the appropriate rows, which is about the same as a typical Hadoop implementation. However, if the data to be imported is associated with some other type of data through an event id (in our example, these are ad clicks, each of which is associated with an impression), we simply need to write the clicks into the row the associated impression is in. This can be done with a map-only Map-Reduce, so we can avoid an unnecessary reduce step to join clicks to impressions. The same basic idea of associating related events in the same row can be extended to encompass the outputs of analysis jobs. For example, if an offline analysis determines that a given impression leads to an action on an advertiser’s site, we can output data about that action into the same row. Subsequent applications don’t need to repeat this analysis to obtain that data.
Judicious use of filters and column selection lets us control which parts of the data we need to read for any specific type of operation. Consider a task where we only need to read data from the past month. Since the timestamp is embedded in the row key, we can use a Map-Reduce with an HBase RowFilter with a custom Comparator. The filter will only accept rows within our date range. An alternative here is to use the built-in timestamp associated with each HBase cell to do this kind of filtering. Doing this reduces the amount of data the mappers have to process, and significantly speeds up the jobs. Another way to slice data by rows is based simply on the prefix of the rows. In our use case, the user ID is the prefix of each row. With randomly-generated user IDs, if we want to run a Map-Reduce that analyses a sample of all users, we can do so by using a PrefixFilter, which will only accept rows with a given prefix.
We can also use columns to select the set of data we’d like to read. For some tasks, we need to read both the impressions and the bid requests associated with each impression. For others, we only need to read impressions. By selecting the columns to read, we can limit the input set of data. Going further, we can even limit the data based on the values of some columns. TellApart serves impressions that show ads for various advertisers. By storing the advertiser id in a separate “advertiser_id” column, and using a SingleValueColumnFilter, we can retrieve only rows for a particular advertiser if that’s all we care about.
Aside from Map-Reduce applications, HBase is also great for random lookups of data. We often need to see all of a particular user’s log actions over time, for debugging our system. This is where HBase really shines – we issue a Scan for the range of keys starting with a given user ID. This gives us instant visibility into the data, and we can use the same column-based filters to limit the data further if necessary. We issue these lookups in the HBase Shell, and to make this easy, we’ve modified the HBase Shell to deserialize and format the data we store in HBase to make it readable for us.
We’ve also made it easier for HBase to play nice with Cascading and Thrift. To use HBase with Cascading, we built on top of the existing Cascading.HBase support, adding support for any Hadoop Serialization to the module. This lets us read and write Thrift objects (or any other Hadoop Writable) from HBase in our Cascading jobs, much in the same way we would for any Hadoop job. The code is available at TellApart-Hadoop-Utils.
HBase’s flexibility has been a boon to our offline analysis. Separating the data by rows and columns affords us tremendous flexibility in our data access patterns, and gives us a strong platform to grow on in the future.
Again, this post was authored by Dmitry Chechik, a software engineer at TellApart, the leading Customer Data platform for large online retailers. Thanks to Mark Ayzenshtat, TellApart CTO and co-founder, for reading drafts of this post. TellApart is actively hiring great software engineers. Click here to learn more about the company and their open positions.