This post was written by Daniel Jackoway following his internship at Cloudera during the summer of 2011.
When I started my internship at Cloudera, I knew almost nothing about systems programming or Apache Hadoop, so I had no idea what to expect. The most important lesson I learned is that structured data is great as long as it is perfect, with the addendum that it is rarely perfect.
My project was to develop a unified view of our customer data. The requirements were simple: pull in data from a variety of systems, group it by customer, and display it. The goal is that when someone at Cloudera needs to see all of the key information about our customers, it is available in one place. In addition, downloading and grouping data will make performing analysis much easier, allowing us to draw new insights about our business and our customers.
I started by writing a script for each data source to download the necessary data and insert it into an HBase table in raw form. Next, I wrote a script for each data source that grouped the data by customer, possibly transformed the data (filtering, sorting, inserting child objects within their parents, etc.), and inserted it into a separate HBase table where each row corresponds to a single customer, with a column family for each data source. Finally, I exposed the data on an internal website using Django, integrating the different data sources as much as possible.
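As a rough illustration of the second step, here is a minimal sketch of a per-source transform. It uses plain Python dictionaries as a stand-in for the HBase tables, and the field names (`id`, `customer`) and source names are hypothetical, not the ones the real scripts used.

```python
def transform(raw_rows, source, customer_table):
    """Group raw rows from one data source into per-customer rows.

    customer_table maps customer -> {source name -> {record id -> record}},
    loosely mimicking one row per customer with a column family per source.
    The "id" and "customer" field names are assumptions for illustration.
    """
    for row in raw_rows:
        family = customer_table.setdefault(row["customer"], {}).setdefault(source, {})
        family[row["id"]] = row
    return customer_table


# Two sources feeding the same customer row.
table = {}
transform([{"id": "t1", "customer": "Acme", "subject": "Cluster down"}], "tickets", table)
transform([{"id": "o1", "customer": "Acme", "amount": 5}], "opportunities", table)
print(sorted(table["Acme"]))
```

Keeping each source in its own family means one source can be reloaded without touching the others.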
One of the big challenges was turning the raw data into meaningful information. There were several discrepancies I needed to address with the data. As an example, companies have different names in different systems. Sometimes the difference is simply a matter of capitalization and/or spacing, but a greater challenge is that sometimes abbreviations were used in one system but not another, or one system ended the name with “, inc.” but another did not. I considered using fuzzy matching to solve all of these problems and realized that the Levenshtein distance between “Cloudera” and “Cloudera, inc.” is quite high (six edits against an eight-character name), so I started looking at other forms of fuzzy matching and thinking of developing one that in particular favored long identical substrings. For example, I wanted my algorithm to see “Cloudera” and “Cloudera, inc.” as being highly similar for sharing the whole “Cloudera” part. As I contemplated embarking on a task to which I could have easily devoted the whole summer, I realized that I was heading down a rabbit hole. I took a step back and determined that trying to solve this problem in a fully automated way was not worth my time. It would have been time-consuming, and it would have only made my problems worse; I still would have had to deal with names that should be merged but weren’t (since no scheme could perfectly determine if two names represent the same customer), but I also would have had to worry about names that shouldn’t have been merged but were. Why would I devote time to building a complex matching algorithm that doubled the number of problems I had to deal with?
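To make the comparison concrete, here is a small sketch of the two measures discussed above: plain Levenshtein distance and a score based on the longest shared substring. The `substring_similarity` function is only an illustration of the kind of measure I was considering; it is not something I ended up building, and its scoring formula is an assumption.

```python
from difflib import SequenceMatcher


def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def substring_similarity(a, b):
    """Length of the longest common substring over the shorter name's length.

    Hypothetical measure: 1.0 means the shorter name appears whole in the other.
    """
    if not a or not b:
        return 0.0
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return match.size / min(len(a), len(b))


print(levenshtein("Cloudera", "Cloudera, inc."))           # 6
print(substring_similarity("Cloudera", "Cloudera, inc."))  # 1.0
```

The second score treats the two names as a perfect match, which is exactly the behavior that would also merge unrelated customers whose names happen to share a long prefix.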
Instead, I created an alias table in HBase. The key is the customer name, with white-space removed and letters lower-cased to catch the easiest cases. One of the columns contains a UUID that is used as the key for that customer throughout the rest of the system. When my transform scripts move data from the raw table to the table where each row is a customer, they use the alias table to determine into which row to insert the transformed data. When my code merges two customers, it merges the current data and makes all alias entries that were pointing to either row point to the newly merged row, so that when the scripts next load new data into the table, they put it directly into the correct location. This approach does require manual intervention (in practice, all schemes were going to), but at least it was simple. This was an important lesson from my internship; I learned that some problems aren’t worth solving.
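The alias scheme can be sketched in a few lines. This version keeps the aliases in an in-memory dictionary rather than HBase, and the class and method names are hypothetical. It also assumes that a merge creates a fresh UUID for the merged row, which is one possible reading of the description above.

```python
import uuid


def normalize(name):
    """Strip all whitespace and lower-case the name to catch the easy cases."""
    return "".join(name.split()).lower()


class AliasTable:
    """In-memory stand-in for the alias table: normalized name -> customer UUID."""

    def __init__(self):
        self.aliases = {}

    def resolve(self, name):
        """Return the customer UUID for a name, creating one for unseen names."""
        key = normalize(name)
        if key not in self.aliases:
            self.aliases[key] = str(uuid.uuid4())
        return self.aliases[key]

    def merge(self, id_a, id_b):
        """Repoint every alias of either customer at a new merged UUID.

        Assumption: the merged row gets a new key. Merging the customer
        data itself is out of scope for this sketch.
        """
        merged = str(uuid.uuid4())
        for key, customer_id in self.aliases.items():
            if customer_id in (id_a, id_b):
                self.aliases[key] = merged
        return merged


table = AliasTable()
a = table.resolve("Cloudera")
b = table.resolve("Cloudera, inc.")   # different key, so a separate customer
merged = table.merge(a, b)            # the manual step
print(table.resolve(" cloudera ") == merged)
```

After the merge, later loads under either name resolve to the merged row without further intervention.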
Another major issue I had to tackle was incorrect or incomplete data. Our opportunity data, for example, had various fields that were not used when the system was first configured. Back then, for instance, contract terms were always a set period of time. Some fields, such as the product quantity, were changed, so older records had a value but in different units than new records. For this tricky data, rather than simply reading the values directly, I wrote helper methods to return the value, sometimes trying four or five different ways to infer the actual value. For the contract end date, for example, I had to base the value on the close date about half the time. In this case, one helper method returned a tuple of the value it was trying to infer (the end date) and a Boolean indicating whether that value was explicitly specified or inferred from another field. In the web view, I used the explicit-or-inferred flag to add an annotation in the UI indicating when the value was approximated. Users could then look elsewhere if precision was important. This annotation also exposes where source data is missing fields, which can help us update the source data.
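A helper of this kind might look roughly like the following. The field names and the one-year term are assumptions made for the example; the real helpers tried more fallbacks than the single one shown here.

```python
from datetime import date, timedelta
from typing import Optional, Tuple

# Assumption for illustration: older contracts ran for a fixed one-year term.
ASSUMED_TERM = timedelta(days=365)


def contract_end_date(opportunity: dict) -> Tuple[Optional[date], bool]:
    """Return (end_date, is_explicit) for an opportunity record.

    is_explicit is True when the record carried its own end date and False
    when the date was inferred from the close date (or could not be found).
    The field names "contract_end_date" and "close_date" are hypothetical.
    """
    end = opportunity.get("contract_end_date")
    if end is not None:
        return end, True
    close = opportunity.get("close_date")
    if close is not None:
        return close + ASSUMED_TERM, False
    return None, False


end, explicit = contract_end_date({"close_date": date(2011, 1, 1)})
label = str(end) if explicit else f"{end} (approximate)"
print(label)
```

Returning the flag alongside the value lets the view decide how to present uncertainty instead of hiding it inside the data layer.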
Businesses are always changing, so the data that a business decided to keep track of two years ago may not be the same data that makes sense to keep track of today. Many of the values I was trying to infer from data were from fields that we added to our system. The old data still lacked meaningful values because no one had gone back and fixed all of the past data. Each data source having a “name” field seems great until you realize that none of the names are the same, and having a field representing the exact information you want is great until you realize that it’s null in 40% of cases.
Another challenge was that I needed to interact with many different APIs, each with its own quirks, and similarly, I had to use various libraries to parse the different kinds of data and handle the transformations. I always expected APIs and libraries to be perfectly documented and to be designed with my use case in mind. This was rarely the reality, so these tasks frequently took much longer than I expected. I knew from past experience that building good software always takes longer than planned, but it seemed even more true this summer. I realized that one factor contributing to this was that my whole project was centered around touching as many different “things” as possible. Each time I integrated with a new library, API, or existing codebase, there was always an additional cost of figuring out how to approach it. Additionally, there was always a chance that some aspect of the new process would not work as advertised or not provide a direct, optimized way for me to do what I needed. Significantly underestimating the difficulty of every single piece of my project helped me improve my ability to make those estimations.
Overall, my internship at Cloudera was amazing. I got to work with very smart people building and shipping high-quality software using HBase. I got to sit next to, eat lunch with, and hear internal talks by people working on an array of fascinating things, happy to share knowledge and advice. I saw how a software company operates and caught glimpses into how diverse companies (Cloudera’s customers) operate. At the beginning of my summer I’d thought my project was going to be a series of unexciting tasks, many of which I’d done before. As it turned out, I encountered some very interesting problems and learned a lot of good lessons to carry forward.
Find available opportunities via the Cloudera Careers web page.