This post is courtesy of Greg Poulos, a software engineer at Rapleaf.
At Rapleaf, our mission is to help businesses and developers create more personalized experiences for their customers. To this end, we offer a Personalization API that you can use to get useful information about your users: query our API with an email address and we’ll return a JSON object containing data about that person’s age, gender, location, interests, and potentially much more. With this data, you could, for example, build a recommendation engine into your site. Or send out emails tailored specifically to your users’ demographics and interests. You get the idea.
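As a rough illustration of the "tailored emails" idea, here is a minimal Python sketch that works on a response of the kind described above. The JSON field names, the sample values, and the `pick_email_template` helper are all assumptions made up for this example; the real API's response format may differ, and no network call is made.

```python
import json

# Hypothetical response body: field names and value formats are assumptions,
# not the documented output of the Personalization API.
SAMPLE_RESPONSE = """
{
  "age": "25-34",
  "gender": "Male",
  "location": "San Francisco, California",
  "interests": ["cycling", "photography"]
}
"""


def pick_email_template(profile, templates, default="general"):
    """Return the template for the first interest we have a template for."""
    for interest in profile.get("interests", []):
        if interest in templates:
            return templates[interest]
    return default


profile = json.loads(SAMPLE_RESPONSE)

# Hypothetical mapping from interest tags to email templates.
templates = {"photography": "camera_deals", "cooking": "recipe_digest"}

print(pick_email_template(profile, templates))  # prints "camera_deals"
```

A profile with no matching interests, or with no `interests` field at all, falls back to the default template.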
The main product we offer is an API, but Rapleaf is a data company at heart: our API is backed by a massive store of consumer data that comes from a wide variety of sources. We have over a billion email addresses in our system, our main datastore is on the order of terabytes of data, and we need to be able to normalize, analyze, and package this data on a regular basis. How do we manage this? With a 200-node Hadoop cluster.
The Olden Days
Once upon a time, Rapleaf used RDBMSes to write, update, and read all this data in real time. Specifically, there was lots of MySQL: lots and lots and lots of MySQL. This system worked okay initially. But as we scaled up our operations, we found ourselves having to build complex asynchronous processing, master/slave replication, and sharding systems in order to keep up performance. This wasn’t working well because MySQL simply couldn’t scale with us.
Hadoop has allowed Rapleaf to work with data at scale much more easily than any RDBMS would have. (Specifically, we’re currently using CDH3.) We started working with Hadoop back when it was still a relatively young project, and we found ourselves pushing the framework to its limits. Cloudera’s assistance was invaluable in optimizing our Hadoop setup and neutralizing any bugs we ran into. Now Rapleaf’s main processing pipeline is a batch-oriented process consisting entirely of Hadoop workflows.
Ingesting New Data
Data ingestion begins when we need to integrate new data into our system. Rapleaf gets data from a wide variety of customers and companies. We use Hadoop to process the relevant input files, writing them to unprocessed data stores on HDFS. These initial stores are completely raw: the data has not yet been normalized, and may even be objectively bad—an impossible date (3/32/11), for example, or an email address with invalid syntax (a@b@firstname.lastname@example.org).
The first step in our analysis pipeline is ingestion, a Hadoop workflow that regularly runs over the newest batch of raw data. It separates out the invalid data and normalizes everything else into a canonical form. For instance, we might start out with a “raw age” data unit containing the string “23”; after ingestion, we will have a “normalized age” data unit containing the number 23. Once ingestion has produced a “clean” store of validated and normalized data, we can get cracking on the really interesting stuff.
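To make the validate-and-normalize step concrete, here is a minimal single-machine sketch in Python. It is not Rapleaf's ingestion workflow (which runs on Hadoop); the function names, the `(kind, raw)` record shape, the age bounds, the date format, and the deliberately simple email pattern are all assumptions chosen for illustration.

```python
import re
from datetime import datetime

# Deliberately simple email check (an assumption for this sketch): exactly one
# "@", no whitespace, and at least one dot in the domain part.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def normalize(kind, raw):
    """Return the canonical value for a raw data unit, or None if invalid."""
    raw = raw.strip()
    if kind == "age":
        try:
            age = int(raw)
        except ValueError:
            return None
        return age if 0 <= age <= 130 else None
    if kind == "date":
        try:
            # Assumes US-style month/day/two-digit-year input.
            return datetime.strptime(raw, "%m/%d/%y").date()
        except ValueError:
            return None
    if kind == "email":
        return raw.lower() if EMAIL_RE.match(raw) else None
    return None  # unknown kinds are treated as invalid in this sketch


def ingest(raw_units):
    """Split raw (kind, value) units into normalized and rejected lists."""
    clean, rejected = [], []
    for kind, raw in raw_units:
        value = normalize(kind, raw)
        if value is None:
            rejected.append((kind, raw))
        else:
            clean.append((kind, value))
    return clean, rejected


clean, rejected = ingest([
    ("age", "23"),                # becomes the number 23
    ("date", "3/32/11"),          # impossible date, rejected
    ("email", "a@b@c.example"),   # two "@" signs, rejected
])
```

After this runs, `clean` holds the single unit `("age", 23)` and `rejected` holds the other two units in their original raw form, so bad input is set aside rather than silently dropped.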
When we import our data, it is always tied to some kind of identifier—a name, email address, or the like—because otherwise it would be just a useless tidbit of data aimlessly floating amidst a vast sea of information. As you might imagine, our data becomes much more powerful if we’re able to link these identifiers together. Think about it this way: if I know that the emails “email@example.com” and “firstname.lastname@example.org” belong to the same person, any data I have associated with either email can be grouped together as applying to the same underlying “entity”. Instead of knowing a little bit about two separate emails, I now know a whole bunch about the person who owns both of them—a dude who happens to be named Greg Poulos.
Using Hadoop, Rapleaf has implemented a novel distributed graph algorithm that resolves these equivalences between identifiers and tags different data units as belonging to the same entity. And this is just one step of the heavy-duty analysis program all our data gets subjected to. When appropriate, we do data inference (for example, it’s often possible to infer gender from first names with a high degree of confidence), and when data from different sources conflict, we intelligently decide what the “canonical” data for a given entity will be.
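The inference and conflict-resolution steps can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the name-count table is toy data invented for the example, the 0.95 confidence threshold is arbitrary, and weighted voting is just one plausible way to choose a canonical value, not necessarily the rule Rapleaf uses.

```python
from collections import defaultdict

# Toy, made-up counts of how often each first name appears with each gender.
NAME_COUNTS = {
    "greg": {"male": 990, "female": 10},
    "taylor": {"male": 450, "female": 550},
}


def infer_gender(first_name, min_confidence=0.95):
    """Infer gender from a first name, or return None if not confident enough."""
    counts = NAME_COUNTS.get(first_name.lower())
    if not counts:
        return None
    gender, count = max(counts.items(), key=lambda kv: kv[1])
    confidence = count / sum(counts.values())
    return gender if confidence >= min_confidence else None


def canonical_value(observations):
    """Pick the value with the highest total weight across sources.

    observations is a list of (value, source_weight) pairs; the weights are
    hypothetical per-source trust scores.
    """
    totals = defaultdict(float)
    for value, weight in observations:
        totals[value] += weight
    return max(totals, key=totals.get)


print(infer_gender("Greg"))    # prints "male" (990 of 1000 is above the threshold)
print(infer_gender("Taylor"))  # prints "None" (550 of 1000 is too ambiguous)

# One trusted source outweighs two weaker sources that agree with each other.
print(canonical_value([("San Francisco", 0.9), ("Oakland", 0.4), ("Oakland", 0.3)]))
# prints "San Francisco"
```

Returning `None` for ambiguous names keeps low-confidence guesses out of the data instead of forcing an answer.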
Serving the Data
After our analysis workflow is complete and the data is summed up with respect to a given entity, we need to package the data into easily served objects. No prizes for guessing how we do this packaging. (Hint: you might have noticed that we’ve used Hadoop to do pretty much everything else up to this point.)
To serve all this data, Rapleaf has developed a unique distributed key-value storage system that we’ve recently open-sourced. It’s called Hank, it’s designed for batch updates, and it’s lightning-fast. So we load our summarized data packs into Hank, at which point everything is ready for our front-end web servers to serve up. Hadoop’s job is complete, and it has performed admirably.
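To illustrate what "designed for batch updates" can mean for a key-value store, here is a toy in-memory sketch in Python. It is not Hank and does not reflect Hank's API or internals; the `BatchStore` class, its method names, and the hash-partitioning scheme are assumptions. The idea shown is that each batch load builds a complete new version off to the side and then swaps it in, so reads never see a half-updated store and there are no per-record writes to coordinate.

```python
import zlib


class BatchStore:
    """Toy read-only key-value store that is replaced wholesale by each batch."""

    def __init__(self, num_partitions=4):
        self.num_partitions = num_partitions
        self.version = 0
        self._partitions = [{} for _ in range(num_partitions)]

    def _partition_of(self, key):
        # CRC32 is stable across runs, so a key always maps to the same partition.
        return zlib.crc32(key.encode("utf-8")) % self.num_partitions

    def load_batch(self, records):
        """Build a full new version from (key, value) pairs, then swap it in."""
        fresh = [{} for _ in range(self.num_partitions)]
        for key, value in records:
            fresh[self._partition_of(key)][key] = value
        # Readers are switched to the new version by replacing the reference.
        self._partitions = fresh
        self.version += 1

    def get(self, key):
        return self._partitions[self._partition_of(key)].get(key)


store = BatchStore()
store.load_batch([("greg@example.com", {"age": 23})])
first = store.get("greg@example.com")

# The next batch fully replaces the previous one; keys absent from it are gone.
store.load_batch([("alice@example.com", {"age": 31})])
second = store.get("greg@example.com")
```

After the second load, `first` is still `{"age": 23}` while `second` is `None`, because each batch is a complete replacement rather than an incremental update.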
None of this would have been feasible without Hadoop. The amount of data is just too large, and the rigid schema of a traditional RDBMS isn’t really appropriate for the huge variety of data Rapleaf deals with. Now our schema can be easily changed, and we can run ad hoc analyses far more efficiently than we could in MySQL. By designing our systems around large, batch-oriented processes, we’ve been able to scale up our operations orders of magnitude beyond what would have been possible with systems that support real-time updates.