Apache Hadoop exists within a rich ecosystem of tools for processing and analyzing large data sets. At Facebook, my previous employer, we contributed a few projects of note to this ecosystem, all under the Apache 2.0 license:
- Thrift: A cross-language RPC framework that powers many of Facebook’s services, including search, ads, and chat. Among other things, Thrift defines a compact binary serialization format that is often used to persist data structures for later analysis.
- Scribe: A Thrift service for distributed logfile collection. Scribe was designed to run as a daemon process on every node in your data center and to forward log files from any process running on that machine back to a central pool of aggregators. Because of its ubiquity, a major design point was to make Scribe consume as little CPU as possible.
- Hive: Once the data has been serialized using Thrift and collected using Scribe, it can be loaded into a Hadoop cluster for analysis. Running Hive on top of your Hadoop cluster lets you query the data using a SQL-like syntax; Hive also manages the partitioning of logs inside the Hadoop Distributed File System.
- Cassandra: If you’ve got millions of users requesting and updating data, Cassandra can help you scale with your community. Cassandra was designed to power inbox search at Facebook and is now storing an index of around 35 TB. Design points included incremental scalability and low system administration overhead; Cassandra could be useful in many places where a horizontally partitioned (“sharded”) MySQL instance is currently deployed.
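To make "incremental scalability" a bit more concrete, here is a rough sketch of consistent hashing, the general technique that Dynamo-style stores such as Cassandra build on, compared with the modulo-style sharding often used in front of partitioned MySQL. This is a toy illustration, not Cassandra's actual partitioner: the `HashRing` class, the node names, the `user:N` keys, and the choice of 64 virtual nodes are all assumptions made up for the example.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a stable integer position on the ring."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Toy consistent-hash ring (hypothetical; not Cassandra's implementation)."""

    def __init__(self, nodes, vnodes=64):
        self._vnodes = vnodes
        self._ring = []  # sorted list of (position, node) pairs
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        # Each node owns several positions so load spreads more evenly.
        for i in range(self._vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def node_for(self, key):
        # Pick the first ring position at or after the key's hash, wrapping around.
        idx = bisect.bisect_left(self._ring, (_hash(key), ""))
        return self._ring[idx % len(self._ring)][1]


keys = [f"user:{n}" for n in range(10_000)]

ring = HashRing(["node-a", "node-b", "node-c"])
before = {k: ring.node_for(k) for k in keys}

ring.add_node("node-d")  # grow the cluster by one machine
after = {k: ring.node_for(k) for k in keys}

moved = sum(1 for k in keys if before[k] != after[k])
# Naive sharding for comparison: shard = hash % number_of_shards, going from 3 to 4 shards.
modulo_moved = sum(1 for k in keys if _hash(k) % 3 != _hash(k) % 4)

print(f"consistent hashing moved {moved} keys; modulo sharding moved {modulo_moved}")
```

With the ring, the only keys that change owner are the ones the new node takes over, so adding a machine disturbs a fraction of the data. With modulo sharding, changing the shard count reassigns most keys, which is part of what makes growing a sharded MySQL deployment painful.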
I was recently invited by Robert Grossman of Open Data to speak about these projects at the inaugural Cloud Computing and Its Applications conference in Chicago. You can check out the slides from my talk below:
All of these projects have small but growing user communities. I hope you’ll find them useful for your data management projects, and I look forward to seeing a few new users on the mailing lists soon.
— Jeff Hammerbacher, VP Product and Chief Scientist, Cloudera