This post was contributed by Friso van Vollenhoven from Xebia. Friso is a consultant at Xebia and he is currently working at the RIPE NCCs Information Services department on migrating an existing MySQL based solution to use Hadoop and HBase. Xebia performs large scale development projects, consultancy in architecture & auditing and helps to bring your middleware under control. Guiding clients in how to leverage agile is Xebia’s passion.
The RIPE NCC is one of five Regional Internet Registries (RIRs) providing Internet resource allocations, registration services and co-ordination activities that support the operation of the Internet globally. The RIPE NCC also provides services for the benefit of the Internet community at large. Amongst these is the Routing Information Service, which collects and stores Internet routing data from several locations around the globe. At RIPE NCC, we are in the progress of migrating an existing MySQL based system to use Apache HBase as storage backend and Hadoop MapReduce as framework for processing import jobs. In this post we will provide some background on our efforts, how we implemented it on top of Apache Hadoop and HBase and our experiences with using Hadoop and HBase in a real life scenario.
The global internet is a large and distributed system. One of the reasons it works is that it has a shared-nothing architecture at the core of its operation. Different parts of the network are autonomous systems which achieve their goal of routing traffic to destinations in other parts of the network by communicating with their neighboring networks, without obtaining global knowledge of the network beyond their peers. Providing a global view of the current state of operational details of such a system is very hard to do and requires a lot of data. At the RIPE NCCs Information Services department we aim to provide the closest thing to that by collecting substantial amounts of data on routing that goes on across the globe. Such a view can be useful for several reason — for example, when things go a little bit wrong on the internet. Like when Pakistan Telecom accidentally hijacked all of Youtubes traffic, or when Google lost a substantial amount of its traffic to a provider in Romania.
The autonomous systems of the internet communicate via a protocol called BGP to exchange information on routing. Our system works by listening in on BGP announcements in a number different places on the internet, usually close to or at large internet exchanges. We currently collect about 80 million BGP announcements per day (about 925 per second). This data collection has been going on for almost ten years, so we have a substantial collection of data. The raw uncompressed source data for this is several gigabytes per day and the total history is terabytes.
We currently operate a MySQL based system in production to hold this data and allow querying. The system has a number of limitations:
- It will only hold three months worth of data. Anything older than that is currently thrown out and is only available as raw source data (.tar.gz files), but not searchable. So investigating any event less recent than three months is a very mundane task.
- Insertion speed is an issue. Insertion into relational databases can become slow due to index updates and integrity checks. Currently inserting data can be done only slightly faster than it arrives, which means that when the insertion process lags behind for some reason, it can take weeks to catch up again.
- The internet is still growing rapidly, while this system is already at its maximum capacity, so it is not future proof.
- For the same reason, we cannot add data collection points.
Because of these limitations, we chose to re-implement the storage back end. Given the nature of the data we opted for a column based store. Initially, we investigated both HBase and Cassandra. The decision for HBase was made for a number of reasons, including:
- It has very good sequential read/scan performance, which is nice for full data analysis.
- At the time, there already was decent documentation and enough online resources available to get up and running.
- It is a proven setup in a number of real life scenarios.
- It integrates naturally with Hadoop and thus MapReduce, which is nice to have out of the box when working with large amounts of data.
Above is a high level overview of our setup. Once source data comes in, we move it to HDFS and do all processing there. From the source data, we create several derived data sets, which we store in a intermediate form on HDFS and insert into HBase.
We need the derived data sets in order to quickly answer specific queries. This is a big difference from the MySQL based solution. In the relational world, you typically normalize your data, spread it over a number of tables and apply joins when you need an answer to a specific question. In HBase this does not work. HBase will effectively do not much more than a key lookup for a certain value or range of values. With this model you really need to know in advance what kind of questions you need to answer and model the data storage accordingly. We keep several data sets with denormalized answers to a number of specific questions. You can see it as tables of pre-computed joins for all possible join combinations. Also, all your data in HBase has only one index. This means that for each additional index we add, we need to effectively duplicate the (denormalized) data.
Working with MapReduce has proven to be a very nice way of handling data. We try to do as much as possible in MapReduce jobs. This brings the benefit of fault tolerance and parallelization without having to think about it. We just assign more capacity to jobs as the dataset gets larger. For example a job that does an initial import of months or years of historical data is assigned more map and reduce slots than the import that runs every five minutes to keep up-to-date. Apart from the capacity planning, they are exactly the same piece of code, which is nice. Insertion into HBase is also modeled as a MapReduce job to achieve high parallelization.
We have not moved to production with the new setup yet, but some conclusions from our experience so far are:
- The running time of an import of one full year of data went from months to days.
- The solution scales well with the amount of data. There is no difference between having a couple of months or a couple of years of data online.
- Query performance is competitive with most RDBMS (especially compared to a sharded setup). Querying HBase for typical queries gives sub-second results. Results for large queries come in streaming at well over 15 MB/s.
- You dont necessarily need to process terabytes of data for MapReduce to be effective. Its about the ability to scale in accordance with the size of a job. It works for smaller jobs too and it will provide some fault tolerance in the process.
The above results basically mean that HBase delivered on its promise. Our four node development setup will happily do up to 300,000 operations per second. It solves an important problem for us which we could not do with the relational back end. Using a non-relational storage system for live queries against large data sets requires a lot of up-front thinking, but the bottom line is that the relational storage just would not scale to the point that we need it to.
Our production setup will be slightly larger than the development environment. The cluster will have 8 worker nodes, each running a data node, task tracker and region server. There are two master nodes in a Linux HA setup which run the name node, job tracker and master server. Because we find that we are completely IO bound in our jobs on the development cluster, we choose to scale out the IO; each data node has 10 data disks of 600GB (10K RPM). The workers have 64 GB RAM each. A lot of it goes to HBase. This will basically mean that the hotspot in our data set will mostly be in RAM, which helps off load the disks.
Having all our data on HDFS and in HBase provides a lot of opportunities to do things that were hard or impossible before. Creating a job that goes over all historical data and extracts statistical information, detects anomalies or produces data for interesting graphs or trends has all become a lot easier. We are already looking into things in that direction.
Our setup uses CDH3 beta. In development we install from tar archives. In production we will use the provided RPMs. Our experience with Hadoop and HBase has been generally good. It mostly works as advertised. Here are a number of lessons learned:
- Getting a basic setup up and running is relatively straightforward. Tuning and troubleshooting is the hard part.
- Distributed debugging is hard and, above all, time consuming. Make sure you know where all your logs go (before you need them). When you know what information you need, getting it should not be an obstacle.
- Tuning Hadoop jobs and HBase is non-trivial (just like properly tuning a RDBMS for high performance). Prepare to spend time investigating the internals of the framework and HBase, or get help.
The work described above is just an initial effort. Probably there will be ongoing adoption of Hadoop and HBase for importing and analyzing some of the other data sources that we harvest apart from routing information. Also, the production cluster environment will more become a first class citizen of the IT infrastructure of RIPE NCC for holding, querying and processing data.
Id like to thank the guys who have all contributed to the work described here, including Age Mooij, Erik Romijn and Szabolcs Andrasi and Erik Rozendaal who has initiated the use of Hadoop and HBase at RIPE NCC.