Adopting Apache Hadoop in the Federal Government
Loren Siebert is a San Francisco entrepreneur and software developer, and is currently the technical lead for the USASearch program.
The United States federal government’s USASearch program provides hosted search services for government affiliate organizations, shares APIs and web services, and operates the government’s official search engine at Search.USA.gov. The USASearch affiliate program offers free search services to any federal, state, local, tribal, or territorial government agency. Several hundred websites make use of this service, ranging from the smallest municipality to larger federal sites like weather.gov and usa.gov. The USASearch program leverages the Bing API as the basis for its web results and then augments the user search experience by providing a variety of government-centric information such as related search topics and highlighted editorial content. The entire system is comprised of a suite of open-source tools and resources, including Apache Solr/Lucene, OpenCalais, and Apache Hadoop. Of these, our usage of Hadoop is the most recent. We began using Cloudera’s Distribution including Apache Hadoop (CDH3) for the first time in the Fall, and since then we’ve seen our usage grow every month— not just in scale, but in scope as well. But before highlighting everything the USASearch program is doing with Hadoop today, I should explain why we began using it in the first place.
Phase 1: Search analytics
All of the search and API traffic across hundreds of affiliate sites, iPhone apps, and widgets comes through a single search service, and this generates a lot of data. To improve the service, administrators wanted to see aggregated information on what sorts of information searchers were looking for, how well they were finding it, what trends were forming, and so on. Once searches were initiated, they also wanted to know what results were shown and then what results were clicked on. They wanted to see all this information broken down by affiliate over time, and also aggregated across the entire affiliate landscape.
The initial system, like many initial systems, was fairly simple and did just enough to address our most pressing analytics needs. We took the logs from Apache and Ruby on Rails, put them in a big MySQL database on a separate machine, ran nightly and monthly jobs on them, and then exported the summary results to the main database cluster to be served up via a low-latency Rails web interface for analytics. We used a separate physical database machine with lots of memory and disk for the batch processing to keep our production MySQL instances from being impacted by this resource-intensive batch processing.
As we watched the main database tables grow and the nightly batch jobs take longer and longer, it became clear that we would soon exhaust the resources available on the single database analytics processing node. We looked at scaling up the hardware vertically and sharding the database horizontally, but both options seemed like we were just kicking the can down the road. Larger database hardware would be both costly and eventually insufficient for our needs, and sharding promised to take all the usual issues associated with a single database system (backups, master/slaves, schema management) and multiply them. We wanted the system to be able to grow cost effectively and without downtime, be naturally resilient to failures, and have backups handled sensibly. It was at this point that we started investigating HDFS, Hadoop, and Apache Hive.
HDFS offered us a distributed, resilient, and scalable filesystem while Hadoop promised to bring the work to where the data resided so we could make efficient use of local disk on multiple nodes. Hive, however, really pushed our decision in favor of a Hadoop-based system. Our data is just unstructured enough to make traditional RDBMS schemas a bit brittle and restrictive, but has enough structure to make a schema-less NoSQL system unnecessarily vague. Hive let us compromise between the two— it’s sort of a “SomeSQL” system.
But best of all, we could layer the entire Cloudera stack on top of a subset of our existing production machines. By making use of each machine’s excess reserve capacity of disk, CPU, and RAM, we were able to get a small proof-of-concept cluster stood up without purchasing any new hardware. The initial results confirmed that our workload lent itself well to distributed processing, as one job went from taking over an hour on a MySQL node to 20 minutes on a three machine Hadoop cluster. Within a week of getting the prototype up and running, we had transitioned all the remaining nightly analytics batch SQL jobs into Hive scripts. The job output fed into a collection of intermediate Hive tables, from which we generated summary data to export to MySQL as low-latency tables for the Rails web interface to use. To prove the scaling point, we spent five minutes adding another datanode/tasktracker to the mix, kicked off the cluster rebalancer, and the whole process ran faster the next day.
Phase 2: Feedback loop
The result of all this analysis in Hive shows up not just in various analytics dashboards, but as part of the search experience on many government websites, too. For example, compare the different type-ahead suggestions for ‘gran’ on nps.gov and usa.gov. Both sites use the same USASearch backend system, but the suggestions differ completely. We use Hadoop to help us generate contextually relevant and timely search suggestions for hundreds of government sites like this.
Phase 3: Internal monitoring
Shortly after moving our event stream analysis from MySQL to Hadoop/Hive, we noticed a change in how we thought about data. Freed from the constant anxiety of wondering how we were going to handle an ever-increasing amount of data, we shifted from trying to store only what we really needed to storing whatever we thought might be useful. We first turned our attention to the performance data emitted by the various sub-systems that make up the search program.
Each search results page is potentially made up of many data modules sintered together from calls to the Bing API, our own Solr indexes, a MySQL cluster, and a Redis cache. A small latency problem with any one of them can propagate through the system to create much larger problems, so we want to have a deep knowledge of how each subsystem behaves throughout the day under various circumstances. We were already monitoring the availability of all these services with Opsview, but we had no insight into their performance over time. Whenever we sensed a problem (“Is one of the Solr indexes getting slow?”), we would liberally apply ssh, tail -f, and grep to try to see what was going on. This seemed like a good use case for Hive, so in the case of Solr, for example, we threw the compressed log files into HDFS, wrote a simple SerDe regular expression to define rows in a Hive table partitioned by date, and built a view on top of that for easy manipulation of extracted columns such as the Solr index name and the hour of day. Hive makes it trivially easy to do some fairly sophisticated aggregate analysis on the response times for each Solr index, such as generating a distribution histogram, or calculating the pth percentile.
Some of these Hive queries are not run very often— perhaps just a few times a month— so we don’t want to spend time building them into our test-driven Rails analytics framework. On the other hand, they are complex enough that we don’t want to rewrite them every time we want to use them. For these cases, we use Beeswax in HUE to save off parameterized queries that can then be shared among engineers or analysts to run on an ad hoc basis.
In the space of a few months, we’ve gone from having a brittle and hard-to-scale RDBMS-based analytics platform to a much more agile Hadoop-based system that was designed to scale intrinsically. We continue to see our Hadoop usage grow in scope with each new data source we add, and it’s clear that we’ll be relying on it more and more in the future as the suite of tools and resources around Hadoop grows and matures.