This post was contributed by The Global Biodiversity Information Facility development team.
The Global Biodiversity Information Facility is an international organization, whose mission is to promote and enable free and open access to biodiversity data worldwide. Part of this includes operating a search, discovery and access system, known as the Data Portal; a sophisticated index to the content shared through GBIF. This content includes both complex taxonomies and occurrence data such as the recording of specimen collection events or species observations. While the taxonomic content requires careful data modeling and has its own challenges, it is the growing volume of occurrence data that attracts us to the Apache Hadoop stack.
The Data Portal was launched in 2007. It consists of crawling components and a web application, implemented in a typical Java solution consisting of Spring, Hibernate and SpringMVC, operating against a MySQL database. In the early days the MySQL database had a very normalized structure, but as content and throughput grew, we adopted the typical pattern of denormalisation and scaling up with more powerful hardware. By the time we reached 100 million records, the occurrence content was modeled as a single fixed-width table. Allowing for complex searches containing combinations of species identifications, higher-level groupings, locality, bounding box and temporal filters required carefully selected indexes on the table. As content grew it became clear that real time indexing was no longer an option, and the Portal became a snapshot index, refreshed on a monthly basis, using complex batch procedures against the MySQL database. During this growth pattern we found we were moving more and more operations off the database to avoid locking, and instead partitioned data into delimited files, iterating over those and even performing joins using text files by synthesizing keys, sorting and managing multiple file cursors. Clearly we needed a better solution, so we began researching Hadoop. Today we are preparing to put our first Hadoop process into production.
Our first objective is to address the monthly processing we perform. This area of work does not increase functionality offered through the portal (that will be addressed following this infrastructural work) but rather aims to:
- Reduce the latency between a record changing on the publisher side, and being reflected in the index
- Reduce the amount of (wo)man-hours needed to coax through a successful processing run
- Improve the quality assurance by inclusion of
- Checking that terrestrial point locations fall within the stated country using shapefiles
- Checking coastal waters using Exclusive Economic Zones
- Rework all the date and time handling
- Use dictionaries (vocabularies) for interpretation of fields such as Basis of Record
- Integrate checklists (taxonomic, nomenclatural and thematic) shared through the GBIF ECAT Programme to improve the taxonomic services, and the backbone (“nub”) taxonomy.
- Provide a robust framework for future development
- Allow the infrastructure to grow predictably with content and demand growth
Things have progressed significantly since the early Hadoop investigations which included hand crafting MapReduce jobs, and GBIF are now developing using the following technologies:
- Apache Hadoop: A distributed file system and cluster processing using the Map Reduce framework
- GBIF are using the Cloudera’s Distribution including Apache Hadoop
- Sqoop: A utility to synchronize between relational databases and Hadoop
- Hive: A data warehouse infrastructure built on top of Hadoop, and developed and open-sourced by Facebook. Hive gives SQL capabilities on Hadoop, which is particularly attractive to a development team fluent in SQL. [Full table scans on GBIF occurrence records reduce from hours to minutes]
- Oozie: An open-source workflow/coordination service to manage data processing jobs for Hadoop, developed and then open-sourced by Yahoo!
The processing architecture is depicted:
Following this processing work, we expect to modify our crawling to harvest directly into HBase. The flexibility HBase offers will allow us to grow incrementally the richness of the terms indexed in the Portal, while integrating nicely into Hadoop based workflows. The addition of coprocessors to HBase is of particular interest to further reduce the latency involved in processing, by eliminating batch processing altogether.
The combination of Hadoop, Oozie and Hive offer a framework that we anticipate will fit nicely with many of our data transformation tasks, and Sqoop and Hive have made the technologies far more accessible to our development team than was previously possible.
All GBIF source code is available under open source licensing, and this work is regularly blogged on the GBIF Developer Blog.