In this three-part series of posts, we will share our experiences tackling a scientific computing challenge that may serve as a useful practical example for those readers considering Apache Hadoop and Apache Hive as an option to meet their growing technical and scientific computing needs. This first part describes some of the background behind our application and the advantages of Hadoop that make it an attractive framework in which to implement our solution. Part II dives into the technical details of the data we aimed to analyze and of our solution. Finally, we wrap up this series in Part III with a description of some of our main results, and most importantly perhaps, a list of things we learned along the way, as well as future possibilities for improvements.
About a year ago, after hearing increasing buzz about big data in general, and Hadoop in particular, I (Brad Rubin) saw an opportunity to learn more at our Twin Cities (Minnesota) Java User Group. Brock Noland, the local Cloudera representative, gave an introductory talk. I was really intrigued by the thought of leveraging commodity computing to tackle large-scale data processing. I teach several courses at the University of St. Thomas Graduate Programs in Software, including one in information retrieval. While I had taught the abstract principles behind the scale and performance solutions for indexing web-sized document collections, I saw an opportunity to integrate a real-world solution into the course.
Our department had an idle computing cluster. While it wasn’t an ideal Hadoop platform, because of the limited disk arms available in the blade configuration, our computing support staff and a grad student installed Ubuntu and Hadoop. We immediately had trouble with frequent crashes, and Brock came by to diagnose our problem as a hardware memory configuration issue. We got the cluster running just in time for use by a few student projects in my information retrieval class. We decided to go with Cloudera’s Distribution Including Apache Hadoop (CDH) because initially learning about the technologies and bringing up a new cluster is complex enough, and we wanted the benefit of a software collection, including patches, that was already configured to work together. The mailing lists were also an important benefit, letting us search for solutions to problems others had already encountered and get quick responses to new questions from Cloudera employees and other users.
In December, I had lunch with a faculty member from our University’s biology department, Jadin Jackson. He is a clinical faculty member and a neuroscientist who recently joined our University after finishing several post-doc research positions. Jadin described his work analyzing rat brain EEG waveforms on a MATLAB workstation to try to understand the neural communication between different brain regions while the rats run a maze. The task is very compute- and data-intensive. I described my recent interest in Hadoop. We soon wondered if Hadoop might be a good solution for Jadin’s digital signal processing application. While most existing Hadoop applications are I/O intensive, we thought it would be interesting to explore this CPU-intensive task with this computing architecture. Since any real-world application would provide a great vehicle for improving our knowledge of the Hadoop ecosystem, we agreed to pursue this.
Jadin has a background in electrical engineering and physics, as well as neuroscience, and has always been interested in computing. He viewed this project not only as a way to tackle his backlog of rat brain signal data to further his research, but also as a way to explore recent developments in cluster computing. Jadin had already developed the computational, statistical, and visualization techniques for this analysis, so he was our domain expert, guiding development, providing test data, and addressing questions.
One of the students from my information retrieval class, Ashish Singh, was interested in an independent study opportunity to improve his Hadoop knowledge, so he joined our team. The three of us then spent the 2012 spring semester using Hadoop, with Hive added in as well, to analyze rat brain neuronal signals. Our university has a Faculty Partnership Program that encourages interdisciplinary collaboration, so Jadin taught me some neuroscience and I taught Jadin about the Hadoop ecosystem. When I saw that Brock was teaching a local Cloudera Hadoop development class, I signed up and also became certified.
Prior to starting this work, Jadin had data he had gathered himself, along with data from other neuroscience researchers interested in the role of the brain region called the hippocampus. In both rats and humans, this region is responsible for both spatial processing and memory storage and retrieval. For example, as a rat runs a maze, neurons in the hippocampus, each representing a point in space, fire in sequence. When the rat revisits a path, and pauses to make decisions about how to proceed, those same neurons fire in similar sequences as the rat considers the previous consequences of taking one path versus another. In addition to this binary-like firing of neurons, brain waves, produced by ensembles of neurons, are present in different frequency bands. These act somewhat like clock signals, and the phase relationships of these signals correlate to specific brain signal pathways that provide input to this sub-region of the hippocampus.
The goal of the underlying neuroscience research is to correlate the physical state of the rat with specific characteristics of the signals coming from the neural circuitry in the hippocampus. Those signal differences reflect the origin of signals to the hippocampus. Signals that arise within the hippocampus indicate actions based on memory input, such as reencountering previously encountered situations. Signals that arise outside the hippocampus correspond to other cognitive processing. In this work, we apply digital signal processing to the output of individual neurons, turning it into spectral information related to the brain region of origin for the signal input.
While the initial impetus for this project was to learn more about the Hadoop ecosystem, there are several compelling technical advantages to recent big data technologies in general, and Hadoop in particular. Improved throughput for processing large datasets, and having all of the data online, are two key advantages. Hadoop is a parallel computing architecture designed to take advantage of the increased I/O bandwidth (the speed at which data can be read or written) that results from having data spread across many disks. As the processor speeds for individual microprocessor cores have plateaued in recent years, the number of cores per processor has increased. However, this increase in core count has not been accompanied by dramatic I/O speed increases for reading from and writing to disk. Additionally, disk access latencies (the time it takes to find a location to read from or write to) have remained relatively constant for a decade or more. This means that for a single computer, increasing processing speeds or the number of processor cores is of only limited value, since getting the data to the processor, i.e. I/O bandwidth, is a major bottleneck affecting performance for many types of applications. Since the Hadoop architecture spreads the data across the many hard disks in the machines (or nodes) within a cluster, with the help of a storage system called the Hadoop Distributed File System (HDFS), the effective I/O bandwidth is multiplied by the number of nodes (and disks within those nodes) available for use.
The Hadoop MapReduce architecture takes the processing to the data, rather than relying on moving data across or between nodes within the cluster. This means that Hadoop is especially adept at working with very large data sets that are spread across the large storage capacity composed of all of the hard disks in the cluster’s nodes. The MapReduce paradigm starts with the map step, where the same operation is performed on each node where the data of interest resides. The map task on each node sends its results to a reduce task in the reduce step. The reduce tasks compile and collate the results, performing the aggregation operations needed to output the end product of the MapReduce job.
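The map-shuffle-reduce flow described above can be sketched in a few lines of plain Python. This is only an illustration of the paradigm, not Hadoop code: the record format (a hypothetical list of per-channel signal samples) and the mapper/reducer functions are invented for the example, and in a real cluster the shuffle step happens over the network between nodes.

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the mapper to every input record, yielding (key, value) pairs."""
    for record in records:
        yield from mapper(record)

def shuffle(pairs):
    """Group mapped values by key, as Hadoop does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    """Apply the reducer to each key's group of values."""
    return {key: reducer(key, values) for key, values in groups.items()}

# Hypothetical input: (channel, sample) pairs from a recording.
records = [("ch1", 0.5), ("ch2", 1.2), ("ch1", -0.3)]

def mapper(record):
    channel, _sample = record
    yield (channel, 1)          # emit a count of 1 per sample

def reducer(key, values):
    return sum(values)          # total samples seen per channel

counts = reduce_phase(shuffle(map_phase(records, mapper)), reducer)
print(counts)  # {'ch1': 2, 'ch2': 1}
```

The key property, which the sketch preserves, is that the mapper sees one record at a time and the reducer sees all values for one key at a time; Hadoop exploits this independence to run many map and reduce tasks in parallel on the nodes holding the data.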
Importance of Hive
MapReduce jobs within Hadoop often require writing specialized Java functions for each type of job. Hive was developed to give database programmers a standard interface for handling large amounts of data that closely resembles SQL, the standard language for data access. Hive brings a data warehousing system to Hadoop for data stored in HDFS: it takes queries (commands to search, combine, or reorganize data) and executes them as MapReduce jobs within Hadoop, thereby simplifying the development of complex analyses built from steps commonly used in standard database queries.
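To give a flavor of what this looks like, here is the kind of query Hive accepts. The table and column names are purely illustrative, not the schema from this project; Hive compiles a query like this into one or more MapReduce jobs behind the scenes.

```sql
-- Hypothetical example: average spectral power per frequency band
-- for each recording session, computed across data stored in HDFS.
SELECT session_id, frequency_band, AVG(power) AS mean_power
FROM spectral_samples
GROUP BY session_id, frequency_band;
```

A grouped aggregation like this maps naturally onto the paradigm: the `GROUP BY` keys become the map output keys, and the `AVG` is computed in the reduce step.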
We quickly found that the standard advice of “first, try coding the function in Hive, and if you can’t, then do so in Java MapReduce” held true. We were able to use Hive for most of the processing in this project, with high productivity, and to leverage our existing SQL skills.
In Part II, we will discuss the types of neural signals in this analysis, and technical details of its implementation in Hadoop and Hive.