As mentioned in Part I, although Apache Hadoop and other Big Data technologies are typically applied to I/O intensive workloads, where parallel data channels dramatically increase I/O throughput, there is growing interest in applying these technologies to CPU intensive workloads. In this work, we used Hadoop and Hive to digitally signal process individual neuron voltage signals captured from electrodes embedded in the rat brain. Previously, this processing was performed on a single Matlab workstation, a workload that was both CPU intensive and data intensive, especially for intermediate output data. With Hadoop and Apache Hive, we were not only able to apply parallelism to the various processing steps, but had the additional benefit of having all the data online for additional ad hoc analysis. Here, we describe the technical details of our implementation, including the biological relevance of the neural signals and analysis parameters. In Part III, we will then describe the tradeoffs between the Matlab and Hadoop/Hive approach, performance results, and several issues identified with using Hadoop/Hive in this type of application.
For this work, we used a university Hadoop computing cluster. Note that it is blade-based, and is not an ideal configuration for Hadoop because of the limited number (2) of drive bays per node. It has these specifications:
- 24 Nodes (1 master, 23 slaves), running Ubuntu Server
- Nodes are Sun Fire X2200M2 with 2x AMD Opteron 2214 DualCore 2.2GHz processors
- Master: 18GB RAM, 2x 1 TB drives (RAID 1 mirrored)
- NameNode, SecondaryNameNode, JobTracker, MySQL for Hive metastore, user local home directories
- Slaves: 12GB RAM, 250 GB + 1 TB drives
- DataNode, TaskTracker, 2 CPU core slots for mappers and reducers
- Slots available for map/reduce tasks: 46
For this problem, the brain wave frequency band known as theta, which in the rat brain runs from 7 to 10 Hz, acts much like a master clock signal that encodes other higher frequency bands at different phases, such as the peak, trough, leading, or trailing edges of the oscillation. Specifically, the low gamma band, which runs from 35 to 60 Hz, signifies that input signals are coming into the hippocampus from within the hippocampus itself, from a region known as CA3. This high frequency wave appears as a component on the leading edge and peak of the lower frequency theta wave. The high gamma band, which runs from 85 to 165 Hz, signifies signals coming into the hippocampus from other brain structures outside that region, and reflect more cognitive processing. It appears as a component on the trailing edge of the theta wave.
The neuroscience researchers were interested in analyzing more than a hundred data sets, which we call rat runs, or sessions. Each session was recorded from a 50-electrode array consisting of two reference electrodes and twelve tetrodes, or groups of 4 electrodes (tetrode is the term used to refer to a specialized electrode capable of recording four channels of data within a small region of the brain, think of a 4 channel surround-sound microphone detecting differences in sound depending on whether the source is in front, behind, left, or right of the microphone). Each set, therefore, consisted of parallel recordings of two reference channels and 48 channels from the 12 tetrodes implanted in the rat brain, which, after some preprocessing, corresponds to signals from individual neurons. The two reference channels helped reduce both extra and intra rat brain noise. Up to 16 channels could be selected from these 50 channels for continuous monitoring of the local electrical fields that the brain generates (LFP, Local Field Potentials; a.k.a EEG). Each rat run was about an hour long, and each channel was sampled at a rate of 2 KHz. The researchers may be specifically interested in a subset of the 16 continuously sampled channels depending on the location they were recorded from, resulting in a set of up to 16 files, one for each continuously sampled channel, that each consisted of about 6 million records, each in turn consisting of a timestamp and a voltage value, in a coma-separated text format.
The single channel of data to be analyzed consists of:
- Voltage values sampled with 12 bit precision
- Sampling rate: 2,000 samples/sec
- Time-stamp values: 64 bit integer values at 1/10,000 sec precision
- Typical recording session length: 1 to 3 hours
The following figure shows the processing steps involved in the analysis, as well as some of the data sizes involved.
The neuroscience researchers were interested in both the frequency content of the neuronal signals and the changes in the signals during different behaviors within the one-hour time period. While traditional Fourier analysis can separate the signals into their weighted frequency components, wavelet analysis is better suited to highlighting both frequency and time-dependent variation. Specifically, we used a set of kernels from the Morlet wavelet family. These resemble a sign wave modulated by a Gaussian envelope. The researchers were interested in a specific range of frequencies, so we used kernels that run from 5 Hz to 200 Hz, in one Hertz increments. Here is an example of a 10 Hz Morlet kernel.
For each of the 15 signals, we needed to convolve each of the 196 kernels (5-200 Hz) with the signal. At first, we implemented a brute-force approach to calculating these convolutions by successive dot product and shift operations. In the end, we implemented a more efficient Fast Fourier Transform (FFT)-based solution by first taking the discrete Fourier transforms of a given signal, then taking the discrete Fourier transform of a kernel, multiplying together the two resulting sequences of complex numbers, and final taking the inverse discrete Fourier transform of that result. We did this operation for each kernel per channel, on each channel of interest. The output from the convolution was then squared, to get the power value of the signal. This code was written as a Java MapReduce job, leveraging an external FFT library called JTransforms https://sites.google.com/site/piotrwendykier/software/jtransforms.
Since Hadoop can parallelize operations, we had to decide on the best boundary for the parallelism. We decided that each Hadoop task would process all kernels for a single channel, so that up to 16 parallel tasks could process all of the channels for a single rat run concurrently, depending on the number of channels to be analyzed for that rat run. Since our cluster had 46 computing slots, this allowed approximately three entire rat runs to be processed in parallel. For testing, we selected a session with 15 unique channels to validate and benchmark each processing step.
We wanted to leverage Hive’s partitioning support, which allowed the query optimizer to be aware of the subdirectory structure of the MapReduce output and only read in the subdirectories needed to resolve a specific query. Also, to avoid copying data from HDFS into a Hive table, we used Hive’s external table support to treat the HDFS output as a hive table. This is our convolution output directory structure:
The convolution step produced a huge amount of data, effectively increasing the data to be stored by factor of almost 200. We addressed this data volume issue in two ways. First, we had been using a text output format for the convolution step, which is the norm in Hadoop processing because of its ease of processing. We moved to a binary format, called a SequenceFile, which caused our bytes/record value to go down from 30.5 to 18.2. In addition, we averaged all data channels taken from each of the tetrodes from the same brain region together. It turns out that this averaging was very compute intensive, and will be the focus of future research. This code was done with a Hive SQL task that leveraged all parallel computing slots, with a custom Hive Serde (serializer-deserializer) that allowed for sequence file input.
The power spectrum that results from the convolution operation was weighted heavily on low frequencies, and lightly on high frequencies (e.g. following a 1/f spectrum, where f is frequency). In order to normalize the data and make changes in power comparable across frequency bands, we do a z-scoring step. To do this, we calculated the mean and standard deviations of the output across time, so that we could later subtract the mean, and divide by the standard deviation. This code was done with a Hive SQL task that leveraged all parallel computing slots.
Since we are interested in correlating specific rat movements with brain signals, we subset the data by time intervals that correspond to the rat’s physical state, as recorded by the rat observer. This code was done with a Hive SQL task that leveraged all parallel computing slots.
Since we are ultimately interested in how high and low gamma waves appear in relation to the leading and trailing edges of the theta wave, we need to know the phase of the theta waves relative to the timestamps. In a pre-processing step, we used Matlab to identify the theta wave minima and maxima via a differentiation technique, which produces a file that shows the theta phase corresponding to each time value. This file is used as input to the subsequent phase bin assignment step.
We next assign each signal value to a phase bin. A phase bin is a segment of the uniformly-divided theta wave phase data. For our current experiment, we divided the phase (0 to 359 degrees) into 75 bins. We then joined the average channel convolution output with this phase information, by timestamp value. This produces a result that is independent of time, and consists of three values: a kernel frequency, a phase bin, and a z-scored, average of the channel squared convolution output for that kernel frequency.
In Part III, we will describe the results of these analyses, with an emphasis on the technical findings related to implementing the workflow just described.