In this three part blog series I want to take a look at how we would do a Simple Moving Average with MapReduce and Apache Hadoop. This series is meant to show how to translate a common Excel or R function into MapReduce java code with accompanying working code and data to play with. Most analysts can take a few months of stock data and produce an excel spreadsheet that shows a moving average, but doing this in Hadoop might be a more daunting task. Although time series as a topic is relatively well understood, I wanted to take the approach of using a simple topic to show how it translated into a powerful parallel application that can calculate the simple moving average for a lot of stocks simultaneously with MapReduce and Hadoop. I also want to demonstrate the underlying mechanic of using the secondary sort technique with Hadoops MapReduce shuffle phase, which well see is applicable to a lot of different application domains such as finance, sensor, and genomic data.
This article should be approachable to the beginner Hadoop programmer who has done a little bit of MapReduce in java and is looking for a slightly more challenging MapReduce application to hack on. In case you’re not very familiar with Hadoop, here’s some background information and CDH. The code in this example is hosted on github and is documented to illustrate how the various components work together to achieve the secondary sort effect. One of the goals of this article is to have this code be relatively basic and approachable by most programmers.
So lets take a quick look at what time series data is and where it is employed in the quickly emerging world of large-scale data.
What is Time Series Data?
Time series data is defined as a sequence of data points measured typically at successive times spaced at uniform time intervals. Time series data is typically seen in statistics, signal processing, and finance along with other fields. Examples of time series data are the daily adjusted close price of a stock at the NYSE or sensor readings on a power grid occuring 30 times a second.
Time series as a general class of problems has typically resided in the scientific and financial domains. However, due to the ongoing explosion of available data, time series data is becoming more prevalent across a wider swath of industries. Time Series sensors are being ubiquitously integrated in places like:
Its also been shown that shapes in images can be decomposed into time series data which allows the shapes to achieve rotation and scale invariance allowing for easier comparison. Another sector showing explosive growth in the amount of time series data produced is the genomic and bioinformatic realm. Were seeing the cost to sequence the human genome continue to decrease rapidly, shifting pressure to the storage and processing technologies for these genomes. Genome data in its text representation (GATC) can be represented as time series and thus these problems are approachable by all techniques relevant to time series processing. Time series processing underlies some techniques used in the genomics domain such as motif finding which can be approached in the same way as the median string problem. The understanding of how we can refactor traditional approaches to these time series problems when inputting into MapReduce can potentially allow us to improve processing and analysis techniques in a timely fashion.
The financial industry has long been interested in time series data and have employed programming languages such as R to help deal with this problem. The R programming language was created specifically for this class of data–as shown in the R example below. So, why would a sector create a programming language specifically for one class of data when technologies like RDBMS have existed for decades? In reality, current RDBMs technology has limitations when dealing with high-resolution time series data. These limiting factors include:
- High-frequency time series data coming from a variety of sources can create huge amounts of data in very little time
- RDBMS’s tend to not like storing and indexing billions of rows.
- Non-distributed RDBMS’s tend to not like scaling up into the hundreds of GBs, let alone TBs or PBs.
- RDBMS’s that can scale into those arenas tend to be very expensive, or require large amounts of specialized hardware
Problems with RDBMS’s queries on high resolution time series data:
- To process high resolution time series data with a RDBMS we’d need to use an analytic aggregate function in tandem with moving window predicates (ex: the “OVER” clause) which results in rapidly increasing amounts of work to do as the granularity of time series data gets finer.
- Query results are not perfectly commutable and cannot do variable step sliding windows (ex: step 5 seconds per window move) without significant unnecessary intermediate work or non-standard SQL functions.
- Queries on RDBMS for time series for certain techniques can be awkward and tend to require premature subdividing of the data and costly reconstruction during processing (example: Data mining, iSAX decompositions)
- Due to the above factors, with large amounts of time series data RDBMS performance degrades while scaling.
Most simple time series calculations are performed with everyones favorite analysis tool: the spreadsheet. However, when we need to look at data that is beyond the 65k row limit of Excel how does our approach evolve as we scale our data up? In this article well stop to take a look at the issues involved when scaling data before we jump into MapReduce and how Hadoop approaches things. Lets start with a simple moving average on a small sample of data in Excel. Well progress onto the same example in R and then well work our way toward a full blown MapReduce application in java (code included). Once we have our sample data working well with MapReduce, well calculate the simple moving average of all stocks on the NYSE from 1970 to the present in one pass without changing any code.
Simple Moving Average
A simple moving average is the series of unweighted averages in a subset of time series data points as a sliding window progresses over the time series data set. Each time the window is moved we recalculate the average of the points in the window. This produces a set of numbers representing the final moving average. Typically the moving average technique is used with time series to highlight longer term trends or smooth out short-term noise. Moving averages are similar to low pass filters in signal processing, and mathematically are considered a type of convolution.
In other terms, we take a window and fill it in a First In First Out (FIFO) manner with time series data points until we have N points in it. We then take the average of these points and add this to our answer list. We slide our window forward by M data points and again take the average of the data points in the window. This process is repeated until the window can no longer be filled at which point the calculation is complete. Now that we have a general idea of what we are looking at, lets take a look at a few ways to do a simple moving average.
In parts 2 and 3 of this blog series we’ll take the reader from simple moving average in Excel, through R, and then into a real example with code of simple moving average in MapReduce.