Simple Moving Average, Secondary Sort, and MapReduce (Part 2)

This is the second post of a three part blog series. If you would like to read “Part 1,” please follow this link. In this post we will be reviewing a simple moving average in contexts that should be familiar to the analyst not well versed in Hadoop as to establish a common ground with the reader from which we can move forward.

A Quick Primer on Simple Moving Average in Excel

Let’s take a second to do a quick review of how we define simple moving average in an Excel spreadsheet. We’ll need to start with some simple source data, so let’s download a source csv file from github and save it locally. This file contains a synthetic 33 row sample of Yahoo NYSE stock data that we’ll use for the series of examples. Import the csv data into Excel. From there, scan to the date “3/5/2008” and move to the cell to the right of the “ad close” column. Enter the formula


where [column-range] is all of the columns from that date to 29 days prior. Now copy this formula for the next two rows, dates “3/4/2008” and “3/3/2008”.

SMA in Excel

You should have the values “35.396”, “34.5293”, and “33.5293” which represent the 30 day moving averages for this synthetic yahoo stock data.

Now that we’ve established a basic example in Excel let’s take a look at how we do Simple Moving Average in R.

A Quick Primer on Simple Moving Average in R

Another common tool in the time series domain, especially the financial sector, is the R programming language. R is:

  • A programming language and software environment for statistical computing and graphics.
  • A de facto standard among statisticians for statistical software development and data analysis.
  • An implementation of the S programming language combined with lexical scoping semantics inspired by Scheme.
  • Currently developed by the R Development Core Team, but was originally developed by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand.

Download the R binary from [here] and install it locally (they support both linux and win32). Once installed, launch the R console and drop the “Packages” menu down, which is where we need to install the TTR package. Select a mirror and download this package. Now load this package by clicking on the “Packages” drop down and selecting “Load Package”. Find the TTR package that was just installed and select it. Next, download the synthetic stock data from my project on github which contains 33 lines of synthetic stock data to process. In order to load this CSV data in R we need to set our working directory by clicking on the menu item “File” and then “Change directory”.

Quick tip: at any time the user can type the name of the variable and hit Enter to display the contents of the variable. Now that we have all the prep out of the way, let’s write the simple moving average in R:


To check that our stock data is indeed loaded, we can type the name of the variable, here “sorted_stock_data”, and hit enter which will produce:


>exchange stock_symbol       date  open  high   low close   volume adj.close
32     NYSE           AA 2008-02-03 38.85 39.28 38.26 38.37 11279900      8.37
31     NYSE           AA 2008-02-04 37.01 37.90 36.13 36.60 17752400     10.60
30     NYSE           AA 2008-02-05 31.16 31.89 30.55 30.69 17567800     30.53
29     NYSE           AA 2008-02-06 30.27 31.52 30.06 31.47  8445100     31.31
28     NYSE           AA 2008-02-07 31.73 33.13 31.57 32.66 14338500     32.49
27     NYSE           AA 2008-02-08 32.58 33.42 32.11 32.70 10241400     32.53
26     NYSE           AA 2008-02-09 32.13 33.34 31.95 33.09  9200400     32.92
25     NYSE           AA 2008-02-10 33.67 34.45 33.07 34.28 15186100     34.10
24     NYSE           AA 2008-02-11 34.57 34.85 33.98 34.08  9528000     33.90
23     NYSE           AA 2008-02-12 33.30 33.64 32.52 32.67 11338000     32.50
22     NYSE           AA 2008-02-13 32.95 33.37 32.26 32.41  7230300     32.41
21     NYSE           AA 2008-02-14 32.24 33.25 31.90 32.78  9058900     32.78
20     NYSE           AA 2008-02-15 32.67 33.81 32.37 33.76 10731400     33.76
19     NYSE           AA 2008-02-16 33.82 34.25 33.29 34.06 11249800     34.06
18     NYSE           AA 2008-02-17 34.33 34.64 33.26 33.49 12418900     33.49
17     NYSE           AA 2008-02-18 33.75 35.52 33.63 35.51 21082100     35.51
16     NYSE           AA 2008-02-19 36.01 36.43 35.05 35.36 18238800     35.36
15     NYSE           AA 2008-02-20 35.16 35.94 35.12 35.72 14082200     35.72
14     NYSE           AA 2008-02-21 36.19 36.73 35.84 36.20 12825300     36.20
13     NYSE           AA 2008-02-22 35.96 36.85 35.51 36.83 10906600     36.83
12     NYSE           AA 2008-02-23 36.88 37.41 36.25 36.30 13078200     36.30
11     NYSE           AA 2008-02-24 36.38 36.64 35.58 36.55 12834300     36.55
10     NYSE           AA 2008-02-25 36.64 38.95 36.48 38.85 22500100     38.85
9      NYSE           AA 2008-02-26 38.59 39.25 38.08 38.50 14417700     38.50
8      NYSE           AA 2008-02-27 38.19 39.62 37.75 39.02 14296300     39.02
7      NYSE           AA 2008-02-28 38.61 39.29 38.19 39.12 11421700     39.12
6      NYSE           AA 2008-02-29 38.77 38.82 36.94 37.14 22611400     37.14
5      NYSE           AA 2008-03-01 37.17 38.46 37.13 38.32 13964700     38.32
4      NYSE           AA 2008-03-02 37.90 38.94 37.10 38.00 15715600     38.00
3      NYSE           AA 2008-03-03 38.25 39.15 38.10 38.71 11754600     38.71
2      NYSE           AA 2008-03-04 38.85 39.28 38.26 38.37 11279900     38.37
1      NYSE           AA 2008-03-05 37.01 37.90 36.13 36.60 17752400     36.60

The above code should produce our simple moving average, which we can view by typing the name of the variable “sma” to produce the following result:

> sma

[1]       NA       NA       NA       NA       NA       NA       NA       NA       NA       NA       NA       NA       NA       NA       NA       NA       NA       NA       NA       NA

[21]       NA       NA       NA       NA       NA       NA       NA       NA       NA 33.52933 34.52933 35.39600

Given that before the 30th day there is not enough data to produce a simple moving average based on our set parameter, the “NA” entries are produced. These values also match the values in our Excel spreadsheet.

R also has an interesting project, called RHIPE, which runs R code on Hadoop clusters. To take a look at RHIPE please visit their site.

So we’ve taken a look at what a simple moving average is and how we’d produce it in Excel and R. Both of these examples involved a token amount of data that is interesting but not terribly useful in today’s high-density time series problem domains. As your data set begins to scale up beyond a single disk worth of space, Hadoop becomes more practical.

The final portion of this three part blog series will explain how to use Hadoop’s MapReduce to calculate a Simple Moving Average. Then once you have applied the sample code to find a Simple Moving Average of the small example data set, we will move on to use this same code to parse over thirty years worth of all daily stock closing prices.

Filed under:

2 Responses

Leave a comment


5 − = four