Simple Moving Average, Secondary Sort, and MapReduce (Part 2)
- by Josh Patterson
- March 16, 2011
- 2 comments
This is the second post of a three part blog series. If you would like to read “Part 1,” please follow this link. In this post we will be reviewing a simple moving average in contexts that should be familiar to the analyst not well versed in Hadoop as to establish a common ground with the reader from which we can move forward.
A Quick Primer on Simple Moving Average in Excel
Let’s take a second to do a quick review of how we define simple moving average in an Excel spreadsheet. We’ll need to start with some simple source data, so let’s download a source csv file from github and save it locally. This file contains a synthetic 33 row sample of Yahoo NYSE stock data that we’ll use for the series of examples. Import the csv data into Excel. From there, scan to the date “3/5/2008” and move to the cell to the right of the “ad close” column. Enter the formula
=AVERAGE( [column-range] )
where [column-range] is all of the columns from that date to 29 days prior. Now copy this formula for the next two rows, dates “3/4/2008” and “3/3/2008”.

You should have the values “35.396”, “34.5293”, and “33.5293” which represent the 30 day moving averages for this synthetic yahoo stock data.
Now that we’ve established a basic example in Excel let’s take a look at how we do Simple Moving Average in R.
A Quick Primer on Simple Moving Average in R
Another common tool in the time series domain, especially the financial sector, is the R programming language. R is:
- A programming language and software environment for statistical computing and graphics.
- A de facto standard among statisticians for statistical software development and data analysis.
- An implementation of the S programming language combined with lexical scoping semantics inspired by Scheme.
- Currently developed by the R Development Core Team, but was originally developed by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand.
Download the R binary from [here] and install it locally (they support both linux and win32). Once installed, launch the R console and drop the “Packages” menu down, which is where we need to install the TTR package. Select a mirror and download this package. Now load this package by clicking on the “Packages” drop down and selecting “Load Package”. Find the TTR package that was just installed and select it. Next, download the synthetic stock data from my project on github which contains 33 lines of synthetic stock data to process. In order to load this CSV data in R we need to set our working directory by clicking on the menu item “File” and then “Change directory”.
Quick tip: at any time the user can type the name of the variable and hit Enter to display the contents of the variable. Now that we have all the prep out of the way, let’s write the simple moving average in R:
stock_data <- read.csv(file="yahoo_stock_AA_32_mini.csv",head=TRUE,sep=",")
sorted_stock_data <- stock_data[order(stock_data$date) , ]
sma <- SMA(sorted_stock_data[,"adj.close"], 30)
To check that our stock data is indeed loaded, we can type the name of the variable, here “sorted_stock_data”, and hit enter which will produce:
> sorted_stock_data
>exchange stock_symbol date open high low close volume adj.close
32 NYSE AA 2008-02-03 38.85 39.28 38.26 38.37 11279900 8.37
31 NYSE AA 2008-02-04 37.01 37.90 36.13 36.60 17752400 10.60
30 NYSE AA 2008-02-05 31.16 31.89 30.55 30.69 17567800 30.53
29 NYSE AA 2008-02-06 30.27 31.52 30.06 31.47 8445100 31.31
28 NYSE AA 2008-02-07 31.73 33.13 31.57 32.66 14338500 32.49
27 NYSE AA 2008-02-08 32.58 33.42 32.11 32.70 10241400 32.53
26 NYSE AA 2008-02-09 32.13 33.34 31.95 33.09 9200400 32.92
25 NYSE AA 2008-02-10 33.67 34.45 33.07 34.28 15186100 34.10
24 NYSE AA 2008-02-11 34.57 34.85 33.98 34.08 9528000 33.90
23 NYSE AA 2008-02-12 33.30 33.64 32.52 32.67 11338000 32.50
22 NYSE AA 2008-02-13 32.95 33.37 32.26 32.41 7230300 32.41
21 NYSE AA 2008-02-14 32.24 33.25 31.90 32.78 9058900 32.78
20 NYSE AA 2008-02-15 32.67 33.81 32.37 33.76 10731400 33.76
19 NYSE AA 2008-02-16 33.82 34.25 33.29 34.06 11249800 34.06
18 NYSE AA 2008-02-17 34.33 34.64 33.26 33.49 12418900 33.49
17 NYSE AA 2008-02-18 33.75 35.52 33.63 35.51 21082100 35.51
16 NYSE AA 2008-02-19 36.01 36.43 35.05 35.36 18238800 35.36
15 NYSE AA 2008-02-20 35.16 35.94 35.12 35.72 14082200 35.72
14 NYSE AA 2008-02-21 36.19 36.73 35.84 36.20 12825300 36.20
13 NYSE AA 2008-02-22 35.96 36.85 35.51 36.83 10906600 36.83
12 NYSE AA 2008-02-23 36.88 37.41 36.25 36.30 13078200 36.30
11 NYSE AA 2008-02-24 36.38 36.64 35.58 36.55 12834300 36.55
10 NYSE AA 2008-02-25 36.64 38.95 36.48 38.85 22500100 38.85
9 NYSE AA 2008-02-26 38.59 39.25 38.08 38.50 14417700 38.50
8 NYSE AA 2008-02-27 38.19 39.62 37.75 39.02 14296300 39.02
7 NYSE AA 2008-02-28 38.61 39.29 38.19 39.12 11421700 39.12
6 NYSE AA 2008-02-29 38.77 38.82 36.94 37.14 22611400 37.14
5 NYSE AA 2008-03-01 37.17 38.46 37.13 38.32 13964700 38.32
4 NYSE AA 2008-03-02 37.90 38.94 37.10 38.00 15715600 38.00
3 NYSE AA 2008-03-03 38.25 39.15 38.10 38.71 11754600 38.71
2 NYSE AA 2008-03-04 38.85 39.28 38.26 38.37 11279900 38.37
1 NYSE AA 2008-03-05 37.01 37.90 36.13 36.60 17752400 36.60
The above code should produce our simple moving average, which we can view by typing the name of the variable “sma” to produce the following result:
> sma
[1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
[21] NA NA NA NA NA NA NA NA NA 33.52933 34.52933 35.39600
Given that before the 30th day there is not enough data to produce a simple moving average based on our set parameter, the “NA” entries are produced. These values also match the values in our Excel spreadsheet.
R also has an interesting project, called RHIPE, which runs R code on Hadoop clusters. To take a look at RHIPE please visit their site.
So we’ve taken a look at what a simple moving average is and how we’d produce it in Excel and R. Both of these examples involved a token amount of data that is interesting but not terribly useful in today’s high-density time series problem domains. As your data set begins to scale up beyond a single disk worth of space, Hadoop becomes more practical.
The final portion of this three part blog series will explain how to use Hadoop’s MapReduce to calculate a Simple Moving Average. Then once you have applied the sample code to find a Simple Moving Average of the small example data set, we will move on to use this same code to parse over thirty years worth of all daily stock closing prices.
-
Chris /
July 12, 2011 / 1:05 PM
Great tutorial, when are you going to publish part three?
-
Jon Zuanich /
July 12, 2011 / 1:34 PM
It is out my friend: http://blog.cloudera.com/blog/2011/04/simple-moving-average-secondary-sort-and-mapreduce-part-3/
Filed under
Share this post