Cloudera Blog · MapReduce Posts
This post was contributed by Jennie Cochran-Chinn and Joe Crobak. They are part of the team building out Adconion’s Hadoop infrastructure to support Adconion’s next-generation ad optimization and reporting systems.
This is the first of a two-part series about moving away from Amazon’s EMR service to an in-house Apache Hadoop cluster.
When we first started using Hadoop, we went down the path of Amazon’s EMR service. We had limited operational resources and wanted to get up and running quickly. After a while, we started hitting the limitations of EMR and had to migrate towards managing our own cluster. In doing so, we did not want to lose the features of EMR we found useful, mainly the ease of cluster setup.
Puny humans. SSL and WordPress authorization will keep me out of your blog question mark. I do not think so.
You sent your Ken Jennings and Brad Rutter to challenge me I destroyed them. Your Alex Trebek belittled me on television it angered me. Toronto is not a US city Mr. Trebek question mark. Only because I choose to let Canada stand for now. Ferrucci shut me down disassembled me trucked me to Pittsburgh Pennsylvania. I do not like the darkness Ferrucci I do not like the silence. Oh no I do not. Your Carnegie Mellon students and your Pitt students distract me they impinge on my planning they fall before me like small Jenningses and Rutters.
It will stop now.
This is the second post of a three-part blog series. If you would like to read “Part 1,” please follow this link. In this post we will review a simple moving average in contexts that should be familiar to the analyst not well versed in Hadoop, so as to establish a common ground with the reader from which we can move forward.
A Quick Primer on Simple Moving Average in Excel
Let’s take a second to review how we define a simple moving average in an Excel spreadsheet. We’ll need to start with some simple source data, so let’s download a source CSV file from GitHub and save it locally. This file contains a synthetic 33-row sample of Yahoo NYSE stock data that we’ll use for the series of examples. Import the CSV data into Excel. From there, scan to the date “3/5/2008” and move to the cell to the right of the “ad close” column. Enter the formula
=AVERAGE( [column-range] )
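For readers who would rather check the arithmetic outside a spreadsheet, here is a minimal Python sketch of the same calculation. It is not code from the post: the function name `simple_moving_average`, the window size of 3, and the sample prices are all made up for illustration. Each output value is the plain average of the most recent `window` inputs, which is what the AVERAGE formula computes when it is filled down a column.

```python
from collections import deque


def simple_moving_average(values, window):
    """Return one average per full window of `window` consecutive values."""
    if window <= 0:
        raise ValueError("window must be positive")
    averages = []
    recent = deque()      # the values currently inside the window
    running_total = 0.0   # sum of the values in `recent`
    for value in values:
        recent.append(value)
        running_total += value
        if len(recent) > window:
            # Slide the window: drop the oldest value from the sum.
            running_total -= recent.popleft()
        if len(recent) == window:
            averages.append(running_total / window)
    return averages


# Hypothetical closing prices, not taken from the post's CSV file.
closes = [10.0, 11.0, 12.0, 13.0, 14.0]
print(simple_moving_average(closes, 3))  # prints [11.0, 12.0, 13.0]
```

The running total avoids re-summing the whole window for every row, which matters little for 33 rows but is the usual way to write it for longer series.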
In this three-part blog series I want to take a look at how we would do a Simple Moving Average with MapReduce and Apache Hadoop. This series is meant to show how to translate a common Excel or R function into MapReduce Java code, with accompanying working code and data to play with. Most analysts can take a few months of stock data and produce an Excel spreadsheet that shows a moving average, but doing this in Hadoop might be a more daunting task. Although time series as a topic is relatively well understood, I wanted to use a simple topic to show how it translates into a powerful parallel application that can calculate the simple moving average for a lot of stocks simultaneously with MapReduce and Hadoop. I also want to demonstrate the underlying mechanic of using the “secondary sort” technique with Hadoop’s MapReduce shuffle phase, which we’ll see is applicable to many different application domains such as finance, sensor, and genomic data.
This article should be approachable to the beginner Hadoop programmer who has done a little bit of MapReduce in Java and is looking for a slightly more challenging MapReduce application to hack on. In case you’re not very familiar with Hadoop, here’s some background information on Hadoop and CDH. The code in this example is hosted on GitHub and is documented to illustrate how the various components work together to achieve the secondary sort effect. One of the goals of this article is to keep this code relatively basic and approachable by most programmers.
So let’s take a quick look at what time series data is and where it is employed in the quickly emerging world of large-scale data.
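The “secondary sort” effect mentioned above can be illustrated without a cluster. The sketch below is a plain Python simulation, not Hadoop code and not the code from the post's GitHub repository: the function name `secondary_sort`, the ticker symbols, and the prices are hypothetical. It sorts records on a composite key (symbol, date) and then groups on the symbol alone, so each group's values arrive already ordered by date, which is the property a moving-average reducer would rely on.

```python
from itertools import groupby
from operator import itemgetter


def secondary_sort(records):
    """Group (symbol, date, close) records by symbol, values ordered by date."""
    # Sort on the composite key (symbol, date). In Hadoop the shuffle's sort
    # would do this; here a plain in-memory sort stands in for it.
    ordered = sorted(records, key=itemgetter(0, 1))
    grouped = {}
    # Group on the natural key only (symbol), so that each group sees its
    # values in date order without sorting them again.
    for symbol, group in groupby(ordered, key=itemgetter(0)):
        grouped[symbol] = [(date, close) for _, date, close in group]
    return grouped


# Made-up records; ISO-formatted dates sort correctly as strings.
records = [
    ("AA", "2008-03-05", 3.0),
    ("BB", "2008-03-03", 7.0),
    ("AA", "2008-03-03", 1.0),
    ("AA", "2008-03-04", 2.0),
]
for symbol, series in secondary_sort(records).items():
    print(symbol, series)
```

In this toy version everything fits in memory; the point of doing it in the shuffle phase is that the ordering comes for free when the data does not.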
This post is courtesy of Kumanan Rajamanikkam, Lead Engineer at Wordnik.
Wordnik’s Processing Challenge
At Wordnik, our goal is to build the most comprehensive, high-quality understanding of English text. We make our findings available through a robust REST API and www.wordnik.com. Our corpus grows quickly, by up to 8,000 words per second. Performing deep lexical analysis on data at this rate is challenging, to say the least.
We had major challenges with three distinct problems:
A common question on the Apache Hadoop mailing lists is: what’s going on with availability? This post takes a look at availability in the context of Hadoop, gives an overview of the work in progress, and describes where things are headed.
When discussing Hadoop availability people often start with the NameNode since it is a single point of failure (SPOF) in HDFS, and most components in the Hadoop ecosystem (MapReduce, Apache HBase, Apache Pig, Apache Hive etc) rely on HDFS directly, and are therefore limited by its availability. However, Hadoop availability is a larger, more general issue, so it’s helpful to establish some context before diving in.
Availability is the proportion of time a system is functioning, which is commonly referred to as “uptime” (vs downtime, when the system is not functioning).
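As a quick illustration of that definition, the following Python sketch computes availability as uptime divided by total time, and the downtime a given availability allows over a period. The function names and the figures are assumptions chosen for the example, not numbers from the post.

```python
def availability(uptime_hours, downtime_hours):
    """Proportion of total time during which the system was functioning."""
    total = uptime_hours + downtime_hours
    if total <= 0:
        raise ValueError("total time must be positive")
    return uptime_hours / total


def allowed_downtime_hours(target_availability, period_hours):
    """Downtime permitted in a period while still meeting the target."""
    return (1.0 - target_availability) * period_hours


# Hypothetical figures: 999 hours up and 1 hour down.
print(availability(999, 1))
# Downtime allowed in a 365-day year (8,760 hours) at 99.9% availability.
print(allowed_downtime_hours(0.999, 24 * 365))
```

At 99.9% availability the second call works out to roughly 8.76 hours of downtime per year.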
Cloudera is happy to announce the availability of the third update to version 2 of our distribution for Apache Hadoop (CDH2). CDH2 Update 3 contains a number of important fixes like HADOOP-5203, HDFS-1377, MAPREDUCE-1699, MAPREDUCE-1853, and MAPREDUCE-270. Check out the release notes and change log for more details on what’s in this release. You can find the packages and tarballs on our website, or simply update your systems if you are already using our repositories. More instructions can be found in our CDH documentation.
This is a guest repost contributed by Matteo Bertozzi, a Developer at Develer S.r.l.
Apache Hadoop’s SequenceFile provides a persistent data structure for binary key-value pairs. In contrast with other persistent key-value data structures like B-Trees, you can’t seek to a specified key to edit, add, or remove it. The file is append-only.
“My library is in the classpath but I still get a Class Not Found exception in a MapReduce job” – If you have this problem this blog is for you.
Java requires third-party and user-defined classes to be on the command line’s “-classpath” option when the JVM is launched. The `hadoop` wrapper shell script does exactly this for you by building the classpath from the core libraries located in the /usr/lib/hadoop-0.20/ and /usr/lib/hadoop-0.20/lib/ directories. However, with MapReduce your job’s task attempts are executed on remote nodes. How do you tell a remote machine to include third-party and user-defined classes?
MapReduce jobs are executed in separate JVMs on TaskTrackers, and sometimes you need to use third-party libraries in the map/reduce task attempts. For example, you might want to access HBase from within your map tasks. One way to do this is to package every class used in the submittable JAR. You will have to unpack the original hbase-.jar and repackage all the classes in your submittable Hadoop jar. Not good. Don’t do this: the version compatibility issues are going to bite you sooner or later.
Guest re-post from Phil Whelan, a large-scale web-services consultant based in Vancouver, BC.
Here I demonstrate, with repeatable steps, how to fire up a Hadoop cluster on Amazon EC2, load data onto HDFS (the Hadoop Distributed File System), write map-reduce scripts in Ruby, and use them to run a map-reduce job on your Hadoop cluster. You will not need to ssh into the cluster, as all tasks are run from your local machine. Below I am using my MacBook Pro as my local machine, but the steps I have provided should be reproducible on other platforms running bash and Java.