Cloudera Developer Blog · Guest Posts
This was post was originally published by U.C. Berkeley AMPLab developer (and former Clouderan) Matt Massie, on his personal blog. Matt has graciously permitted us to re-publish here for your convenience.
Note: The post below is valid for Impala version 0.6 only and is not being maintained for subsequent releases. To deploy Impala 0.7 and later using a much easier (and also free) method, use this how-to.
Introduction: Training is Key
Apache Hadoop is extremely important to maximizing the value Syncsort’s technology delivers to our customers. That value promise starts with a solid foundation of knowledge and skills among key technical staff across the company.
Thanks to Stripe’s Colin Marc (@colinmarc) for the guest post below, and for his work on the world’s first Ruby client for Cloudera Impala!
Like most other companies, at Stripe it has become increasingly hard to answer the big and interesting questions as datasets get bigger. This is pretty insidious: the set of potential interesting questions also grows as you acquire more data. Answering questions like, “Which regions have the most developers per capita?” or “How do different countries compare in how they spend online?” might involve hours of scripting, waiting, and generally lots of lost developer time.
You may have seen the recent announcement from Skytap about the availability of pre-configured CDH4 templates in the Skytap Cloud public template library. So for anyone who wants to try out a Cloudera Hadoop cluster—from small to large—it can now be easily accomplished in Skytap Cloud. The how-to below from Skytap’s Matt Sousely explains how.
The goal of this how-to will be to spin up a 10-node Cloudera Hadoop cluster in Skytap Cloud. To begin, let’s talk about the two new Cloudera Hadoop cluster templates. The first is Cloudera CDH4 Hadoop cluster: a 2-node Hadoop cluster template. It includes 2 nodes and a management node/server. The second is the Cloudera CDH4 Hadoop host template. This second template is not intended to run by itself in a configuration—rather, it contains a host VM that is ready to become another Hadoop node in the Cloudera CDH4 Hadoop cluster template-based configuration.
Our thanks to guest author Jon Natkins (@nattyice) of WibiData for the following post!
Today, many (if not most) companies have ETL or data enrichment jobs that are executed on a regular basis as data becomes available. In this scenario it is important to minimize the lag time between data being created and being ready for analysis.
This guest post is provided by Dan McClary, Principal Product Manager for Big Data and Hadoop at Oracle.
One of the constants in discussions around Big Data is the desire for richer analytics and models. However, for those who don’t have a deep background in statistics or machine learning, it can be difficult to know not only just what techniques to apply, but on what data to apply them. Moreover, how can we leverage the power of Apache Hadoop to effectively operationalize the model-building process? In this post we’re going to take a look at a simple approach for applying well-known machine learning approaches to our big datasets. We’ll use Pig and Hadoop to quickly parallelize a standalone machine-learning program written in Jython.
This guest post comes to us courtesy of Gwen Shapira (@gwenshap), a database consultant for The Pythian Group (and an Oracle ACE Director).
Most western countries use street names and numbers to navigate inside cities. But in Japan, where I live now, very few streets have them.
This post was contributed by Bob Gourley, editor, CTOvision.com.
You are no doubt aware of the interesting situation we face with data today: The amount of data being created is growing faster than humans can analyze, but fast analysis over data can help humanity solve some very tough challenges. This fact is moving the globe towards new “Big Data” solutions.
Government use of Big Data is of particular note.
Up to this point, we’ve described our reasons for using Hadoop and Hive on our neural recordings (Part I), the reasons why the analyses of these recordings are interesting from a scientific perspective, and detailed descriptions of our implementation of these analyses using Apache Hadoop and Apache Hive (Part II). The last part of this story cuts straight to the results and then discusses important lessons we learned along the way and future goals for improving the analysis framework we’ve built so far.
Here are two plots of the output data from our benchmark run. Both plots show the same data, one in three dimensions and the other in a two-dimensional density format.
As mentioned in Part I, although Apache Hadoop and other Big Data technologies are typically applied to I/O intensive workloads, where parallel data channels dramatically increase I/O throughput, there is growing interest in applying these technologies to CPU intensive workloads. In this work, we used Hadoop and Hive to digitally signal process individual neuron voltage signals captured from electrodes embedded in the rat brain. Previously, this processing was performed on a single Matlab workstation, a workload that was both CPU intensive and data intensive, especially for intermediate output data. With Hadoop and Apache Hive, we were not only able to apply parallelism to the various processing steps, but had the additional benefit of having all the data online for additional ad hoc analysis. Here, we describe the technical details of our implementation, including the biological relevance of the neural signals and analysis parameters. In Part III, we will then describe the tradeoffs between the Matlab and Hadoop/Hive approach, performance results, and several issues identified with using Hadoop/Hive in this type of application.
For this work, we used a university Hadoop computing cluster. Note that it is blade-based, and is not an ideal configuration for Hadoop because of the limited number (2) of drive bays per node. It has these specifications: