Cloudera Developer Blog · Use Case Posts
The following is a re-post from CTOVision.com.
The Government Big Data Solutions Award was established to highlight innovative solutions and facilitate the exchange of best practices, lessons learned and creative ideas for addressing Big Data challenges. The Top Five Nominees of 2012 were chosen for criteria that included:
The following is a re-post from Bob Gourley of CTOVision.com.
The amount of data being created in governments is growing faster than humans can analyze it. But analysis can solve tough challenges. Those two facts are driving the continual pursuit of new Big Data solutions. Big Data solutions are of particular importance in government. The government has special abilities to focus research in areas like Health Sciences, Economics, Law Enforcement, Defense, Geographic Studies, Environmental Studies, Bioinformatics, and Computer Security. Each of those areas can be well served by Big Data approaches, and each has exemplars of solutions worthy of highlighting to the community.
The Government Big Data Solutions Award was established to help highlight some of the best innovation in the federal space. The 2012 award process solicited nominations from across federal, state and local governments. Nominations were evaluated based on how well submissions addressed three key factors:
This is the third article in a series about analyzing Twitter data using some of the components of the Apache Hadoop ecosystem that are available in CDH (Cloudera’s open-source distribution of Apache Hadoop and related projects). If you’re looking for an introduction to the application and a high-level view, check out the first article in the series.
In the previous article in this series, we saw how Flume can be used to ingest data into Hadoop. However, that data is useless without some way to analyze it. Personally, I come from the relational world, and SQL is a language that I speak fluently. Apache Hive provides an interface that allows users to easily access data in Hadoop via SQL. Hive compiles SQL statements into MapReduce jobs, and then executes them across a Hadoop cluster.
In this article, we’ll learn more about Hive, its strengths and weaknesses, and why Hive is the right choice for analyzing tweets in this application.
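To make the idea of compiling SQL into MapReduce a little more concrete, here is a toy sketch in plain Python of how a simple aggregate query could be split into map, shuffle, and reduce phases. This is not Hive's actual implementation, and it runs in a single process rather than across a cluster. The `tweets` table, the `user` column, and every function name below are assumptions made up for illustration.

```python
from collections import defaultdict

# Query being imitated (illustrative only; table and column are hypothetical):
#   SELECT user, COUNT(*) FROM tweets GROUP BY user


def map_phase(rows):
    """Emit a (key, 1) pair per row, keyed by the GROUP BY column."""
    for row in rows:
        yield row["user"], 1


def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the values for each key, which yields COUNT(*) per group."""
    return {key: sum(values) for key, values in groups.items()}


def count_by_user(rows):
    """Run the three phases in order for the query above."""
    return reduce_phase(shuffle(map_phase(rows)))
```

Calling `count_by_user` on a list of dictionaries that each carry a `user` key returns a dictionary of per-user row counts. The point of the sketch is only that a declarative `GROUP BY` maps naturally onto keyed map output followed by a per-key reduction.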
This is a guest post by Oliver Guinan, VP Ground Software, at Skybox Imaging. Oliver is a 15-year veteran of the internet industry and is responsible for all ground system design, architecture and implementation at Skybox.
One of the great promises of the big data movement is using networks of ubiquitous sensors to deliver insights about the world around us. Skybox Imaging is attempting to do just that for millions of locations across our planet.
Skybox is developing a low-cost imaging satellite system and web-accessible big data processing platform that will capture video or images of any location on Earth within a couple of days. The low cost of each satellite opens the possibility of deploying tens of satellites which, when integrated together, have the potential to image any spot on Earth within an hour.
This is the second article in a series about analyzing Twitter data using some of the components of the Hadoop ecosystem available in CDH, Cloudera’s open-source distribution of Apache Hadoop and related projects. In the first article, you learned how to pull CDH components together into a single cohesive application, but to really appreciate the flexibility of each of these components, we need to dive deeper.
Every story has a beginning, and every data pipeline has a source. So, to build Hadoop applications, we need to get data from a source into HDFS.
Apache Flume is one way to bring data into HDFS using CDH. The Apache Flume website describes Flume as “a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.” At the most basic level, Flume enables applications to collect data from its origin and send it to a resting location, such as HDFS. At a slightly more detailed level, Flume achieves this goal by defining dataflows consisting of three primary structures: sources, channels and sinks. The pieces of data that flow through Flume are called events, and the processes that run the dataflow are called agents.
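The relationship between sources, channels, sinks, events, and agents can be sketched in a few lines of plain Python. This is a conceptual model only: it does not use Flume's API or configuration format, and every class name below (`Event`, `ListSource`, `MemoryChannel`, `ListSink`, `Agent`) is hypothetical and exists just to show how the pieces fit together.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Event:
    """One piece of data moving through the dataflow."""
    body: bytes
    headers: dict = field(default_factory=dict)


class ListSource:
    """Hypothetical source: turns each input line into an event."""

    def __init__(self, lines):
        self.lines = lines

    def events(self):
        for line in self.lines:
            yield Event(body=line.encode("utf-8"))


class MemoryChannel:
    """Hypothetical channel: a FIFO buffer between the source and the sink."""

    def __init__(self):
        self._queue = deque()

    def put(self, event):
        self._queue.append(event)

    def take(self):
        # Return the oldest buffered event, or None when the channel is empty.
        return self._queue.popleft() if self._queue else None


class ListSink:
    """Hypothetical sink: stands in for a resting location such as HDFS."""

    def __init__(self):
        self.written = []

    def write(self, event):
        self.written.append(event.body.decode("utf-8"))


class Agent:
    """The process that runs the dataflow: source -> channel -> sink."""

    def __init__(self, source, channel, sink):
        self.source = source
        self.channel = channel
        self.sink = sink

    def run(self):
        # Buffer every event from the source in the channel.
        for event in self.source.events():
            self.channel.put(event)
        # Drain the channel into the sink in arrival order.
        event = self.channel.take()
        while event is not None:
            self.sink.write(event)
            event = self.channel.take()


def run_demo():
    """Push two sample lines through one agent and return what the sink saw."""
    sink = ListSink()
    Agent(ListSource(["tweet one", "tweet two"]), MemoryChannel(), sink).run()
    return sink.written
```

The channel sits between the other two structures so that the source and the sink never talk to each other directly. In this sketch that decoupling is just a queue, and the agent fills it completely before draining it, which keeps the example short but is a simplification of how a running dataflow would interleave the two.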
We at Cloudera are tremendously excited by the power of data to effect large-scale change in the healthcare industry. Many of the projects that our data science team worked on in the past year originated as data-intensive problems in healthcare, such as analyzing adverse drug events and constructing case-control studies. Last summer, we announced that our Chief Scientist Jeff Hammerbacher would be collaborating with the Mt. Sinai School of Medicine to leverage large-scale data analysis with Apache Hadoop for the treatment and prevention of disease. And next week, it will be my great pleasure to host a panel of data scientists and researchers at the Strata Rx Conference (register with discount code SHARON for 25% off) to discuss the meaningful use of natural language processing in clinical care.
Of course, the cost-effective storage and analysis of massive quantities of text is one of Hadoop’s strengths, and Jimmy Lin’s book on text processing is an excellent way to learn how to think in MapReduce. But a close study of how the applications of natural language processing technology in healthcare have evolved over the last few years is instructive for anyone who wants to understand how to use data science in order to tackle seemingly intractable problems.
Lesson 1: Choose the Right Problem
- Collect a lot of dirty, unstructured data.
- Hire a data scientist.
This guest post is provided by Dan McClary, Principal Product Manager for Big Data and Hadoop at Oracle.
One of the constants in discussions around Big Data is the desire for richer analytics and models. However, for those who don’t have a deep background in statistics or machine learning, it can be difficult to know not only which techniques to apply, but also which data to apply them to. Moreover, how can we leverage the power of Apache Hadoop to effectively operationalize the model-building process? In this post we’re going to take a look at a simple approach for applying well-known machine learning techniques to our big datasets. We’ll use Pig and Hadoop to quickly parallelize a standalone machine-learning program written in Jython.
I’d like to predict the weather. Heck, we all would – there’s personal and business value in knowing the likelihood of sun, rain, or snow. Do I need an umbrella? Can I sell more umbrellas? Better yet, groups like the National Climatic Data Center offer public access to weather data stretching back to the 1930s. I’ve got a question I want to answer and some big data with which to do it. On first reaction, because I want to do machine learning on data stored in HDFS, I might be tempted to reach for a massively scalable machine learning library like Mahout.
What’s to love about Cloudera Enterprise? A lot! But rather than bury you in documentation today, we’d rather bring you a less-than-two-minute-long video:
Organizations in diverse industries have adopted Apache Hadoop-based systems for large-scale data processing. As a leading force in Hadoop development with customers in half of the Fortune 50 companies, Cloudera is in a unique position to characterize and compare real-life Hadoop workloads. Such insights are essential as developers, data scientists, and decision makers reflect on current use cases to anticipate technology trends.
Recently we collaborated with researchers at UC Berkeley to collect and analyze a set of Hadoop traces. These traces come from Cloudera customers in e-commerce, telecommunications, media, and retail (Table 1). Here I will explain a subset of the observations, and the thoughts they triggered about challenges and opportunities in the Hadoop ecosystem, both present and in the future.
Table 1. Summary of Hadoop workloads analyzed
Up to this point, we’ve described our reasons for using Hadoop and Hive on our neural recordings (Part I), the reasons why the analyses of these recordings are interesting from a scientific perspective, and detailed descriptions of our implementation of these analyses using Apache Hadoop and Apache Hive (Part II). The last part of this story cuts straight to the results and then discusses important lessons we learned along the way and future goals for improving the analysis framework we’ve built so far.
Here are two plots of the output data from our benchmark run. Both plots show the same data, one in three dimensions and the other in a two-dimensional density format.