Cloudera Developer Blog · Use Case Posts
The following guest post comes from Alejandro Caceres, president and CTO of Hyperion Gray LLC – a small research and development shop focusing on open-source software for cyber security.
Imagine this: You’re an informed citizen, active in local politics, and you decide you want to support your favorite local political candidate. You go to his or her new website and make a donation, providing your bank account information, name, address, and telephone number. Later, you find out that the website was hacked and your bank account and personal information stolen. You’re angry that your information wasn’t better protected — but at whom should your anger be directed?
This guest post is provided by Rohit Menon, Product Support and Development Specialist at Subex.
I am a software developer in Denver and have been working with C#, Java, and Ruby on Rails for the past six years. Writing code is a big part of my life, so I constantly keep an eye out for new advances, developments, and opportunities in the field, particularly those that promise to have a significant impact on software engineering and the industries that rely on it.
In my current role working on revenue assurance products in the telecom space for Subex, I have regularly heard from customers that their data is growing at tremendous rates and becoming increasingly difficult to process, often forcing them to portion out data into smaller, more manageable subsets. The more I heard about this problem, the more I realized that the current approach is not a solution but an opportunity, since companies could clearly benefit from more affordable and flexible ways to store data. Better query capability over larger data sets at any given time also seemed key to deriving the rich, valuable information that helps drive business. Ultimately, I was hoping to find a platform on which my customers could process all their data whenever they needed to. As I delved into this Big Data problem of managing and analyzing at mega-scale, it did not take long before I discovered Apache Hadoop.
Mission: Hands-On Hadoop
Now that Apache Hadoop is seven years old, use-case patterns for Big Data have emerged. In this post, I’m going to describe the three main ones (reflected in the post’s title) that we see across Cloudera’s growing customer base.
Transformations (T, for short) are a fundamental part of BI systems: They are the process through which data is converted from a source format (which can be relational or otherwise) into a relational data model that can be queried via BI tools.
Because raising the visibility of Apache Hadoop use cases is so important, in this post we bring you a re-posted story about how and why Rapleaf, a marketing data company based in San Francisco, uses Cloudera Enterprise (CDH and Cloudera Manager).
Founded in 2006, Rapleaf’s mission is to make it incredibly easy for marketers to access the data they need so they can personalize content for their customers. Rapleaf helps clients “fill in the blanks” about their customers by taking contact lists and, in real time, providing supplemental data points, statistics and aggregate charts and graphs that are guaranteed to have greater than 90% accuracy. Rapleaf is powered by Cloudera.
Business Challenges Before Cloudera
This is the first post in a series that will get you going on how to write, compile, and run a simple MapReduce job on Apache Hadoop. The full code, along with tests, is available at http://github.com/cloudera/mapreduce-tutorial. The program will run on either MR1 or MR2.
We’ll assume that you have a running Hadoop installation, either locally or on a cluster, and your environment is set up correctly so that typing “hadoop” into your command line gives you some notes on usage. Detailed instructions for installing CDH, Cloudera’s open-source, enterprise-ready distro of Hadoop and related projects, are available here: https://ccp.cloudera.com/display/CDH4DOC/CDH4+Installation. We’ll also assume you have Maven installed on your system, as this will make compiling your code easier. Note that Maven is not a strict dependency; we could also compile using Java on the command line or with an IDE like Eclipse.
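To preview the kind of program the series builds toward, here is the core logic of the canonical word-count example, stripped of the Hadoop API so it runs as plain Java. The class and method names are illustrative and not taken from the tutorial repository; in the real job this logic is split across a Mapper and a Reducer class.

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java sketch of word count: the "map" step tokenizes each line
// into (word, 1) pairs, and the "reduce" step sums the counts per word.
// Here the two steps are collapsed into a single pass for simplicity.
public class WordCountSketch {
    public static Map<String, Integer> countWords(String[] lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            // Map phase: split on whitespace, emit each word with a count of 1
            for (String word : line.toLowerCase().split("\\s+")) {
                if (word.isEmpty()) continue;
                // Reduce phase: sum the emitted 1s for each distinct word
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countWords(new String[] {"hello hadoop", "hello world"}));
    }
}
```

In the Hadoop version, the framework handles splitting the input, shuffling the intermediate (word, 1) pairs to reducers, and writing the output; only the per-record logic above is yours to write.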
The Use Case
At Cloudera, we take great pride in drinking our own champagne. That pride extends to our support team in particular.
Cloudera Manager, our end-to-end management platform for CDH (Cloudera’s open-source, enterprise-ready distribution of Apache Hadoop and related projects), has a feature that allows subscription customers to send a snapshot of their cluster to us. When these cluster snapshots come to us from customers, they end up in a CDH cluster at Cloudera where various forms of data processing and aggregation can be performed.
The following is a re-post from CTOVision.com.
The Government Big Data Solutions Award was established to highlight innovative solutions and facilitate the exchange of best practices, lessons learned and creative ideas for addressing Big Data challenges. The Top Five Nominees of 2012 were chosen for criteria that included:
The following is a re-post from Bob Gourley of CTOVision.com.
The amount of data being created in governments is growing faster than humans can analyze it. But analysis can solve tough challenges. Those two facts are driving the continual pursuit of new Big Data solutions. Big Data solutions are of particular importance in government. The government has special abilities to focus research in areas like Health Sciences, Economics, Law Enforcement, Defense, Geographic Studies, Environmental Studies, Bioinformatics, and Computer Security. Each of those areas can be well served by Big Data approaches, and each has exemplars of solutions worthy of highlighting to the community.
This is the third article in a series about analyzing Twitter data using some of the components of the Apache Hadoop ecosystem that are available in CDH (Cloudera’s open-source distribution of Apache Hadoop and related projects). If you’re looking for an introduction to the application and a high-level view, check out the first article in the series.
In the previous article in this series, we saw how Flume can be utilized to ingest data into Hadoop. However, that data is useless without some way to analyze it. Personally, I come from the relational world, and SQL is a language that I speak fluently. Apache Hive provides an interface that allows users to easily access data in Hadoop via SQL. Hive compiles SQL statements into MapReduce jobs, and then executes them across a Hadoop cluster.
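As a toy illustration of that compilation idea (not Hive's actual planner output), a query like `SELECT user, COUNT(*) FROM tweets GROUP BY user` translates naturally into a map phase that emits the grouping column as the key and a reduce phase that counts the values gathered under each key. The sketch below shows the two phases in plain Java; all names are hypothetical.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;

// Conceptual sketch of how a SQL GROUP BY/COUNT maps onto MapReduce.
public class HiveGroupBySketch {
    // Map phase: for each input row, emit (groupKey, 1)
    static List<Entry<String, Integer>> mapPhase(List<String> rows) {
        List<Entry<String, Integer>> emitted = new ArrayList<>();
        for (String row : rows) {
            emitted.add(new SimpleEntry<>(row, 1));
        }
        return emitted;
    }

    // Shuffle + reduce phase: group the emitted pairs by key and sum the values
    static Map<String, Integer> reducePhase(List<Entry<String, Integer>> emitted) {
        Map<String, Integer> counts = new HashMap<>();
        for (Entry<String, Integer> e : emitted) {
            counts.merge(e.getKey(), e.getValue(), Integer::sum);
        }
        return counts;
    }
}
```

The appeal of Hive is that it generates and schedules jobs of this shape for you, so an analyst can stay in SQL rather than writing Mapper and Reducer classes by hand.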
This is a guest post by Oliver Guinan, VP of Ground Software at Skybox Imaging. Oliver is a 15-year veteran of the internet industry and is responsible for all ground system design, architecture, and implementation at Skybox.
One of the great promises of the big data movement is using networks of ubiquitous sensors to deliver insights about the world around us. Skybox Imaging is attempting to do just that for millions of locations across our planet.