Cloudera Developer Blog · Use Case Posts
“Are data warehouses becoming victims of their own success?” Tony Baer asks in a recent blog post:
This week, the Cloudera Sessions head to Washington, DC, and Columbus, Ohio, where attendees will hear from AOL, Explorys, and Skybox Imaging about the ways Apache Hadoop can be used to optimize digital content, to improve the delivery of healthcare, and to generate high-resolution images of the entire globe that provide value to retailers, farmers, government organizations and more.
I’d like to take this opportunity to shine a spotlight on Skybox Imaging, an innovative company that is putting Hadoop to work to help us see the world more clearly, literally.
Skybox’s vice president of ground software, Ollie Guinan, recently posted a guest blog to Cloudera.com to give readers a glimpse into their Hadoop use case, which I’d like to promote again here. I would encourage anyone in the DC area to meet Ollie (who is also a Champion of Big Data) in person at the Cloudera Sessions event in DC this Tuesday to learn more about Skybox and its fascinating use case.
This week represents quite a milestone for Cloudera and, at least we’d like to believe, the Hadoop ecosystem at large: the general availability release of Cloudera Impala. Since we launched the Impala beta program last fall, I’ve been fortunate enough to work with many of the 40+ early adopters who’ve been testing this near-real-time SQL-on-Hadoop engine in an effort to learn about their use cases and keep tabs on early experiences with the tool.
Customers running Impala today span a variety of industries, from a large biotech company to an online travel provider, a digital advertiser, and a major financial institution, and each one has a unique use case for Impala. Stay tuned to learn more about their various use cases.
This week, I’d like to highlight Six3 Systems’ Wayne Wheeles (also a Champion of Big Data), who has been working with Impala to improve cyber security solutions, in particular the open source SherpaSurfing product.
As Cloudera’s keeper of customer stories, it’s dawned on me that others might benefit from the information I’ve spent the past year collecting: the many use cases and deployment patterns for Hadoop amongst our customer base.
This week I’d like to highlight Nokia, a global company that we’re all familiar with as a large mobile phone provider, and whose Senior Director of Analytics – Amy O’Connor – will be speaking at tomorrow’s Cloudera Sessions event in Boston.
Fun fact: Nokia has been in business for more than 150 years, starting with the production of paper in the 1800s. When I first met Amy O’Connor in early 2012, she explained to me that Nokia has always been in the business of transforming resources into useful products — from paper and rubber over a century ago, to the electronics and mobile devices we’re familiar with today.
The following guest post comes from Alejandro Caceres, president and CTO of Hyperion Gray LLC – a small research and development shop focusing on open-source software for cyber security.
Imagine this: You’re an informed citizen, active in local politics, and you decide you want to support your favorite local political candidate. You go to his or her new website and make a donation, providing your bank account information, name, address, and telephone number. Later, you find out that the website was hacked and your bank account and personal information stolen. You’re angry that your information wasn’t better protected — but at whom should your anger be directed?
Who is responsible for the generally weak condition of website security today? It can’t be website operators, because there’s no prerequisite to know about blind SQL injection attacks or validation filters before spinning up a website. It can’t be website developers either; we definitely don’t equip them to evaluate website security for themselves. It’s a pretty small community that focuses on web development and web security, and that community is pretty opaque.
This guest post is provided by Rohit Menon, Product Support and Development Specialist at Subex.
I am a software developer in Denver and have been working with C#, Java, and Ruby on Rails for the past six years. Writing code is a big part of my life, so I constantly keep an eye out for new advances, developments, and opportunities in the field, particularly those that promise to have a significant impact on software engineering and the industries that rely on it.
In my current role working on revenue assurance products in the telecom space for Subex, I have regularly heard from customers that their data is growing at tremendous rates and becoming increasingly difficult to process, often forcing them to portion out data into smaller, more manageable subsets. The more I heard about this problem, the more I realized that the current approach is not a solution, but an opportunity, since companies could clearly benefit from more affordable and flexible ways to store data. Better query capability on larger data sets at any given time also seemed key to deriving the rich, valuable information that helps drive business. Ultimately, I was hoping to find a platform on which my customers could process all their data whenever they needed to. As I delved into this Big Data problem of managing and analyzing at mega-scale, it did not take long before I discovered Apache Hadoop.
Mission: Hands-On Hadoop
My initial reading about Hadoop on the various blogs and forums had me convinced that it is easily one of the best tools out there for handling and processing large volumes of data. At first, I thought I’d be able to learn Hadoop on my own by reading Hadoop: The Definitive Guide and the Hadoop Tutorial from Yahoo! However, after only a few days of reading, it became clear that I would benefit greatly from direct interaction with Hadoop experts, supervised experimentation, and interaction with practical examples of Hadoop challenges from the field.
Now that Apache Hadoop is seven years old, use-case patterns for Big Data have emerged. In this post, I’m going to describe the three main ones (reflected in the post’s title) that we see across Cloudera’s growing customer base.
Transformations (T, for short) are a fundamental part of BI systems: They are the process through which data is converted from a source format (which can be relational or otherwise) into a relational data model that can be queried via BI tools.
In the late 1980s, the first BI data stacks started to materialize, and they typically looked like Figure 1.
Because raising the visibility of Apache Hadoop use cases is so important, in this post we bring you a re-posted story about how and why Rapleaf, a marketing data company based in San Francisco, uses Cloudera Enterprise (CDH and Cloudera Manager).
Founded in 2006, Rapleaf’s mission is to make it incredibly easy for marketers to access the data they need so they can personalize content for their customers. Rapleaf helps clients “fill in the blanks” about their customers by taking contact lists and, in real time, providing supplemental data points, statistics and aggregate charts and graphs that are guaranteed to have greater than 90% accuracy. Rapleaf is powered by Cloudera.
Business Challenges Before Cloudera
Rapleaf established itself as a data-driven business early on, collecting feeds from numerous sources to create a single, accurate view of each customer. By 2008, “we were processing data in a complex pipeline that involved an organic structure of many MySQL instances and queues,” explained Rapleaf’s co-founder and vice president of engineering, Jeremy Lizt. “As data volumes increased, that structure became unmanageable and expensive. It started getting difficult to perform the kinds of operations that we wanted to be able to do. It was no secret that this wasn’t going to scale.”
This is the first post in a series that will get you going on how to write, compile, and run a simple MapReduce job on Apache Hadoop. The full code, along with tests, is available at http://github.com/cloudera/mapreduce-tutorial. The program will run on either MR1 or MR2.
We’ll assume that you have a running Hadoop installation, either locally or on a cluster, and your environment is set up correctly so that typing “hadoop” into your command line gives you some notes on usage. Detailed instructions for installing CDH, Cloudera’s open-source, enterprise-ready distro of Hadoop and related projects, are available here: https://ccp.cloudera.com/display/CDH4DOC/CDH4+Installation. We’ll also assume you have Maven installed on your system, as this will make compiling your code easier. Note that Maven is not a strict dependency; we could also compile using Java on the command line or with an IDE like Eclipse.
The Use Case
There’s been a lot of brawling on our pirate ship recently. Not so rarely, one of the mates will punch another one in the mouth, knocking a tooth out onto the deck. Our poor sailors will wake up the next day with an empty bottle of rum, wondering who’s responsible for the gap between their teeth. All this violence has gotten out of hand, so as a deterrent, we’d like to provide everyone with a list of everyone that’s ever left them with a gap. Luckily, we’ve been able to set up a Flume source so that every time someone punches someone else, it gets written out as a line in a big log file in Hadoop. To turn this data into these lists, we need a MapReduce job that can 1) invert the mapping from attacker to their victim, 2) group by victims, and 3) eliminate duplicates.
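To make the three steps concrete, here is a minimal local sketch in plain Python rather than the tutorial's actual Hadoop code. It assumes each log line holds an attacker name and a victim name separated by whitespace; that record format, the pirate names, and the function names (`map_punch`, `reduce_attackers`, `run_job`) are all hypothetical. The dictionary in `run_job` stands in for the shuffle that Hadoop would perform between the map and reduce phases.

```python
from collections import defaultdict


def map_punch(line):
    """Map step: invert one 'attacker victim' record into (victim, attacker)."""
    attacker, victim = line.split()  # assumed format: two whitespace-separated names
    yield victim, attacker


def reduce_attackers(victim, attackers):
    """Reduce step: eliminate duplicate attackers for a single victim."""
    return victim, sorted(set(attackers))


def run_job(lines):
    """Stand-in for the shuffle: group mapper output by victim, then reduce each group."""
    groups = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        for victim, attacker in map_punch(line):
            groups[victim].append(attacker)
    return dict(reduce_attackers(v, a) for v, a in sorted(groups.items()))


# Hypothetical sample log: Blackbeard punches Smee twice, Hook punches Smee once,
# and Smee punches Hook once.
log = [
    "Blackbeard Smee",
    "Hook Smee",
    "Blackbeard Smee",
    "Smee Hook",
]
result = run_job(log)
print(result)
```

In this sketch the mapper handles the inversion, the grouping happens by key between the two phases, and the reducer only has to remove duplicates, which mirrors how the work would typically be split in a MapReduce job.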
The Input Data
At Cloudera, we take great pride in drinking our own champagne. That pride extends to our support team, in particular.
Cloudera Manager, our end-to-end management platform for CDH (Cloudera’s open-source, enterprise-ready distribution of Apache Hadoop and related projects), has a feature that allows subscription customers to send a snapshot of their cluster to us. When these cluster snapshots come to us from customers, they end up in a CDH cluster at Cloudera where various forms of data processing and aggregation can be performed.
Today, the system provides real-time support via an application we call Cloudera Support Interface (CSI). When a support employee looks at a ticket, they can use CSI to examine the customer’s latest snapshot and see cluster stats such as version information, number of nodes in service, which services are used, and so on. CSI also visualizes different aggregations and groupings, such as versions, which allows us to detect misconfigured clusters, or issues caused during upgrade or installation.
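As a rough illustration of the kind of grouping described above, the following Python sketch buckets cluster snapshots by version. It is not CSI's implementation: the snapshot fields (`cluster`, `version`, `nodes`), the sample values, and the function name `clusters_by_version` are assumptions made up for the example.

```python
from collections import defaultdict


def clusters_by_version(snapshots):
    """Group cluster names by the version reported in each snapshot."""
    grouped = defaultdict(list)
    for snap in snapshots:
        grouped[snap["version"]].append(snap["cluster"])
    return {version: sorted(names) for version, names in grouped.items()}


# Hypothetical snapshot records; real snapshots would carry far more detail.
snapshots = [
    {"cluster": "alpha", "version": "4.1.2", "nodes": 20},
    {"cluster": "bravo", "version": "4.1.2", "nodes": 8},
    {"cluster": "charlie", "version": "4.0.1", "nodes": 50},
]
by_version = clusters_by_version(snapshots)
print(by_version)
```

A view like this makes it easy to see which clusters share a version, which is one way an outlier (for example, a cluster left behind during an upgrade) could stand out.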