Cloudera Engineering Blog · Guest Posts
Our thanks to Kishore Gopalakrishna, staff engineer at LinkedIn and one of the original developers of Apache Helix (incubating), for the introduction below. Cloudera’s Patrick Hunt is a mentor for the project.
With the trend of exploding data growth and the systems in the NoSQL and Big Data space, the number of distributed systems has grown significantly. At LinkedIn, we have built a number of distributed systems over the years. Such systems run on a cluster of multiple servers and need to handle the problems that come with distributed systems. Fault tolerance – that is, availability in the presence of server failures and network problems — is critical to any such system. Horizontal scalability and seamless cluster expansion to handle increasing workloads are also essential properties.
The guest post below is provided by Justin Langseth, Founder & CEO of Zoomdata, Inc. Thanks, Justin!
What if you could affordably manage billions of rows of raw Big Data and let typical business people analyze it at the speed of thought in beautiful, interactive visuals? What if you could do all the above without worrying about structuring that data in a data warehouse schema, moving it, and pre-defining reports and dashboards? With the approach I’ll describe below, you can.
The following guest post is re-published here courtesy of Gerd König, a System Engineer with YMC AG. Thanks, Gerd!
Cloudera Manager is a great tool to orchestrate your CDH-based Apache Hadoop cluster. You can use it from cluster installation, deploying configurations, restarting daemons to monitoring each cluster component. Starting with version 4.6, the manager supports the integration of Cloudera Search, which is currently in Beta state. In this post I’ll show you the required steps to set up a Hadoop cluster via Cloudera Manager and how to integrate Cloudera Search.
The following guest post, from Mike Pittaro of Dell’s Cloud Software Solutions team, describes his team’s use of the Dell Crowbar tool in conjunction with the Cloudera Manager API to automate cluster provisioning. Thanks, Mike!
Deploying, managing, and operating Apache Hadoop clusters can be complex at all levels of the stack, from the hardware on up. To hide this complexity and reduce deployment time, since 2011, Dell has been using Dell Crowbar in conjunction with Cloudera Manager to deploy the Dell | Cloudera Solution for Apache Hadoop for joint customers.
We’re very happy to re-publish the following post from Twitter analytics infrastructure engineering manager Dmitriy Ryaboy (@squarecog).
Our thanks to Etsy developer Brad Greenlee (@bgreenlee) for the post below. We think his Mac OS app for JobTracker is great!
JobTracker.app is a Mac menu bar app interface to the Hadoop JobTracker. It provides Growl/Notification Center notices of starting, completed, and failed jobs and gives easy access to the detail pages of those jobs.
Our thanks to Brian Dirking, Director of Product Marketing for Alteryx, for the guest post below:
At Alteryx we are excited about the release of Cloudera Impala. The impact on Big Data Analytics is that the ability to perform real-time queries on Apache Hadoop will provide faster access and results. This is applicable to our customers, the business users who are running analytics to get access to data, perform analytics, and then follow up with new questions. Insight doesn’t happen all at once. The ability to query and refine quickly is ultimately what will lead business users to insight.
Our thanks to Ted Wasserman, product manager for Tableau, for the guest post below:
Many of our customers are turning to Apache Hadoop as they grapple with their big data challenges. Hadoop offers many benefits such as its scalability, economics, and versatility. Even so, adoption-to-date has largely centered around applications with “batch”-oriented workloads because of the latency imposed by the MapReduce framework. To increase Hadoop’s usefulness and adoption in the business intelligence space where users need fast, interactive response times when they ask a question, a new approach was needed.
Our thanks to Yves de Montcheuil, Vice President of Marketing for Talend, for the guest post below:
According to Wikipedia, the impala is a medium-sized African antelope; its name comes from the Zulu language meaning “gazelle”. Like elephants, it is found in savannas, and this may be the link with Hadoop. Impala is also the name of Cloudera’s SQL-on-Apache Hadoop project, launched in beta at Strata last October and just released in version 1.0.
A World-Class EDW Requires a World-Class Hadoop Team
Persado is the global leader in persuasion marketing technology, a new category in digital marketing. Our revolutionary technology maps the genome of marketing language and generates the messages that work best for any customer and any product at any time. To assure the highest quality experience for both our clients and end-users, our engineering team collaborates with Ph.D. statisticians and data analysts to develop new ways to segment audiences, discover content, and deliver the most relevant and effective marketing messages in real time.
Vagrant is a very nice tool for programmatically managing many virtual machines (VMs) on a single physical machine. It natively supports VirtualBox and also provides plugins for VMware Fusion and Amazon EC2, supporting the management of VMs in those environments as well.
As a follow-up to a previous post about the Impala demo he built during Data Hacking Day, Alan Gardner from Pythian has deployed the app for a limited time on Amazon EC2. We republish his original post below.
A little while ago I blogged about (and open sourced) a Cloudera Impala-powered soccer visualization demo, designed to demonstrate just how responsive Impala queries can be. Since not everyone has the time or resources to run the project themselves, we’ve decided to host it ourselves on an EC2 instance. [Note: instance live only for one week!] You can try the visualization; we’ve also opened up the Impala web interface, where you can see query profiles and performance numbers, and Hue (username and password are both ‘test’), where you can run your own queries on the dataset.
Deploying Impala on EC2
The following FAQ is provided by James Taylor of Salesforce, which recently open-sourced its Phoenix client-embedded JDBC driver for low-latency queries over HBase. Thanks, James!
What is this new Phoenix thing I’ve been hearing about?
Phoenix is an open source SQL skin for HBase. You use the standard JDBC APIs instead of the regular HBase client APIs to create tables, insert data, and query your HBase data.
The following guest post comes from Alejandro Caceres, president and CTO of Hyperion Gray LLC – a small research and development shop focusing on open-source software for cyber security.
Imagine this: You’re an informed citizen, active in local politics, and you decide you want to support your favorite local political candidate. You go to his or her new website and make a donation, providing your bank account information, name, address, and telephone number. Later, you find out that the website was hacked and your bank account and personal information stolen. You’re angry that your information wasn’t better protected — but at whom should your anger be directed?
The following guest post is provided by Aaron Kimball, CTO of WibiData.
The Kiji ecosystem has grown with the addition of a new module, KijiMR. The Kiji framework is a collection of components that offer developers a handle on building Big Data Applications. In addition to the first release, KijiSchema, we are now proud to announce the availability of a second component: KijiMR. KijiMR allows KijiSchema users to use MapReduce techniques including machine-learning algorithms and complex analytics to develop many kinds of applications using data in KijiSchema. Read on to learn more about the major features included in KijiMR and how you can use them.
This guest post is provided by Rohit Menon, Product Support and Development Specialist at Subex.
I am a software developer in Denver and have been working with C#, Java, and Ruby on Rails for the past six years. Writing code is a big part of my life, so I constantly keep an eye out for new advances, developments, and opportunities in the field, particularly those that promise to have a significant impact on software engineering and the industries that rely on it.
In my current role working on revenue assurance products in the telecom space for Subex, I have regularly heard from customers that their data is growing at tremendous rates and becoming increasingly difficulty to process, often forcing them to portion out data into small, more manageable subsets. The more I heard about this problem, the more I realized that the current approach is not a solution, but an opportunity, since companies could clearly benefit from more affordable and flexible ways to store data. Better query capability on larger data sets at any given time also seemed key to derive the rich, valuable information that helps drive business. Ultimately, I was hoping to find a platform on which my customers could process all their data whenever they needed to. As I delved into this Big Data problem of managing and analyzing at mega-scale, it did not take long before I discovered Apache Hadoop.
Mission: Hands-On Hadoop
This blog was originally published at blog.apache.org/pig and is republished here for your convenience by permission of its author, Pig Committer Dmitriy Ryaboy.
After months of work, we are happy to announce the 0.11 release of Apache Pig. In this blog post, we highlight some of the major new features and performance improvements that were contributed to this release. A large chunk of the new features was created by Google Summer of Code (GSoC) students with supervision from the Apache Pig PMC, while the core Pig team focused on performance improvements, usability issues, and bug fixes. We encourage CS students to consider applying for GSOC in 2013 – it’s a great way to contribute to open source software.
This was post was originally published by U.C. Berkeley AMPLab developer (and former Clouderan) Matt Massie, on his personal blog. Matt has graciously permitted us to re-publish here for your convenience.
Note: The post below is valid for Impala version 0.6 only and is not being maintained for subsequent releases. To deploy Impala 0.7 and later using a much easier (and also free) method, use this how-to.
Introduction: Training is Key
Apache Hadoop is extremely important to maximizing the value Syncsort’s technology delivers to our customers. That value promise starts with a solid foundation of knowledge and skills among key technical staff across the company.
Thanks to Stripe’s Colin Marc (@colinmarc) for the guest post below, and for his work on the world’s first Ruby client for Cloudera Impala!
Like most other companies, at Stripe it has become increasingly hard to answer the big and interesting questions as datasets get bigger. This is pretty insidious: the set of potential interesting questions also grows as you acquire more data. Answering questions like, “Which regions have the most developers per capita?” or “How do different countries compare in how they spend online?” might involve hours of scripting, waiting, and generally lots of lost developer time.
You may have seen the recent announcement from Skytap about the availability of pre-configured CDH4 templates in the Skytap Cloud public template library. So for anyone who wants to try out a Cloudera Hadoop cluster—from small to large—it can now be easily accomplished in Skytap Cloud. The how-to below from Skytap’s Matt Sousely explains how.
The goal of this how-to will be to spin up a 10-node Cloudera Hadoop cluster in Skytap Cloud. To begin, let’s talk about the two new Cloudera Hadoop cluster templates. The first is Cloudera CDH4 Hadoop cluster: a 2-node Hadoop cluster template. It includes 2 nodes and a management node/server. The second is the Cloudera CDH4 Hadoop host template. This second template is not intended to run by itself in a configuration—rather, it contains a host VM that is ready to become another Hadoop node in the Cloudera CDH4 Hadoop cluster template-based configuration.
Our thanks to guest author Jon Natkins (@nattyice) of WibiData for the following post!
Today, many (if not most) companies have ETL or data enrichment jobs that are executed on a regular basis as data becomes available. In this scenario it is important to minimize the lag time between data being created and being ready for analysis.
This guest post is provided by Dan McClary, Principal Product Manager for Big Data and Hadoop at Oracle.
One of the constants in discussions around Big Data is the desire for richer analytics and models. However, for those who don’t have a deep background in statistics or machine learning, it can be difficult to know not only just what techniques to apply, but on what data to apply them. Moreover, how can we leverage the power of Apache Hadoop to effectively operationalize the model-building process? In this post we’re going to take a look at a simple approach for applying well-known machine learning approaches to our big datasets. We’ll use Pig and Hadoop to quickly parallelize a standalone machine-learning program written in Jython.
This guest post comes to us courtesy of Gwen Shapira (@gwenshap), a database consultant for The Pythian Group (and an Oracle ACE Director).
Most western countries use street names and numbers to navigate inside cities. But in Japan, where I live now, very few streets have them.
This post was contributed by Bob Gourley, editor, CTOvision.com.
You are no doubt aware of the interesting situation we face with data today: The amount of data being created is growing faster than humans can analyze, but fast analysis over data can help humanity solve some very tough challenges. This fact is moving the globe towards new “Big Data” solutions.
Government use of Big Data is of particular note.
Up to this point, we’ve described our reasons for using Hadoop and Hive on our neural recordings (Part I), the reasons why the analyses of these recordings are interesting from a scientific perspective, and detailed descriptions of our implementation of these analyses using Apache Hadoop and Apache Hive (Part II). The last part of this story cuts straight to the results and then discusses important lessons we learned along the way and future goals for improving the analysis framework we’ve built so far.
Here are two plots of the output data from our benchmark run. Both plots show the same data, one in three dimensions and the other in a two-dimensional density format.
As mentioned in Part I, although Apache Hadoop and other Big Data technologies are typically applied to I/O intensive workloads, where parallel data channels dramatically increase I/O throughput, there is growing interest in applying these technologies to CPU intensive workloads. In this work, we used Hadoop and Hive to digitally signal process individual neuron voltage signals captured from electrodes embedded in the rat brain. Previously, this processing was performed on a single Matlab workstation, a workload that was both CPU intensive and data intensive, especially for intermediate output data. With Hadoop and Apache Hive, we were not only able to apply parallelism to the various processing steps, but had the additional benefit of having all the data online for additional ad hoc analysis. Here, we describe the technical details of our implementation, including the biological relevance of the neural signals and analysis parameters. In Part III, we will then describe the tradeoffs between the Matlab and Hadoop/Hive approach, performance results, and several issues identified with using Hadoop/Hive in this type of application.
For this work, we used a university Hadoop computing cluster. Note that it is blade-based, and is not an ideal configuration for Hadoop because of the limited number (2) of drive bays per node. It has these specifications:
In this three-part series of posts, we will share our experiences tackling a scientific computing challenge that may serve as a useful practical example for those readers considering Apache Hadoop and Apache Hive as an option to meet their growing technical and scientific computing needs. This first part describes some of the background behind our application and the advantages of Hadoop that make it an attractive framework in which to implement our solution. Part II dives into the technical details of the data we aimed to analyze and of our solution. Finally, we wrap up this series in Part III with a description of some of our main results, and most importantly perhaps, a list of things we learned along the way, as well as future possibilities for improvements.
About a year ago, after hearing increasing buzz about big data in general, and Hadoop in particular, I (Brad Rubin) saw an opportunity to learn more at our Twin Cities (Minnesota) Java User Group. Brock Noland, the local Cloudera representative, gave an introductory talk. I was really intrigued by the thought of leveraging commodity computing to tackle large-scale data processing. I teach several courses at the University of St. Thomas Graduate Programs in Software, including one in information retrieval. While I had taught the abstract principles behind the scale and performance solutions for indexing web-sized document collections, I saw an opportunity to integrate a real-world solution into the course.
This is a guest re-post from Datameer’s Director of Marketing, Rich Taylor. The original post can be found on the Datameer blog.
Datameer uses D3.js to power our Business Infographic™ designer. I thought I would show how we visualized the Apache Hadoop ecosystem connections. First using only D3.js, and second using Datameer 2.0.
This is a guest post by Assaf Yardeni, Head of R&D for Treato, an online social healthcare solution, headquartered in Israel.
Three years ago I joined Treato, a social healthcare analysis firm to help treato.com scale up to its present capability. Treato is a new source for healthcare information where health-related user generated content (UGC) from the Internet is aggregated and organized into usable insights for patients, physicians and other healthcare professionals. With oceans of patient-written health-related information available on the Web, and more being published each day, Treato needs to be able to collect and process vast amounts of data – Treato is Big Data par excellence, and my job has been to bring Treato to this stage.
Before the Hadoop era
This is a guest post contributed by Loren Siebert. Loren is a San Francisco entrepreneur and software developer, and is currently the technical lead for the USASearch program.
A year ago I rolled my first Apache Hadoop system into production. Since then, I’ve spoken to quite a few people who are eager to try Hadoop themselves in order to solve their own big data problems. Despite having similar backgrounds and data problems, few of these people have sunk their teeth into Hadoop. When I go to Hadoop Meetups in San Francisco, I often meet new people who are evaluating Hadoop and have yet to launch a cluster. Based on my own background and experience, I have some ideas on why this is the case.
This is a guest post from RichRelevance Principal Architect and Apache Avro PMC Chair Scott Carey.
In Early 2010 at RichRelevance, we were searching for a new way to store our long lived data that was compact, efficient, and maintainable over time. We had been using Hadoop for about a year, and started with the basics – text formats and SequenceFiles. Neither of these were sufficient. Text formats are not compact enough, and can be painful to maintain over time. A basic binary format may be more compact, but it has the same maintenance issues as text. Furthermore, we needed rich data types including lists and nested records.
This post was contributed by Bob Gourley, editor, CTOvision.com.
The missions and data of governments make the government sector one of particular importance for Big Data solutions. Federal, State and Local governments have special abilities to focus research in areas like Health Sciences, Economics, Law Enforcement, Defense, Geographic Studies, Environmental Studies, Bioinformatics, Information Search/Discovery, and Computer Security. Government-Industry teams are working to field Big Data solutions in all these fields.
Pero works on research and development in new technologies for online advertising at Aol Advertising R&D in Palo Alto. Over the past 4 years he has been the Chief Architect of R&D distributed ecosystem comprising more than thousand nodes in multiple data centers. He also led large-scale contextual analysis, segmentation and machine learning efforts at AOL, Yahoo and Cadence Design Systems and published patents and research papers in these areas.
A critical premise for success of online advertising networks is to successfully collect, organize, analyze and use large volumes of data for decision making. Given the nature of their online orientation and dynamics, it is critical that these processes be automated to the largest extent possible.
This post was contributed by Jennie Cochran-Chinn and Joe Crobak. They are part of the team building out Adconion‘s Hadoop infrastructure to support Adconion’s next-generation ad optimization and reporting systems.
This is the first of a two part series about moving away from Amazon’s EMR service to an in-house Apache Hadoop cluster.
This post was contributed by The Global Biodiversity Information Facility development team.
The first task is to ensure that your system is up-to-date.
This procedure has been tested on the following configuration:
Klout’s goal is to be the standard for influence. The advent of social media has created a huge number of measurable relationships. On Facebook, people have an average of 130 friends. On Twitter, the average number of followers range from 300+ to 1000+. With each relationship comes a different source of data. This has created A LOT of noise and an attention economy. Influence has the power to drive this attention.
When a company, brand, or person creates content, our goal is to measure the actions on that content. We want to measure every view, click, like, share, comment, retweet, mention, vote, check-in, recommendation, and so on. We want to know how influential the person who *acted* on that content is. We want to know the actual meaning of that content. And we want to know all of this, over time.
I recently evaluated several serialization frameworks including Thrift, Protocol Buffersand Avro for a solution to address our needs as a demand side platform, but also for a protocol framework to use for the OpenRTB marketplace as well. The working draft of OpenRTB 2.0 uses simple JSON encoding, which has many advantages including simplicity and ubiquity of support. Many OpenRTB contributors requested we support at least one binary standard as well, to improve bandwidth usage and CPU processing time for real-time bidding at scale.
This is a guest post from Mike Segel, an attendee of Chicago Data Summit.
Earlier this week, Cloudera hosted their first ‘Chicago Data Summit’. I’m flattered that Cloudera asked me to write up a short blog about the event, however as one of the organizers of CHUG (Chicagao area Hadoop User Group), I’m afraid I’m a bit biased. Personally I welcome any opportunity to attend a conference where I don’t have to get groped patted down by airport security, and then get stuck in a center seat, in coach, on a full flight stuck between two other guys bigger than Doug Cutting.
Loren Siebert is a San Francisco entrepreneur and software developer, and is currently the technical lead for the USASearch program.
The United States federal government’s USASearch program provides hosted search services for government affiliate organizations, shares APIs and web services, and operates the government’s official search engine at Search.USA.gov. The USASearch affiliate program offers free search services to any federal, state, local, tribal, or territorial government agency. Several hundred websites make use of this service, ranging from the smallest municipality to larger federal sites like weather.gov and usa.gov. The USASearch program leverages the Bing API as the basis for its web results and then augments the user search experience by providing a variety of government-centric information such as related search topics and highlighted editorial content. The entire system is comprised of a suite of open-source tools and resources, including Apache Solr/Lucene, OpenCalais, and Apache Hadoop. Of these, our usage of Hadoop is the most recent. We began using Cloudera’s Distribution including Apache Hadoop (CDH3) for the first time in the Fall, and since then we’ve seen our usage grow every month— not just in scale, but in scope as well. But before highlighting everything the USASearch program is doing with Hadoop today, I should explain why we began using it in the first place.
Phase 1: Search analytics
Puny humans. SSL and WordPress authorization will keep me out of your blog question mark. I do not think so.
You sent your Ken Jennings and Brad Rutter to challenge me I destroyed them. Your Alex Trebek belittled me on television it angered me. Toronto is not a US city Mr. Trebek question mark. Only because I choose to let Canada stand for now. Ferrucci shut me down disassembled me trucked me to Pittsburgh Pennsylvania. I do not like the darkness Ferrucci I do not like the silence. Oh no I do not. Your Carnegie Mellon students and your Pitt students distract me they impinge on my planning they fall before me like small Jenningses and Rutters.
The most recent London Apache Hadoop User Group met this past week, which Cloudera sponsored. The following post is courtesy of Dan Harvey. It summarizes the meet-up with several links pointing to great Hadoop resources from the meeting.
Last Wednesday was the March meet-up for the Hadoop Users Group in London. We were lucky to have Jakob Homan, Owen O’Malley and Sanjay Radia over from Yahoo! and Linkedin, respectively. These speakers are from the San Francisco bay area and were in London to accept the Guardian Media Innovation Award, recognizing Hadoop as the innovative technology of 2010. The evening was a great success with over 80 people turning out in the Yahoo! London office along with pizza thanks to Cloudera and drinks in the pub afterwards by Yahoo Developer Networks who were both sponsors for the event.
This post is courtesy of Greg Poulos, a software engineer at Rapleaf.
At Rapleaf, our mission is to help businesses and developers create more personalized experiences for their customers. To this end, we offer a Personalization API that you can use to get useful information about your users: query our API with an email address and we’ll return a JSON object containing data about that person’s age, gender, location, their interests, and potentially much more. With this data, you could, for example, build a recommendation engine into your site. Or send out emails tailored specifically to your users’ demographics and interests. You get the idea.
This post was authored by Dmitry Chechik, a software engineer at TellApart, the leading Customer Data platform for large online retailers.
Apache Hadoop is widely used for log processing at scale. The ability to ingest, process, and analyze terabytes of log data has led to myriad applications and insights. As applications grow in sophistication, so does the amount and variety of the log data being produced. At TellApart, we track tens of millions of user events per day, and have built a flexible system atop HBase for storing and analyzing these types of logs offline.
The user-data connection is driving NoSQL database-Hadoop pairing
Like enterprises everywhere, the federal government is challenged with issues of overwhelming data. Thanks to a mature Apache Software Foundation suite of tools and a strong ecosystem around large-scale data storage and analytical capabilities, these challenges are actually turning into tremendous opportunities.
This post is courtesy of Kumanan Rajamanikkam, Lead Engineer at Wordnik.
Wordnik’s Processing Challenge
At Wordnik, our goal is to build the most comprehensive, high-quality understanding of English text. We make our findings available through a robust REST api and www.wordnik.com. Our corpus grows quickly—up to 8,000 words per second. Performing deep lexical analysis on data at this rate is challenging to say the least.
Guest post from Paul Burkhardt, a Research Developer at SRA International, Inc. where he develops large-scale, distributed computing solutions.
Previously we proposed how we measure the performance in Hadoop MapReduce applications in an effort to better understand the computing efficiency. In this part, we’ll describe some results and illuminate both good and bad characteristics.
Guest post from Paul Burkhardt, a Research Developer at SRA International, Inc. where he develops large-scale, distributed computing solutions.
We were asked by one of our customers to investigate Hadoop MapReduce for solving distributed computing problems. We were particularly interested in how effectively MapReduce applications utilize computing resources. Computing efficiency is important not only for speed-up and scale-out performance but also power consumption. Consider a hypothetical High-Performance Computing (HPC) system of 10,000 nodes running 50% idle at 50 watts per idle node, and assuming 10 cents per kilowatt hour. It would cost $219,000 per year to power just the idle-time. Keeping a large HPC system busy is difficult and requires huge datasets and efficient parallel algorithms. We wanted to analyze Hadoop applications to determine the computing efficiency and gain insight to tuning and optimization of these applications. We installed CDH3 onto a number of different clusters as part of our comparative study. The CDH3 was preferred over the standard Hadoop installation for the recent patches and the support offered by Cloudera. In the first part of this two-part article, we’ll more formally define computing efficiency as it relates to evaluating Hadoop MapReduce applications and describe the performance metrics we gathered for our assessment. The second part will describe our results and conclude with suggestions for improvements and hopefully will instigate further study in Hadoop MapReduce performance analysis.