Cloudera Developer Blog · Hive Posts
Today we bring you a brief interview with Alex Holmes, author of the new book, Hadoop in Practice (Manning). You can learn more about the book and download a free sample chapter here.
There are a few good Hadoop books on the market right now. Why did you decide to write this book, and how is it complementary to them?
When I started working with Hadoop I leaned heavily on Tom White’s excellent book, Hadoop: The Definitive Guide (O’Reilly Media), to learn about MapReduce and how the internals of Hadoop worked. As my experience grew and I started working with Hadoop in production environments I had to figure out how to solve problems such as moving data in and out of Hadoop, using compression without destroying data locality, performing advanced joining techniques and so on. These items didn’t have a lot of coverage in existing Hadoop books, and that’s really the idea behind Hadoop in Practice – it’s a collection of real-world recipes that I learned the hard way over the years.
After a long period of intense engineering effort and user feedback, we are very pleased, and proud, to announce the Cloudera Impala project. This technology is a revolutionary one for Hadoop users, and we do not take that claim lightly.
When Google published its Dremel paper in 2010, we were as inspired as the rest of the community by the technical vision to bring real-time, ad hoc query capability to Apache Hadoop, complementing traditional MapReduce batch processing. Today, we are announcing a fully functional, open-sourced codebase that delivers on that vision – and, we believe, a bit more – which we call Cloudera Impala. An Impala binary is now available in public beta form, but if you would prefer to test-drive Impala via a pre-baked VM, we have one of those for you, too. (Links to all downloads and documentation are here.) You can also review the source code and testing harness at Github right now.
Update time! As a reminder, Cloudera releases major versions of CDH, our 100% open source distribution of Apache Hadoop and related projects, annually and then updates to CDH every three months. Updates primarily comprise bug fixes but we will also add enhancements. We only include fixes or enhancements in updates that maintain compatibility, improve system stability and still allow customers and users to skip updates as they see fit.
We’re pleased to announce the availability of CDH4.1. We’ve seen excellent adoption of CDH4.0 since it went GA at the end of June and a number of exciting use cases have moved to production. CDH4.1 is an update that has a number of fixes but also a number of useful enhancements. Among them:
Social media has gained immense popularity with marketing teams, and Twitter is an effective tool for a company to get people excited about its products. Twitter makes it easy to engage users and communicate directly with them, and in turn, users can provide word-of-mouth marketing for companies by discussing the products. Given limited resources, and knowing we may not be able to talk to everyone we want to target directly, marketing departments can be more efficient by being selective about whom we reach out to.
In this post, we’ll learn how we can use Apache Flume, Apache HDFS, Apache Oozie, and Apache Hive to design an end-to-end data pipeline that will enable us to analyze Twitter data. This will be the first post in a series. The posts to follow to will describe, in more depth, how each component is involved and how the custom code operates. All the code and instructions necessary to reproduce this pipeline is available on the Cloudera Github.
Who is Influential?
Strata Conference + Hadoop World (Oct. 23-25 in New York City) is a bonanza for Hadoop and big data enthusiasts – but not only because of the technical sessions and tutorials. It’s also an important gathering place for the developer community, most of whom are eager to share info from their experiences in the “trenches”.
Just to make that process easier, Cloudera is teaming up with local meetups during that week to organize a series of meetings on a variety of topics. (If for no other reason, stop into one of these meetups for a chance to grab a coveted Cloudera t-shirt.)
Organizations in diverse industries have adopted Apache Hadoop-based systems for large-scale data processing. As a leading force in Hadoop development with customers in half of the Fortune 50 companies, Cloudera is in a unique position to characterize and compare real-life Hadoop workloads. Such insights are essential as developers, data scientists, and decision makers reflect on current use cases to anticipate technology trends.
Recently we collaborated with researchers at UC Berkeley to collect and analyze a set of Hadoop traces. These traces come from Cloudera customers in e-commerce, telecommunications, media, and retail (Table 1). Here I will explain a subset of the observations, and the thoughts they triggered about challenges and opportunities in the Hadoop ecosystem, both present and in the future.
We are happy to announce the general availability of CDH3 update 5. This update is a maintenance release of CDH3 platform and provides a considerable amount of bug-fixes and stability enhancements. Alongside these fixes, we have also included a few new features, most notable of which are the following:
Over the last couple of months the Hive team at Cloudera has been working hard to bring a bunch of exciting new features to Apache Hive. In this blog post, I’m going to talk about one such feature – Column Statistics in Hive – and how Hive’s query processing engine can benefit from it. The feature is currently a work in progress but we expect it to be available for review imminently.
While there are many possible execution plans for a query, some plans are more optimal than others. The query optimizer is responsible for generating an efficient execution plan for a given SQL query from the space of all possible plans. Currently, Hive’s query optimizer uses rules of thumbs to generate an efficient execution plan for a query. While such rules of thumb optimizations transform the query plan into a more efficient one, the resulting plan is not always the most efficient execution plan.
As mentioned in Part I, although Apache Hadoop and other Big Data technologies are typically applied to I/O intensive workloads, where parallel data channels dramatically increase I/O throughput, there is growing interest in applying these technologies to CPU intensive workloads. In this work, we used Hadoop and Hive to digitally signal process individual neuron voltage signals captured from electrodes embedded in the rat brain. Previously, this processing was performed on a single Matlab workstation, a workload that was both CPU intensive and data intensive, especially for intermediate output data. With Hadoop and Apache Hive, we were not only able to apply parallelism to the various processing steps, but had the additional benefit of having all the data online for additional ad hoc analysis. Here, we describe the technical details of our implementation, including the biological relevance of the neural signals and analysis parameters. In Part III, we will then describe the tradeoffs between the Matlab and Hadoop/Hive approach, performance results, and several issues identified with using Hadoop/Hive in this type of application.
For this work, we used a university Hadoop computing cluster. Note that it is blade-based, and is not an ideal configuration for Hadoop because of the limited number (2) of drive bays per node. It has these specifications:
In this three-part series of posts, we will share our experiences tackling a scientific computing challenge that may serve as a useful practical example for those readers considering Apache Hadoop and Apache Hive as an option to meet their growing technical and scientific computing needs. This first part describes some of the background behind our application and the advantages of Hadoop that make it an attractive framework in which to implement our solution. Part II dives into the technical details of the data we aimed to analyze and of our solution. Finally, we wrap up this series in Part III with a description of some of our main results, and most importantly perhaps, a list of things we learned along the way, as well as future possibilities for improvements.
About a year ago, after hearing increasing buzz about big data in general, and Hadoop in particular, I (Brad Rubin) saw an opportunity to learn more at our Twin Cities (Minnesota) Java User Group. Brock Noland, the local Cloudera representative, gave an introductory talk. I was really intrigued by the thought of leveraging commodity computing to tackle large-scale data processing. I teach several courses at the University of St. Thomas Graduate Programs in Software, including one in information retrieval. While I had taught the abstract principles behind the scale and performance solutions for indexing web-sized document collections, I saw an opportunity to integrate a real-world solution into the course.