Cloudera Engineering Blog · Hive Posts
As mentioned in Part I, although Apache Hadoop and other Big Data technologies are typically applied to I/O-intensive workloads, where parallel data channels dramatically increase I/O throughput, there is growing interest in applying these technologies to CPU-intensive workloads. In this work, we used Hadoop and Hive to perform digital signal processing on individual neuron voltage signals captured from electrodes embedded in the rat brain. Previously, this processing was performed on a single Matlab workstation, a workload that was both CPU intensive and data intensive, especially for intermediate output data. With Hadoop and Apache Hive, we were not only able to apply parallelism to the various processing steps, but also had the additional benefit of having all the data online for further ad hoc analysis. Here, we describe the technical details of our implementation, including the biological relevance of the neural signals and analysis parameters. In Part III, we will then describe the tradeoffs between the Matlab and Hadoop/Hive approaches, performance results, and several issues identified with using Hadoop/Hive in this type of application.
For this work, we used a university Hadoop computing cluster. Note that it is blade-based, and is not an ideal configuration for Hadoop because of the limited number (2) of drive bays per node. It has these specifications:
In this three-part series of posts, we will share our experiences tackling a scientific computing challenge that may serve as a useful practical example for those readers considering Apache Hadoop and Apache Hive as an option to meet their growing technical and scientific computing needs. This first part describes some of the background behind our application and the advantages of Hadoop that make it an attractive framework in which to implement our solution. Part II dives into the technical details of the data we aimed to analyze and of our solution. Finally, we wrap up this series in Part III with a description of some of our main results, and most importantly perhaps, a list of things we learned along the way, as well as future possibilities for improvements.
About a year ago, after hearing increasing buzz about big data in general, and Hadoop in particular, I (Brad Rubin) saw an opportunity to learn more at our Twin Cities (Minnesota) Java User Group. Brock Noland, the local Cloudera representative, gave an introductory talk. I was really intrigued by the thought of leveraging commodity computing to tackle large-scale data processing. I teach several courses at the University of St. Thomas Graduate Programs in Software, including one in information retrieval. While I had taught the abstract principles behind the scale and performance solutions for indexing web-sized document collections, I saw an opportunity to integrate a real-world solution into the course.
This past Monday marked the official release of Apache Hive 0.9.0. Users interested in taking this release of Hive for a spin can download a copy from the Apache archive site. The following post is a quick summary of new features and improvements users can expect to find in this update of the popular data warehousing system for Hadoop.
The 0.9.0 release continues the trend of extending Hive’s SQL support. Hive now understands the BETWEEN operator and the NULL-safe equality operator, and several new user-defined functions (UDFs) have been added, including printf(), sort_array(), and java_method(). Also, the concat_ws() function has been modified to accept arrays of strings as input parameters.
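To make the difference between ordinary and NULL-safe equality concrete, here is a minimal Python sketch of the two comparison semantics, with None standing in for SQL NULL. The function names sql_eq and null_safe_eq are hypothetical and exist only for illustration; this models the general SQL behavior (the NULL-safe operator is written <=> in HiveQL) and is not Hive's implementation.

```python
from typing import Any, Optional


def sql_eq(a: Any, b: Any) -> Optional[bool]:
    """Ordinary SQL '=': a NULL (None) on either side yields NULL (None)."""
    if a is None or b is None:
        return None
    return a == b


def null_safe_eq(a: Any, b: Any) -> bool:
    """NULL-safe equality: always returns True or False, never NULL.

    Two NULLs compare as equal; a NULL and a non-NULL compare as not equal.
    """
    if a is None or b is None:
        return a is None and b is None
    return a == b


if __name__ == "__main__":
    for left, right in [(1, 1), (1, 2), (1, None), (None, None)]:
        print(left, right, sql_eq(left, right), null_safe_eq(left, right))
```

The practical consequence is that a join or filter written with ordinary equality silently drops rows whose keys are NULL, while the NULL-safe form keeps rows where both sides are NULL.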
Earlier today, Cloudera proudly released the Cloudera Connector for Tableau. The availability of this connector serves both Tableau users who are looking to expand the volume of datasets they manipulate and Hadoop users who want to enable analysts like Tableau users to make the data within Hadoop more meaningful. Enterprises can now extract the full value of big data and allow a new class of power users to interact with Hadoop data in ways they previously could not.
The Cloudera Connector for Tableau is a free ODBC Driver that enables Tableau Desktop 7.0 to connect to Apache Hive. Tableau users can thus leverage Hive, Hadoop’s data warehouse system, as a data source for all the maps, charts, dashboards and other artifacts typically generated within Tableau.
The Apache Hive team is hard at work putting the finishing touches on the 0.8.0 release. While the release hasn’t reached the GA milestone yet, I think now would be a good time to start highlighting some of the new features and improvements that users can expect to find in this important update:
The infrastructure required to support table indexes was originally added in the 0.7.0 release, but at the time no viable indexing plugin was provided. Project contributors have remedied this situation in the 0.8.0 release with the inclusion of support for bitmap indexes. This is a very important addition to Hive since it promises to significantly increase the performance of queries on indexed tables. More information about Hive Table Indexes can be found in the original design document, as well as in the comments that accompany the Bitmap Index JIRA ticket.
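For readers unfamiliar with the technique, the following is a minimal Python sketch of the general idea behind a bitmap index: each distinct column value gets a bitmap with one bit per row, and a conjunctive filter becomes a bitwise AND. The column data and the helper names build_bitmap_index and matching_rows are made up for illustration; this is not how Hive stores or evaluates its indexes.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def build_bitmap_index(column: Sequence[Hashable]) -> Dict[Hashable, int]:
    """Map each distinct value to an int whose bit i is set when row i holds it."""
    index: Dict[Hashable, int] = defaultdict(int)
    for row_id, value in enumerate(column):
        index[value] |= 1 << row_id
    return dict(index)


def matching_rows(bitmap: int) -> List[int]:
    """Decode a bitmap back into an ascending list of row ids."""
    rows = []
    row_id = 0
    while bitmap:
        if bitmap & 1:
            rows.append(row_id)
        bitmap >>= 1
        row_id += 1
    return rows


# Hypothetical six-row table with two low-cardinality columns.
country = ["US", "DE", "US", "JP", "DE", "US"]
device = ["web", "web", "mobile", "web", "mobile", "web"]

country_idx = build_bitmap_index(country)
device_idx = build_bitmap_index(device)

# A filter such as country = 'US' AND device = 'web' reduces to one bitwise AND.
hits = country_idx["US"] & device_idx["web"]

if __name__ == "__main__":
    print(matching_rows(hits))  # prints [0, 5]
```

Bitmaps suit columns with few distinct values, because each value needs only one bit per row and combining predicates is cheap.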
The Development track at Hadoop World is a technical deep dive dedicated to discussion about Apache Hadoop and application development for Apache Hadoop. You will hear committers, contributors and expert users from various Hadoop projects discuss the finer points of building applications with Hadoop and the related ecosystem. The sessions will touch on foundational topics such as HDFS, HBase, Pig, Hive, Flume and other related technologies. In addition, speakers will address key development areas including tools, performance, bringing the stack together and testing the stack. Sessions in this track are for developers of all levels who want to learn more about upcoming features and enhancements, new tools, advanced techniques and best practices.
This blog was originally posted on the Apache Blog: https://blogs.apache.org/sqoop/entry/apache_sqoop_overview
Using Hadoop for analytics and data processing requires loading data into clusters and processing it in conjunction with other data that often resides in production databases across the enterprise. Loading bulk data into Hadoop from production systems or accessing it from MapReduce applications running on large clusters can be a challenging task. Users must consider details such as ensuring data consistency, the consumption of production system resources, and data preparation for provisioning the downstream pipeline. Transferring data using scripts is inefficient and time consuming. Directly accessing data residing on external systems from within MapReduce applications complicates those applications and exposes the production system to the risk of excessive load originating from cluster nodes.
Continuing with our practice from Cloudera’s Distribution Including Apache Hadoop v2 (CDH2), our goal is to provide regular (quarterly), predictable updates to the generally available release of our open source distribution. For CDH3 the first such update is available today, approximately 3 months from when CDH3 went GA.
For those of you who are recent Cloudera users, here is a refresh on our update policy:
This post was contributed by Jonathan Seidman from Orbitz. Jonathan is a Lead Engineer on the Intelligent Marketplace/Machine Learning team at Orbitz Worldwide. You can hear more from Jonathan at Hadoop World October 12th in NYC.
Orbitz Worldwide (NYSE:OWW) is composed of a global portfolio of online consumer travel brands including Orbitz, Cheaptickets, The Away Network, ebookers and HotelClub. Additionally, the company operates business-to-business services: Orbitz Worldwide Distribution provides third parties such as Amtrak, Delta, LAN, KLM, Air France and a number of other leading airlines with hotel booking capabilities, and Orbitz for Business provides corporate travel services to a number of Fortune 100 clients. The Orbitz Worldwide sites process millions of searches and transactions every day, which not surprisingly results in hundreds of gigabytes of log data per day. Not all of that data necessarily has value, but much of it does. Unfortunately, storing and processing all of that data in our existing data warehouse infrastructure is impractical because of expense and space considerations.
Our vision for Hadoop World is a conference where both newcomers and experienced Hadoop users can learn and be part of the growing Hadoop community.
We are also offering training sessions for newcomers and experienced Hadoop users alike. Whether you are looking for an Introduction to Hadoop or Hadoop Certification, or you want to learn more about related Hadoop projects, we have the training you are looking for.
With the recent release of CDH3b2, many users are more interested than ever in trying out Cloudera’s Distribution for Hadoop (CDH). One of the questions we often hear is, “What does it take to migrate?”
If you’re not familiar with CDH3b2, here’s what you need to know.
Announcing Two New Training Classes from Cloudera: Introduction to HBase and Analyzing Data with Hive and Pig
Cloudera is pleased to announce two new training courses: a one-day Introduction to HBase and a two-day session on Analyzing Data with Hive and Pig. These join a recently-expanded two-day Hadoop for Administrators course and our popular three-day Hadoop for Developers offering, any of which can be combined to provide extensive, customized training for your organization. Please contact firstname.lastname@example.org for more information regarding on-site training, or visit www.cloudera.com/hadoop-training to view our public course schedule.
Cloudera’s HBase course discusses use-cases for HBase, and covers the HBase architecture, schema modeling, access patterns, and performance considerations. During hands-on exercises, students write code to access HBase from Java applications, and use the HBase shell to manipulate data. Introduction to HBase also covers deployment and advanced features.
CDH3 beta 2 includes Apache Hive 0.5.0, the latest version of the popular open source Apache Hadoop data warehouse platform. Hive allows you to express data analysis tasks in a dialect of SQL called HiveQL, and then compiles these tasks into MapReduce jobs and executes the jobs on your Hadoop cluster. Hive is a natural entry point to Hadoop for people who have prior experience with relational databases, but even those who have never written a line of SQL should give it a chance since it is currently the only Hadoop dataflow programming platform to provide built-in facilities for managing metadata. This unique feature of Hive allows you to access your data through a Table abstraction, making it possible to cleanly separate your analysis logic from the details of how your data is formatted and parsed. This results in scripts that are easier to write and much easier to maintain.
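As a rough illustration of what compiling a query into MapReduce means, the sketch below walks a simple aggregation (conceptually SELECT page, COUNT(*) FROM views GROUP BY page) through map, shuffle, and reduce steps in plain Python. The table name views, the column page, and the three functions are hypothetical, and the real plans Hive generates are considerably more involved; this only shows the shape of the idea.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(rows: Iterable[dict]) -> Iterator[Tuple[str, int]]:
    """Emit (group key, 1) for every input row, like a mapper for COUNT(*)."""
    for row in rows:
        yield row["page"], 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group mapper output by key, standing in for the framework's shuffle step."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_phase(groups: Dict[str, List[int]]) -> Dict[str, int]:
    """Sum the values for each key, like a reducer producing the final counts."""
    return {key: sum(values) for key, values in groups.items()}


# Hypothetical input rows for a 'views' table.
rows = [{"page": "Hadoop"}, {"page": "Hive"}, {"page": "Hadoop"}]
counts = reduce_phase(shuffle(map_phase(rows)))

if __name__ == "__main__":
    print(counts)
```

Because the query names only the table and column, the parsing of the underlying files stays out of the analysis logic, which is the separation the Table abstraction provides.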
While Hive is great on its own, it’s even better when you connect it to other tools in the Hadoop ecosystem. Users can currently use Sqoop to import data from relational databases into Hive, run Hive jobs inside Oozie workflows, and design queries in the Beeswax query editor that comes included with Hue. Hive 0.6.0 will include new features that make it possible to seamlessly access HBase tables from Hive, and there is also work afoot to provide an integration point between Hive and Flume.
Hadoop has emerged as an indispensable component of any data-intensive enterprise infrastructure. In many ways, working with large datasets on a distributed computing platform (powered by commodity hardware or cloud infrastructure) has never been easier. But because customers are running clusters consisting of hundreds or thousands of nodes, and are processing massive quantities of data from production systems every hour, the logistics of efficient platform utilization can quickly become overwhelming.
To deal with this challenge, the Yahoo! engineering team created Oozie – the Hadoop workflow engine. We are pleased to provide Oozie with Cloudera’s distribution for Hadoop starting with the beta-2 release.
Why create a new workflow system?
This post was contributed by John Sichi, a committer on the Apache Hive project and a member of the Data Infrastructure team at Facebook.
As many readers may already know, Hive was initially developed at Facebook for dealing with explosive growth in our multi-petabyte data warehouse. Since its release as an Apache project, it has been put into use at a number of other companies for solving big data problems. Hive storage is based on Hadoop’s underlying append-only filesystem architecture, meaning that it is ideal for capturing and analyzing streams of events (e.g. web logs). However, a data warehouse also has to relate these event streams to application objects; in Facebook’s case, these include familiar items such as fan pages, user profiles, photo albums, or status messages.
Hive can store this information easily, even for hundreds of millions of users, but keeping the warehouse up to date with the latest information published by users can be a challenge, as the append-only constraint makes it impossible to directly apply individual updates to warehouse tables. Up until now, the only practical option has been to periodically pull snapshots of all of the information from live MySQL databases and dump them to new Hive partitions. This is a costly operation, meaning it can be done at most daily (leading to stale data in the warehouse), and does not scale well as data volumes continue to shoot through the roof.
Around the globe, more and more companies are turning to Hadoop to tackle data processing problems that don’t lend themselves well to traditional systems. Users in the community consistently ask us to offer training in more places and expand our course offerings, and those who have obtained certification have reported great success connecting with companies investing in Hadoop. All of this keeps us pretty excited about the long term prospects for Hadoop.
We recently announced our first international developer training sessions in Tokyo (sold out, waitlist available) and Taiwan, and we’re happy to follow up with sessions in the EU. We’ll be visiting London the first week of June, and Berlin the next. If you’ll be in Berlin that week, be sure to check out the Berlin Buzzwords conference – a two day event focused on Hadoop, Lucene, and NoSQL.
We’re proud to announce that Cloudera’s Distribution for Hadoop Version 2 (CDH2) is officially released.
We’ve come a long way to get to a production quality release. At the beginning of September we announced the first beta of CDH2. After 6 months of additional testing we announced a release candidate. The release candidate spent over a month hardening in Cloudera’s internal QA process and on a wide variety of customer clusters. CDH2 is now stable and ready for use – we are pleased to recommend it to all our production users.
In September 2009, we announced the first release of CDH2, our current testing repository. Packages in our testing repository are recommended for people who want more features and are willing to upgrade as bugs are worked out. Our testing packages pass unit and functional tests but will not have the same “soak time” as our stable packages. A testing release represents a work in progress that will eventually be promoted to stable. It’s a long road of feedback, bug fixes, QA and testing to move from testing to stable. As someone who tracks the maturity of a testing build throughout its life cycle, I’m pleased to say we’ve put a lot of polish into this release.
At the beginning of September, we announced the first release of CDH2, our current testing repository. Packages in our testing repository are recommended for people who want more features and are willing to upgrade as bugs are worked out. Our testing packages pass unit and functional tests but will not have the same “soak time” as our stable packages. A testing release represents a work in progress that will eventually be promoted to stable.
We plan on pushing new packages into the testing repository every 3 to 6 weeks. And it just so happens it is just about 3 weeks after we announced the first testing release. So it must be time for a new one. Here are some of the highlights:
(guest blog post by Pete Skomoroch)
In a previous post, I outlined how to build a basic trend tracking site called trendingtopics.org with Cloudera’s Distribution for Hadoop and Hive. TrendingTopics uses Hadoop to identify the top articles trending on Wikipedia and displays related news stories and charts. The data powering the site was pulled from an Amazon EBS Wikipedia Public Dataset containing 8 months of hourly pageview logfiles. In addition to the pageview logs, the EBS data volume also includes the full text content and link graph for all articles. This post will use that link graph data to build a new feature for our site: grouping related articles together under a single “lead trend” to ensure the homepage isn’t dominated by a single news story.
In March of this year, we released our distribution for Apache Hadoop. Our initial focus was on stability and making Hadoop easy to install. This original distribution, now named CDH1, was based on the most stable version of Apache Hadoop at the time: 0.18.3. We packaged up Apache Hadoop, Pig and Hive into RPMs and Debian packages to make managing Hadoop installations easier. For the first time ever, Hadoop cluster managers were able to bring up a deployment by running one of the following commands, depending on their Linux distribution:
In addition to providing you with a dependable release of Hadoop that is easy to configure, at Cloudera we also focus on developing tools to extend Hadoop’s usability, and make Hadoop a more central component of your data infrastructure. In this vein, we’re proud to announce the availability of Sqoop, a tool designed to easily import information from SQL databases into your Hadoop cluster.
Sqoop (“SQL-to-Hadoop”) is a straightforward command-line tool with the following capabilities:
In the process of working on a few things here I wanted to add some links to launch Apache Hive and the Hadoop Jobtracker. At first I considered just adding the links but I found myself wanting a button of some sort; an icon for them. I didn’t want to just use the (awesomely cute) Apache Hadoop logo elephant because these things are related to and part of Hadoop, but they aren’t Hadoop itself… What to do?
Well, I grabbed Illustrator and spent a bit of time putting together these icons. What do you think? We’ve opened up a ticket with the Hadoop project to contribute these to the project.