Cloudera Blog · Hive Posts
Continuing with our practice from Cloudera’s Distribution Including Apache Hadoop v2 (CDH2), our goal is to provide regular (quarterly), predictable updates to the generally available release of our open source distribution. For CDH3 the first such update is available today, approximately 3 months from when CDH3 went GA.
For those of you who are recent Cloudera users, here is a refresher on our update policy:
This post was contributed by Jonathan Seidman from Orbitz. Jonathan is a Lead Engineer on the Intelligent Marketplace/Machine Learning team at Orbitz Worldwide. You can hear more from Jonathan at Hadoop World on October 12th in NYC.
Orbitz Worldwide (NYSE:OWW) is composed of a global portfolio of online consumer travel brands including Orbitz, Cheaptickets, The Away Network, ebookers and HotelClub. Additionally, the company operates business-to-business services: Orbitz Worldwide Distribution provides hotel booking capabilities to third parties such as Amtrak, Delta, LAN, KLM, Air France and a number of other leading airlines, and Orbitz for Business provides corporate travel services to a number of Fortune 100 clients. The Orbitz Worldwide sites process millions of searches and transactions every day, which not surprisingly results in hundreds of gigabytes of log data per day. Not all of that data necessarily has value, but much of it does. Unfortunately, storing and processing all of that data in our existing data warehouse infrastructure is impractical because of expense and space considerations.
Apache Hadoop was selected to provide a solution to the problem of long-term storage and processing of these large quantities of unstructured and semi-structured data. We deployed our first Hadoop clusters in late 2009 running Cloudera’s Distribution for Hadoop (CDH), and in early 2010 deployed Hive to provide structure and SQL-like access to Hadoop data. In the short period of time since our initial deployment we’ve seen Hadoop rapidly adopted as a component in a wide range of applications across the organization due to its power, ease of use, and suitability for solving big data problems.
Our vision for Hadoop World is a conference where both newcomers and experienced Hadoop users can learn and be part of the growing Hadoop community.
We are also offering training sessions for newcomers and experienced Hadoop users alike. Whether you are looking for an Introduction to Hadoop, Hadoop Certification, or you want to learn more about related Hadoop projects we have the training you are looking for.
With the recent release of CDH3b2, many users are more interested than ever in trying out Cloudera’s Distribution for Hadoop (CDH). One of the questions we often hear is, “what does it take to migrate?”
If you’re not familiar with CDH3b2, here’s what you need to know.
All versions of CDH provide:
Announcing Two New Training Classes from Cloudera: Introduction to HBase and Analyzing Data with Hive and Pig
Cloudera is pleased to announce two new training courses: a one-day Introduction to HBase and a two-day session on Analyzing Data with Hive and Pig. These join a recently-expanded two-day Hadoop for Administrators course and our popular three-day Hadoop for Developers offering, any of which can be combined to provide extensive, customized training for your organization. Please contact firstname.lastname@example.org for more information regarding on-site training, or visit www.cloudera.com/hadoop-training to view our public course schedule.
Cloudera’s HBase course discusses use-cases for HBase, and covers the HBase architecture, schema modeling, access patterns, and performance considerations. During hands-on exercises, students write code to access HBase from Java applications, and use the HBase shell to manipulate data. Introduction to HBase also covers deployment and advanced features.
Our Hive and Pig course is designed for developers who are skilled with SQL or scripting languages, but who are not Java experts. Hive and Pig are two approaches which allow non-Java programmers to access and manipulate massive amounts of data while abstracting away the complexities of MapReduce. Hive offers an SQL-like interface, while Pig’s scripting language, named Pig Latin, is easy for developers to learn. This course covers both technologies, and includes multiple hands-on exercises to reinforce key concepts.
CDH3 beta 2 includes Apache Hive 0.5.0, the latest version of the popular open source Apache Hadoop data warehouse platform. Hive allows you to express data analysis tasks in a dialect of SQL called HiveQL, and then compiles these tasks into MapReduce jobs and executes the jobs on your Hadoop cluster. Hive is a natural entry point to Hadoop for people who have prior experience with relational databases, but even those who have never written a line of SQL should give it a chance since it is currently the only Hadoop dataflow programming platform to provide built-in facilities for managing metadata. This unique feature of Hive allows you to access your data through a Table abstraction, making it possible to cleanly separate your analysis logic from the details of how your data is formatted and parsed. This results in scripts that are easier to write and much easier to maintain.
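As a rough sketch of how the Table abstraction works (the table and column names below are invented for illustration, not taken from any real deployment), a HiveQL script declares the storage format once in the DDL, so the analysis queries themselves stay free of parsing details:

```sql
-- Declare a table over raw tab-delimited log files; the format details
-- live here in the DDL rather than in every query. (Hypothetical schema.)
CREATE TABLE page_views (
  view_time STRING,
  user_id   BIGINT,
  url       STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- The analysis logic is plain SQL; Hive compiles this into a MapReduce
-- job and runs it on the cluster.
SELECT url, COUNT(*) AS views
FROM page_views
GROUP BY url;
```

If the underlying file format later changes, only the table definition needs to be updated; scripts written against the table keep working unchanged.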
While Hive is great on its own, it’s even better when you connect it to other tools in the Hadoop ecosystem. Users can currently use Sqoop to import data from relational databases into Hive, run Hive jobs inside Oozie workflows, and design queries in the Beeswax query editor included with Hue. Hive 0.6.0 will include new features that make it possible to seamlessly access HBase tables from Hive, and there is also work afoot to provide an integration point between Hive and Flume.
The 0.5.0 release of Hive includes a variety of feature enhancements and bug fixes that improve the usability and stability of the Hive platform. These changes include extensions to HiveQL such as support for the CREATE TABLE AS SELECT statement, LEFT SEMI JOINs, and LATERAL VIEWs, as well as support for User Defined Table Generating Functions. The 0.5.0 release also includes enhancements that improve the performance of GROUP BY aggregations and Hive’s RCFile columnar storage format.
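A few brief examples of these HiveQL additions, using hypothetical table names for illustration:

```sql
-- CREATE TABLE AS SELECT: materialize a query result as a new table.
CREATE TABLE popular_urls AS
SELECT url, COUNT(*) AS views
FROM page_views
GROUP BY url;

-- LEFT SEMI JOIN: return rows from the left table that have a match
-- on the right, Hive's way of expressing an IN/EXISTS-style filter.
SELECT pv.url
FROM page_views pv
LEFT SEMI JOIN blocked_urls b ON (pv.url = b.url);

-- LATERAL VIEW with a table-generating function: expand an array
-- column into one output row per element.
SELECT user_id, tag
FROM user_tags LATERAL VIEW explode(tags) t AS tag;
```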
Hadoop has emerged as an indispensable component of any data-intensive enterprise infrastructure. In many ways, working with large datasets on a distributed computing platform (powered by commodity hardware or cloud infrastructure) has never been easier. But because customers are running clusters consisting of hundreds or thousands of nodes, and are processing massive quantities of data from production systems every hour, the logistics of efficient platform utilization can quickly become overwhelming.
To deal with this challenge, the Yahoo! engineering team created Oozie – the Hadoop workflow engine. We are pleased to provide Oozie with Cloudera’s Distribution for Hadoop starting with the beta-2 release.
Why create a new workflow system?
You might wonder why a new workflow system is necessary for Hadoop, given that there are quite a few existing commercial and open-source systems available. While it is possible to use existing general-purpose workflow systems with Hadoop, it is anything but simple. Intricacies such as monitoring long-running jobs and interfacing with the distributed file system require extensive work to port general workflow systems to the Hadoop environment. Oozie, on the other hand, is designed specifically for the Hadoop platform and uses it as its execution environment. It has built-in support for Hadoop tasks and integrates with this environment cleanly. Oozie itself is fairly lightweight, requires minimal configuration, and scales linearly – thus offering a sustainable approach to building workflows in the Hadoop environment.
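To make the built-in Hadoop support concrete, here is a minimal sketch of an Oozie workflow definition: a single MapReduce action wired between start and end nodes. The workflow name, node names, and HDFS paths are hypothetical, chosen only for illustration:

```xml
<!-- Minimal workflow: start -> one MapReduce action -> end,
     with an error transition to a kill node. -->
<workflow-app name="example-wf" xmlns="uri:oozie:workflow:0.1">
  <start to="mr-node"/>
  <action name="mr-node">
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <configuration>
        <property>
          <name>mapred.input.dir</name>
          <value>/user/example/input</value>
        </property>
        <property>
          <name>mapred.output.dir</name>
          <value>/user/example/output</value>
        </property>
      </configuration>
    </map-reduce>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>MapReduce action failed</message>
  </kill>
  <end name="end"/>
</workflow-app>
```

Because Hadoop actions are first-class nodes, the workflow engine can track job completion and handle failure transitions without the custom glue code a general-purpose system would require.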
This post was contributed by John Sichi, a committer on the Apache Hive project and a member of the Data Infrastructure team at Facebook.
As many readers may already know, Hive was initially developed at Facebook for dealing with explosive growth in our multi-petabyte data warehouse. Since its release as an Apache project, it has been put into use at a number of other companies for solving big data problems. Hive storage is based on Hadoop‘s underlying append-only filesystem architecture, meaning that it is ideal for capturing and analyzing streams of events (e.g. web logs). However, a data warehouse also has to relate these event streams to application objects; in Facebook’s case, these include familiar items such as fan pages, user profiles, photo albums, or status messages.
Hive can store this information easily, even for hundreds of millions of users, but keeping the warehouse up to date with the latest information published by users can be a challenge, as the append-only constraint makes it impossible to directly apply individual updates to warehouse tables. Up until now, the only practical option has been to periodically pull snapshots of all of the information from live MySQL databases and dump them to new Hive partitions. This is a costly operation, meaning it can be done at most daily (leading to stale data in the warehouse), and does not scale well as data volumes continue to shoot through the roof.
That’s where Apache HBase comes in. HBase is a scale-out table store which can support a very high rate of row-level updates over massive amounts of data. It sidesteps Hadoop’s append-only constraint by keeping recently updated data in memory and incrementally rewriting data to new files, splitting and merging intelligently based on data distribution changes. Since it is based on Hadoop, making HBase interoperate with Hive is straightforward, meaning HBase tables can be accessed as if they were native Hive tables. As a result, a single Hive query can now perform complex operations such as join, union, and aggregation across combinations of HBase and native Hive tables. Likewise, Hive’s INSERT statement can be used to move data between HBase and native Hive tables, or to reorganize data within HBase itself.
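As a sketch of what this looks like in practice (table, column family, and column names below are hypothetical), an HBase table is exposed to Hive through a storage handler clause in the DDL, after which it can be queried and joined like any other Hive table:

```sql
-- Map an existing HBase table into Hive. The hbase.columns.mapping
-- property pairs each Hive column with the HBase row key or a
-- column-family:qualifier entry.
CREATE TABLE hbase_users (user_id STRING, name STRING, fan_count INT)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:name,info:fan_count")
TBLPROPERTIES ("hbase.table.name" = "users");

-- Join the live HBase table against a native Hive table in one query.
SELECT u.name, COUNT(*) AS views
FROM hbase_users u
JOIN page_views pv ON (u.user_id = CAST(pv.user_id AS STRING))
GROUP BY u.name;
```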
Around the globe, more and more companies are turning to Hadoop to tackle data processing problems that don’t lend themselves well to traditional systems. Users in the community consistently ask us to offer training in more places and expand our course offerings, and those who have obtained certification have reported great success connecting with companies investing in Hadoop. All of this keeps us pretty excited about the long term prospects for Hadoop.
We recently announced our first international developer training sessions in Tokyo (sold out, waitlist available) and Taiwan, and we’re happy to follow up with sessions in the EU. We’ll be visiting London the first week of June, and Berlin the next. If you’ll be in Berlin that week, be sure to check out the Berlin Buzzwords conference – a two-day event focused on Hadoop, Lucene, and NoSQL.
We’ve also put together new offerings for this year’s upcoming Hadoop Summit, and we’ve worked out a special deal with Yahoo! to waive the conference registration fee for anyone who attends a Cloudera training session at the 2010 Hadoop Summit (you’ll get a discount code for training in your conference registration confirmation). In addition to our developer certification course, we’ll offer an extended version of our Systems Administration course, as well as a new full-day course on HBase. One particularly exciting new offering is our full-day course on Hive, which opens Hadoop up to anyone who knows SQL.