Cloudera Engineering Blog · Hive Posts
Apache Hive 1.2.0, although not a major release, contains significant improvements.
Recently, the Apache Hive community moved to a more frequent, incremental release schedule. So, a little while ago, we covered the Apache Hive 1.0.0 release and explained how it was renamed from 0.14.1 with only minor feature additions since 0.14.0.
Learn how to read FIX message files directly with Hive, create a view to simplify user queries, and use a flattened Apache Parquet table to enable fast user queries with Impala.
The Financial Information eXchange (FIX) protocol is used widely by the financial services industry to communicate various trading-related activities. Each FIX message is a record that represents an action by a financial party, such as a new order or an execution report. As the raw point of truth for much of the trading activity of a financial firm, it makes sense that FIX messages are an obvious data source for analytics and reporting in Apache Hadoop.
A Hive-on-Spark beta is now available via CDH parcel. Give it a try!
The Hive-on-Spark project (HIVE-7292) is one of the most watched projects in Apache Hive history. It has attracted developers from across the ecosystem, including from organizations such as Intel, MapR, IBM, and Cloudera, and gained critical help from the Spark community.
The Apache Hive PMC has recently voted to release Hive 1.0.0 (formerly known as Hive 0.14.1).
This release is recognition of the work the Apache Hive community has done over the past nine years and is continuing to do. The Apache Hive 1.0.0 release is a codebase that was expected to be released as 0.14.1 but the community felt it was time to move to a 1.x.y release naming structure.
Interested in Hive-on-Spark progress? This new AMI gives you a hands-on experience.
Nearly one year ago, the Apache Hadoop community began to embrace Apache Spark as a powerful batch-processing engine. Today, many organizations and projects are augmenting their Hadoop capabilities with Spark. As part of this shift, the Apache Hive community is working to add Spark as an execution engine for Hive. The Hive-on-Spark work is being tracked by HIVE-7292 which is one of the most popular JIRAs in the Hadoop ecosystem. Furthermore, three weeks ago, the Hive-on-Spark team offered the first demo of Hive on Spark.
The community effort to make Apache Spark an execution engine for Apache Hive is making solid progress.
Apache Spark is quickly becoming the programmatic successor to MapReduce for data processing on Apache Hadoop. Over the course of its short history, it has become one of the most popular projects in the Hadoop ecosystem, and is now supported by multiple industry vendors—ensuring its status as an emerging standard.
This new feature, jointly developed by Cloudera and Intel engineers, makes management of role-based security much easier in Apache Hive, Impala, and Hue.
Apache Sentry (incubating) provides centralized authorization for services and applications in the Apache Hadoop ecosystem, allowing administrators to set up granular, role-based protection on resources, and to review them in one place. Previously, Sentry only designated administrators to
REVOKE privileges on an authorizable object. In Apache Sentry 1.5.0 (shipping inside CDH 5.2), we have implemented a new feature (SENTRY-327) that allows admin users to delegate the
GRANT privilege to other users using
WITH GRANT OPTION. If a user has the
GRANT OPTION privilege on a specific resource, the user can now grant the
GRANT privilege to other users on the same resource. Apache Hive, Impala, and Hue have all been updated to take advantage of this new Sentry functionality.
Our thanks to Rakesh Rao of Quaero, for allowing us to re-publish the post below about Quaero’s experiences using partitioning in Apache Hive.
In this post, we will talk about how we can use the partitioning features available in Hive to improve performance of Hive queries.
Two of the most vibrant communities in the Apache Hadoop ecosystem are now working together to bring users a Hive-on-Spark option that combines the best elements of both.
(Editor’s note [Feb. 25, 2015]: A Hive-on-Spark beta release is now available for download. Learn more here.)
Learn how HiveServer, Apache Sentry, and Impala help make Hadoop play nicely with BI tools when Kerberos is involved.
In 2010, I wrote a simple pair of blog entries outlining the general considerations behind using Apache Hadoop with BI tools. The Cloudera partner ecosystem has positively exploded since then, and the technology has matured as well. Today, if JDBC is involved, all the pieces needed to expose Hadoop data through familiar BI tools are available:
Our thanks to Don Drake (@dondrake), an independent technology consultant who is currently working as a Principal Big Data Consultant at Allstate Insurance, for the guest post below about his experiences with Impala.
It started with a simple request from one of the managers in my group at Allstate to put together a demo of Tableau connecting to Cloudera Impala. I had previously worked on Impala with a large dataset about a year ago while it was still in beta, and was curious to see how Impala had improved since then in features and stability.
More than 300 bug fixes and stable features in Apache Hive 0.13 have already been backported into CDH 5.0.0.
Last week, the Hive community voted to release Hive 0.13. We’re excited about the continued efforts and progress in the project and the latest release — congratulations to all contributors involved!
The following post, by Sarah Cannon of Digital Reasoning, was originally published in that company’s blog. Digital Reasoning has graciously permitted us to re-publish here for your convenience.
At the beginning of each release cycle, engineers at Digital Reasoning are given time to explore the latest in Big Data technologies, examining how the frequently changing landscape might be best adapted to serve our mission. As we sat down in the early stages of planning for Synthesys 3.8 one of the biggest issues we faced involved reconciling the tradeoff between flexibility and performance. How can users quickly and easily retrieve knowledge from Synthesys without being tied to one strict data model?
The CDH software stack lets you use your tool of choice with the Parquet file format – - offering the benefits of columnar storage at each phase of data processing.
An open source project co-founded by Twitter and Cloudera, Parquet was designed from the ground up as a state-of-the-art, general-purpose, columnar file format for the Apache Hadoop ecosystem. In particular, Parquet has several features that make it highly suited to use with Cloudera Impala for data warehouse-style operations:
This quick demo illustrates how easy it is to implement role-based access and control in Impala using Sentry.
Apache Sentry (incubating) is the Apache Hadoop ecosystem tool for role-based access control (RBAC). In this how-to, I will demonstrate how to implement Sentry for RBAC in Impala. I feel this introduction is best motivated by a use case.
Bringing Parquet support to Hive was a community effort that deserves congratulations!
Previously, this blog introduced Parquet, an efficient ecosystem-wide columnar storage format for Apache Hadoop. As discussed in that blog post, Parquet encodes data extremely efficiently and as described in Google’s original Dremel paper. (For more technical details on the Parquet format read Dremel made simple with Parquet, or go directly to the open and community-driven Parquet Format specification.)
Migrating from the Hive CLI to Beeline isn’t as simple as changing the executable name, but this post makes it easy nonetheless.
In its original form, Apache Hive was a heavyweight command-line tool that accepted queries and executed them utilizing MapReduce. Later, the tool split into a client-server model, in which HiveServer1 is the server (responsible for compiling and monitoring MapReduce jobs) and Hive CLI is the command-line interface (sends SQL to the server).
Cloudera provides docs and a sample build environment to help you get easily started writing your own Impala UDFs.
User-defined functions (UDFs) let you code your own application logic for processing column values during a Cloudera Impala query. For example, a UDF could perform calculations using an external math library, combine several column values into one, do geospatial calculations, or other kinds of tests and transformations that are outside the scope of the built-in SQL operators and functions.
Impala’s speed now beats the fastest SQL-on-Hadoop alternatives. Test for yourself!
Since the initial beta release of Cloudera Impala more than one year ago (October 2012), we’ve been committed to regularly updating you about its evolution into the standard for running interactive SQL queries across data in Apache Hadoop and Hadoop-based enterprise data hubs. To briefly recap where we are today:
A quick on-ramp (and demo) for using the new Sentry module for RBAC in conjunction with Hive
One attribute of the Enterprise Data Hub is fine-grained access to data by users and apps. This post about supporting infrastructure for that goal was originally published at blogs.apache.org. We republish it here for your convenience.
The following post was originally published by the Hue Team at the Hue blog in a slightly different form.
Hue, the open source web GUI that makes Apache Hadoop easy to use, has supported Cloudera Impala since its inception to enable fast, interactive SQL queries from within your browser. In this post, you’ll see a demo of Hue’s Impala app in action and explore its impressive query speed for yourself.
Impala App Demo
This installment of the Hue demo series is about accessing the Hive Metastore from Hue, as well as using HCatalog with Hue. (Hue, of course, is the open source Web UI that makes Apache Hadoop easier to use.)
What is HCatalog?
HCatalog is a module in Apache Hive that enables non-Hive scripts to access Hive tables. You can then directly load tables with Apache Pig or MapReduce without having to worry about re-defining the input schemas, or caring about or duplicating the data’s location.
The ecosystem is evolving at a rapid pace – so rapidly, that important developments are often passing through the public attention zone too quickly. Thus, we think it might be helpful to bring you a digest (by no means complete!) of our favorite highlights on a regular basis. (This effort, by the way, has different goals than the fine Hadoop Weekly newsletter, which has a more expansive view – and which you should subscribe to immediately, as far as we’re concerned.)
Find the first installment below. Although the time period reflected here is obviously more than a month long, we have some catching up to do before we can move to a truly monthly cadence.
Every day, more data, users, and applications are accessing ever-larger Apache Hadoop clusters. Although this is good news for data driven organizations overall, for security administrators and compliance officers, there are still lingering questions about how to enable end-users under existing Hadoop infrastructure without compromising security or compliance requirements.
While Hadoop has strong security at the filesystem level, it lacks the granular support needed to adequately secure access to data by users and BI applications. Today, this problem forces organizations in industries for which security is paramount (such as financial services, healthcare, and government) to make a choice: either leave data unprotected or lock out users entirely. Most of the time, the preferred choice is the latter, severely inhibiting access to data in Hadoop.
Editor’s note (added Feb. 2, 2014): You can review the latest (and exciting) Impala performance benchmark results by Cloudera here.
In the presentation below, Scott Leberknight of Near Infinity has done such a good and thorough job of dissecting Cloudera Impala, we want to share it with you here.
Apache Hive was one of the first projects to bring higher-level languages to Apache Hadoop. Specifically, Hive enables the legions of trained SQL users to use industry-standard SQL to process their Hadoop data.
However, as you probably have gathered from all the recent community activity in the SQL-over-Hadoop area, Hive has a few limitations for users in the enterprise space. Until recently, two in particular – concurrency and security – were largely unaddressed.
Data analysts and business intelligence specialists have been at the heart of new trends driving business growth over the past decade, including log file and social media analytics. However, Big Data heretofore has been beyond the reach of analysts because traditional tools like relational databases don’t scale, and scalable systems like Apache Hadoop have historically required Java expertise.
In the first installment of the demo series about Hue — the open source Web UI that makes Apache Hadoop easier to use — you learned how file operations are simplified via the File Browser application. In this installment, we’ll focus on analyzing data with Hue, using Apache Hive via Hue’s Beeswax and Catalog applications (based on Hue 2.3 and later).
The Yelp Dataset Challenge provides a good use case. This post explains, through a video and tutorial, how you can get started doing some analysis and exploration of Yelp data with Hue. The goal is to find the coolest restaurants in Phoenix!
Dataset Challenge with Hue
A World-Class EDW Requires a World-Class Hadoop Team
Persado is the global leader in persuasion marketing technology, a new category in digital marketing. Our revolutionary technology maps the genome of marketing language and generates the messages that work best for any customer and any product at any time. To assure the highest quality experience for both our clients and end-users, our engineering team collaborates with Ph.D. statisticians and data analysts to develop new ways to segment audiences, discover content, and deliver the most relevant and effective marketing messages in real time.
In this installment, meet Cloudera Software Engineer/Apache Bigtop Committer Mark Grover (@mark_grover).
Hue 2.2 , the open source web-based interface that makes Apache Hadoop easier to use, lets you interact with Hadoop services from within your browser without having to go to a command-line interface. It features different applications like an Apache Hive editor and Apache Oozie dashboard and workflow builder.
This post is based on our “Analyzing Twitter Data with Hadoop” sample app and details how the same results can be achieved through Hue in a simpler way. Moreover, all the code and examples of the previous series have been updated to the recent CDH4.2 release.
The following guest post comes to you from Alan Gardner of remote database services and consulting company Pythian, who participated in Data Hacking Day (and was on the winning team!) at Cloudera’s offices in February.
Last Feb. 25, just prior to attending Strata, Alex Gorbachev (our CTO) and I had the chance to visit Cloudera’s Palo Alto offices for Data Hacking Day. The goal of the event was to produce something cool that leverages Cloudera Impala – the new open source, low-latency platform for querying data in Apache Hadoop.
This guest post is provided by Rohit Menon, Product Support and Development Specialist at Subex.
I am a software developer in Denver and have been working with C#, Java, and Ruby on Rails for the past six years. Writing code is a big part of my life, so I constantly keep an eye out for new advances, developments, and opportunities in the field, particularly those that promise to have a significant impact on software engineering and the industries that rely on it.
In my current role working on revenue assurance products in the telecom space for Subex, I have regularly heard from customers that their data is growing at tremendous rates and becoming increasingly difficulty to process, often forcing them to portion out data into small, more manageable subsets. The more I heard about this problem, the more I realized that the current approach is not a solution, but an opportunity, since companies could clearly benefit from more affordable and flexible ways to store data. Better query capability on larger data sets at any given time also seemed key to derive the rich, valuable information that helps drive business. Ultimately, I was hoping to find a platform on which my customers could process all their data whenever they needed to. As I delved into this Big Data problem of managing and analyzing at mega-scale, it did not take long before I discovered Apache Hadoop.
Mission: Hands-On Hadoop
Last week Cloudera released the 4.5 release of Cloudera Manager, the leading framework for end-to-end management of Apache Hadoop clusters. (Download Cloudera Manager here, and see install instructions here.) Among many other features, Cloudera Manager 4.5 adds support for Apache Hive. In this post, I’ll explain how to set up a Hive server for use with Cloudera Manager 4.5 (and later).
For details about other new features in this release, please see the full release notes:
Cloudera Impala, the open-source real-time query engine for Apache Hadoop, uses many tools and techniques to get the best query performance. This blog post will discuss how we use runtime code generation to significantly improve our CPU efficiency and overall query execution time. We’ll explain the types of inefficiency that code-generation eliminates and go over in more detail one of the queries in the TPCH workload where code generation improves overall query speeds by close to 3x.
Why Code Generation?
The baseline for “optimal” query engine performance is a native application that is written specifically for your data format, written only to support your query. For example, it would be ideal if a query engine could execute this query:
Cloudera University is the world leader in Apache Hadoop training and certification. Our full suite of live courses and online materials is the best resource to get started with your Hadoop cluster in development or advance it towards production. We offer deep industry insight into the skills and expertise required to establish yourself as a leading Developer or Administrator managing and processing Big Data in this fast-growing field.
But did you know Cloudera training can also help you plan for the advanced stages and progress of your Hadoop cluster? In addition to core training for Developers and Administrators, we also offer the best (and, in some cases, only) opportunity to get up to speed on lifecycle projects within the Hadoop ecosystem in a classroom setting. Cloudera University’s course offerings go beyond the basics to include Training for Apache HBase, Training for Apache Hive and Pig, and Introduction to Data Science: Building Recommender Systems. Depending on your Big Data agenda, Cloudera training can help you increase the accessibility and queryability of your data, push your data performance towards real-time, conduct business-critical analyses using familiar scripting languages, build new applications and customer-facing products, and conduct data experiments to improve your overall productivity and profitability.
Our thanks to guest author Jon Natkins (@nattyice) of WibiData for the following post!
Today, many (if not most) companies have ETL or data enrichment jobs that are executed on a regular basis as data becomes available. In this scenario it is important to minimize the lag time between data being created and being ready for analysis.
For several good reasons, 2013 is a Happy New Year for Apache Hadoop enthusiasts.
In 2012, we saw continued progress on developing the next generation of the MapReduce processing framework (MRv2), work that will bear fruit this year. HDFS experienced major progress toward becoming a lights-out, fully enterprise-ready distributed filesystem with the addition of high availability features and increased performance. And a hint of the future of the Hadoop platform was provided with the Beta release of Cloudera Impala, a real-time query engine for analytics across HDFS and Apache HBase data.
Apache Hive is a fantastic tool for performing SQL-style queries across data that is often not appropriate for a relational database. For example, semistructured and unstructured data can be queried gracefully via Hive, due to two core features: The first is Hive’s support of complex data types, such as structs, arrays, and unions, in addition to many of the common data types found in most relational databases. The second feature is the SerDe.
What is a SerDe?
The SerDe interface allows you to instruct Hive as to how a record should be processed. A SerDe is a combination of a Serializer and a Deserializer (hence, Ser-De). The Deserializer interface takes a string or binary representation of a record, and translates it into a Java object that Hive can manipulate. The Serializer, however, will take a Java object that Hive has been working with, and turn it into something that Hive can write to HDFS or another supported system. Commonly, Deserializers are used at query time to execute
SELECT statements, and Serializers are used when writing data, such as through an
Our hearty congratulations to the Cloudera engineers who have been accepted as ApacheCon NA 2013 (Feb. 26-28 in Portland, OR) speakers for these talks:
At Cloudera, we put great pride into drinking our own champagne. That pride extends to our support team, in particular.
Cloudera Manager, our end-to-end management platform for CDH (Cloudera’s open-source, enterprise-ready distribution of Apache Hadoop and related projects), has a feature that allows subscription customers to send a snapshot of their cluster to us. When these cluster snapshots come to us from customers, they end up in a CDH cluster at Cloudera where various forms of data processing and aggregation can be performed.
This is the third article in a series about analyzing Twitter data using some of the components of the Apache Hadoop ecosystem that are available in CDH (Cloudera’s open-source distribution of Apache Hadoop and related projects). If you’re looking for an introduction to the application and a high-level view, check out the first article in the series.
In the previous article in this series, we saw how Flume can be utilized to ingest data into Hadoop. However, that data is useless without some way to analyze the data. Personally, I come from the relational world, and SQL is a language that I speak fluently. Apache Hive provides an interface that allows users to easily access data in Hadoop via SQL. Hive compiles SQL statements into MapReduce jobs, and then executes them across a Hadoop cluster.
Today we bring you a brief interview with Alex Holmes, author of the new book, Hadoop in Practice (Manning). You can learn more about the book and download a free sample chapter here.
There are a few good Hadoop books on the market right now. Why did you decide to write this book, and how is it complementary to them?
When I started working with Hadoop I leaned heavily on Tom White’s excellent book, Hadoop: The Definitive Guide (O’Reilly Media), to learn about MapReduce and how the internals of Hadoop worked. As my experience grew and I started working with Hadoop in production environments I had to figure out how to solve problems such as moving data in and out of Hadoop, using compression without destroying data locality, performing advanced joining techniques and so on. These items didn’t have a lot of coverage in existing Hadoop books, and that’s really the idea behind Hadoop in Practice – it’s a collection of real-world recipes that I learned the hard way over the years.
After a long period of intense engineering effort and user feedback, we are very pleased, and proud, to announce the Cloudera Impala project. This technology is a revolutionary one for Hadoop users, and we do not take that claim lightly.
When Google published its Dremel paper in 2010, we were as inspired as the rest of the community by the technical vision to bring real-time, ad hoc query capability to Apache Hadoop, complementing traditional MapReduce batch processing. Today, we are announcing a fully functional, open-sourced codebase that delivers on that vision – and, we believe, a bit more – which we call Cloudera Impala. An Impala binary is now available in public beta form, but if you would prefer to test-drive Impala via a pre-baked VM, we have one of those for you, too. (Links to all downloads and documentation are here.) You can also review the source code and testing harness at Github right now.
Update time! As a reminder, Cloudera releases major versions of CDH, our 100% open source distribution of Apache Hadoop and related projects, annually and then updates to CDH every three months. Updates primarily comprise bug fixes but we will also add enhancements. We only include fixes or enhancements in updates that maintain compatibility, improve system stability and still allow customers and users to skip updates as they see fit.
We’re pleased to announce the availability of CDH4.1. We’ve seen excellent adoption of CDH4.0 since it went GA at the end of June and a number of exciting use cases have moved to production. CDH4.1 is an update that has a number of fixes but also a number of useful enhancements. Among them:
Social media has gained immense popularity with marketing teams, and Twitter is an effective tool for a company to get people excited about its products. Twitter makes it easy to engage users and communicate directly with them, and in turn, users can provide word-of-mouth marketing for companies by discussing the products. Given limited resources, and knowing we may not be able to talk to everyone we want to target directly, marketing departments can be more efficient by being selective about whom we reach out to.
In this post, we’ll learn how we can use Apache Flume, Apache HDFS, Apache Oozie, and Apache Hive to design an end-to-end data pipeline that will enable us to analyze Twitter data. This will be the first post in a series. The posts to follow to will describe, in more depth, how each component is involved and how the custom code operates. All the code and instructions necessary to reproduce this pipeline is available on the Cloudera Github.
Who is Influential?
Strata Conference + Hadoop World (Oct. 23-25 in New York City) is a bonanza for Hadoop and big data enthusiasts – but not only because of the technical sessions and tutorials. It’s also an important gathering place for the developer community, most of whom are eager to share info from their experiences in the “trenches”.
Just to make that process easier, Cloudera is teaming up with local meetups during that week to organize a series of meetings on a variety of topics. (If for no other reason, stop into one of these meetups for a chance to grab a coveted Cloudera t-shirt.)
Organizations in diverse industries have adopted Apache Hadoop-based systems for large-scale data processing. As a leading force in Hadoop development with customers in half of the Fortune 50 companies, Cloudera is in a unique position to characterize and compare real-life Hadoop workloads. Such insights are essential as developers, data scientists, and decision makers reflect on current use cases to anticipate technology trends.
Recently we collaborated with researchers at UC Berkeley to collect and analyze a set of Hadoop traces. These traces come from Cloudera customers in e-commerce, telecommunications, media, and retail (Table 1). Here I will explain a subset of the observations, and the thoughts they triggered about challenges and opportunities in the Hadoop ecosystem, both present and in the future.
We are happy to announce the general availability of CDH3 update 5. This update is a maintenance release of CDH3 platform and provides a considerable amount of bug-fixes and stability enhancements. Alongside these fixes, we have also included a few new features, most notable of which are the following:
Over the last couple of months the Hive team at Cloudera has been working hard to bring a bunch of exciting new features to Apache Hive. In this blog post, I’m going to talk about one such feature – Column Statistics in Hive – and how Hive’s query processing engine can benefit from it. The feature is currently a work in progress but we expect it to be available for review imminently.
While there are many possible execution plans for a query, some plans are more optimal than others. The query optimizer is responsible for generating an efficient execution plan for a given SQL query from the space of all possible plans. Currently, Hive’s query optimizer uses rules of thumbs to generate an efficient execution plan for a query. While such rules of thumb optimizations transform the query plan into a more efficient one, the resulting plan is not always the most efficient execution plan.