Cloudera Blog · Impala Posts
Data analysts and business intelligence specialists have been at the heart of new trends driving business growth over the past decade, including log file and social media analytics. However, Big Data heretofore has been beyond the reach of analysts because traditional tools like relational databases don’t scale, and scalable systems like Apache Hadoop have historically required Java expertise.
Today, the rise of new ecosystem tools is rapidly broadening the community using Hadoop and Big Data. Projects like Cloudera Impala, Apache Hive, and Apache Pig have for the first time made Big Data accessible to those with traditional analytics backgrounds. With the launch of Data Analyst Training, Cloudera is helping the world’s analysts prove there’s nothing traditional about data analytics and BI on Hadoop.
The Democratization of Big Data
Cloudera Impala has many exciting features, but one of the most impressive is the ability to analyze data in multiple formats, with no ETL needed, in HDFS and Apache HBase. Furthermore, you can use multiple frameworks, such as MapReduce and Impala, to analyze that same data. Consequently, Impala will often run side-by-side with MapReduce on the same physical hardware, with both supporting business-critical workloads. For such multi-tenant clusters, Impala and MapReduce both need to perform well despite potentially conflicting demands for cluster resources.
In this post, we’ll share our experiences configuring Impala and MapReduce for optimal multi-tenant performance. Our goal is to help users understand how to tune their multi-tenant clusters to meet production service level objectives (SLOs), and to contribute to the community some test methods and performance models that can be helpful beyond Cloudera.
Defining Realistic Test Scenarios
Cloudera’s broad and diverse customer base makes testing against real-world scenarios a top concern. Realistic tests based on common use cases offer meaningful guidance, whereas guidance based on contrived, unrealistic testing often fails to translate to real-life deployments.
Our thanks to Brian Dirking, Director of Product Marketing for Alteryx, for the guest post below:
At Alteryx we are excited about the release of Cloudera Impala. Its impact on Big Data analytics is that real-time queries on Apache Hadoop will deliver faster access and faster results. This matters to our customers, the business users who access data, run analytics, and then follow up with new questions. Insight doesn’t happen all at once. The ability to query and refine quickly is ultimately what will lead business users to insight.
As business users need faster access to data, Alteryx provides a user-friendly way to access new solutions like Impala. With Impala support in Alteryx Strategic Analytics, business users can get faster access, and can refine data queries and the corresponding analytics to get the answers they need. They can combine these results with other datasets to provide the context necessary to make the right decision, and they can do it without having to go through months of training to master programming and query languages.
“Are data warehouses becoming victims of their own success?” Tony Baer asks in a recent blog post:
Our thanks to Ted Wasserman, product manager for Tableau, for the guest post below:
Many of our customers are turning to Apache Hadoop as they grapple with their big data challenges. Hadoop offers many benefits, such as scalability, economics, and versatility. Even so, adoption to date has largely centered on applications with “batch”-oriented workloads because of the latency imposed by the MapReduce framework. To increase Hadoop’s usefulness and adoption in the business intelligence space, where users need fast, interactive response times when they ask a question, a new approach was needed.
Cloudera Impala technology moves the ball forward for doing ad hoc visual analytics on Hadoop. In particular, we like Impala for several reasons:
Our thanks to Yves de Montcheuil, Vice President of Marketing for Talend, for the guest post below:
According to Wikipedia, the impala is a medium-sized African antelope; its name comes from the Zulu language meaning “gazelle”. Like elephants, it is found in savannas, and this may be the link with Hadoop. Impala is also the name of Cloudera’s SQL-on-Apache Hadoop project, launched in beta at Strata last October and just released in version 1.0.
SQL-on-Hadoop: wait a minute… isn’t that what Apache Hive is for? Well, yes and no. HiveQL certainly brings a set of SQL-like commands to Hadoop data. The big issue with Hive is that it’s very slow. More precisely, it’s not interactive. Queries take a long time to be “parsed” and distributed across the cluster. Response times can reach a minute, which is highly impractical for interactive use. It works fine for batch use (response times actually don’t vary much based on the dataset size), but when users want to mine Hadoop data, perform interactive queries or drill-downs, profile data, etc., they end up spending lots of time glaring at their screen (or fetching more coffee than they should).
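The latency argument above can be made concrete with a toy model: if each query pays a fixed startup cost before any data is scanned, that cost dominates small queries and response time barely changes with dataset size. The sketch below is illustrative only. The function names, the 45-second and 1-second overheads, and the 2 GB/s scan rate are all assumed values chosen for the example, not measurements of Hive or Impala.

```python
# Toy latency model: response time = fixed startup overhead + scan time.
# All numbers below are hypothetical, chosen only to illustrate the argument.

def response_time_s(data_gb: float, fixed_overhead_s: float, scan_gb_per_s: float) -> float:
    """Seconds to answer a query that scans data_gb gigabytes."""
    return fixed_overhead_s + data_gb / scan_gb_per_s


def overhead_share(data_gb: float, fixed_overhead_s: float, scan_gb_per_s: float) -> float:
    """Fraction of the total response time spent on fixed overhead."""
    return fixed_overhead_s / response_time_s(data_gb, fixed_overhead_s, scan_gb_per_s)


# Assumed per-query startup costs for two styles of engine (hypothetical).
ENGINES = {"batch-style": 45.0, "interactive-style": 1.0}
SCAN_GB_PER_S = 2.0  # assumed aggregate scan rate, identical for both engines

if __name__ == "__main__":
    for name, overhead_s in ENGINES.items():
        for data_gb in (1, 10, 100):
            total = response_time_s(data_gb, overhead_s, SCAN_GB_PER_S)
            share = overhead_share(data_gb, overhead_s, SCAN_GB_PER_S)
            print(f"{name:18s} {data_gb:4d} GB  {total:6.1f} s  overhead {share:.0%}")
```

Under these assumed numbers, the batch-style engine takes nearly the same time for 1 GB as for 10 GB because the startup cost dwarfs the scan, which is the pattern the paragraph describes for batch use.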
Our thanks to Kevin Spurway, Senior Vice President of Marketing for MicroStrategy Inc., for the guest post below:
Squeezing insight from Big Data isn’t easy. It’s a delicate balance between scalability, performance, and cost effectiveness across an entire architecture, spanning everything from data storage to mobile app consumption. That’s why MicroStrategy and Cloudera have been working closely together from a technology standpoint. And that’s why we’re proud to stand as a launch partner, certifying the integration between Cloudera’s new Impala project and our core MicroStrategy enterprise analytics platform.
We’ve been collaborating with Cloudera on Impala since its early stages, actively testing functionality, recommending enhancements, reviewing roadmaps, and sharing performance results. We’re especially enthusiastic because we see the launch of Impala as a giant step toward an era of highly cost-effective interactive analytics for Apache Hadoop-based Big Data, at speeds previously not possible.
This week represents quite a milestone for Cloudera and, at least we’d like to believe, the Hadoop ecosystem at large: the general availability release of Cloudera Impala. Since we launched the Impala beta program last fall, I’ve been fortunate enough to work with many of the 40+ early adopters who’ve been testing this near-real-time SQL-on-Hadoop engine in an effort to learn about their use cases and keep tabs on early experiences with the tool.
Customers running Impala today span a variety of industries, from large biotech company to online travel provider to digital advertiser to major financial institution, and each one has a unique use case for Impala. Stay tuned to learn more about their various use cases.
This week, I’d like to highlight Six3 Systems’ Wayne Wheeles (also a Champion of Big Data), who has been working with Impala to improve cyber security solutions, in particular the open source SherpaSurfing product.
On Monday April 29, Cloudera announced a strategic alliance with SAS. As the industry leader in business analytics software, SAS brings a formidable toolset to bear on the problem of extracting business value from large volumes of data.
Over the past few months, Cloudera has been hard at work along with the SAS team to integrate a number of SAS products with Apache Hadoop, delivering the ability for our customers to use these tools in their interaction with data on the Cloudera platform. In this post, we will delve into the major mechanisms that are available for connecting SAS to CDH, Cloudera’s 100% open-source distribution including Hadoop.
SAS/ACCESS to Hadoop
SAS/ACCESS lets you access data sets stored in Hadoop natively from within SAS. With SAS/ACCESS to Hadoop:
In October 2012, we introduced the Impala project, at that time the first known effort to bring a modern, open source, distributed SQL query engine to Apache Hadoop. Our release of source code and a beta implementation was met with widespread acclaim, and later inspired similar efforts across the industry that now measure themselves against the Impala standard.
Today, we are proud to announce the first production drop of Impala (download here), which reflects feedback from across the user community based on multiple types of real-world workloads. Just as a refresher, the main design principle behind Impala is complete integration with the Hadoop platform (jointly utilizing a single pool of storage, metadata model, security framework, and set of system resources). This integration allows Impala users to take advantage of the time-tested cost, flexibility, and scale advantages of Hadoop for interactive SQL queries, and makes SQL a first-class Hadoop citizen alongside MapReduce and other frameworks. The net result is that all your data becomes available for interactive analysis simultaneously with all other types of processing, with no ETL delays needed.
Although the features and performance results described below are impressive, it’s important to note that they represent only a down payment toward the full promise of Impala. There is much more to come — and soon.