Cloudera Developer Blog · Impala Posts
OSCON 2013 is already receding in the rear-view mirror, but we had a great time. Cloudera’s sessions were very well attended, with Tom Wheeler taking the prize (well over 200 attendees for his “Introduction to Apache Hadoop” tutorial), but best of all was the opportunity to meet and mingle with people in the broader open source community. If you visited us at Booth 420, we hope you will now download and install the QuickStart VM after seeing it in our demo, and that your questions were adequately answered (most popular question: “Can you tell me more about Cloudera Impala?”).
In my biased opinion, the crowning achievement was that we not only distributed a couple hundred “Data is the New Bacon” T-shirts within a 36-hour period, but also ran out of the meat-free version shortly thereafter:
Every day, more data, users, and applications are accessing ever-larger Apache Hadoop clusters. Although this is good news for data-driven organizations overall, security administrators and compliance officers still have lingering questions about how to enable end users on existing Hadoop infrastructure without compromising security or compliance requirements.
While Hadoop has strong security at the filesystem level, it lacks the granular support needed to adequately secure access to data by users and BI applications. Today, this problem forces organizations in industries for which security is paramount (such as financial services, healthcare, and government) to make a choice: either leave data unprotected or lock out users entirely. Most of the time, the preferred choice is the latter, severely inhibiting access to data in Hadoop.
Today, Cloudera is excited to launch Sentry, a new open source project that addresses these concerns. Sentry is an authorization module for Hadoop that provides the granular, role-based authorization required to provide precise levels of access to the right users and applications. Its new support for role-based authorization, fine-grained authorization, and multi-tenant administration allows Hadoop operators to:
In the presentation below, Scott Leberknight of Near Infinity has done such a good and thorough job of dissecting Cloudera Impala that we want to share it with you here.
Notably, Scott has run unscientific but revealing benchmarks based on the current version (1.0.1) inside the QuickStart VM compared to Apache Hive 0.11. (Spoiler: Impala queries were up to 39x faster for interactive queries.) See here for a set of more scientific benchmarks based on concurrent interactive queries run by Cloudera recently (Impala up to 68x faster in that case).
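For readers who want to reproduce this kind of comparison themselves, a figure such as “39x faster” is simply the ratio of the slower engine's query time to the faster one's. The sketch below shows that arithmetic; the function name `speedup` and the timings are hypothetical placeholders for illustration, not measurements from either benchmark.

```python
def speedup(baseline_seconds: float, candidate_seconds: float) -> float:
    """Return how many times faster the candidate ran than the baseline."""
    if baseline_seconds <= 0 or candidate_seconds <= 0:
        raise ValueError("timings must be positive")
    return baseline_seconds / candidate_seconds


if __name__ == "__main__":
    # Hypothetical wall-clock timings in seconds (NOT real benchmark results).
    hive_seconds = 60.0
    impala_seconds = 4.0
    print(f"{speedup(hive_seconds, impala_seconds):.0f}x faster")
```

Because the ratio is computed per query, an “up to” figure reports the best single query, not the average across the whole set.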
Conclusion: Hive continues to improve as a batch processing/MapReduce framework with Cloudera’s help. But for interactive SQL for Hadoop, Impala is the solution. View for yourself below!
For years, Cloudera has provided virtual machines that give you a working Apache Hadoop environment out-of-the-box. It’s the quickest way to learn and experiment with Hadoop right from your desktop.
We’re constantly updating and improving the QuickStart VM, and in the latest release there are two of Cloudera’s new products that give you easier and faster access to your data: Cloudera Search and Cloudera Impala. We’ve also added corresponding applications to Hue – an open source web-based interface for Hadoop, and the easiest way to interact with your data.
Cloudera Search integrates Apache Solr with the rest of the platform to let you do full-text search of the data stored in your cluster, just as you would with an online search engine. Cloudera Impala, on the other hand, lets you execute SQL queries against that same data, on the same platform, and get results back fast enough to explore and analyze interactively. Having both workloads available on the same cluster eliminates the pain of moving large data sets around.
Data analysts and business intelligence specialists have been at the heart of new trends driving business growth over the past decade, including log file and social media analytics. However, Big Data heretofore has been beyond the reach of analysts because traditional tools like relational databases don’t scale, and scalable systems like Apache Hadoop have historically required Java expertise.
Today, the rise of new ecosystem tools is rapidly broadening the community using Hadoop and Big Data. Projects like Cloudera Impala, Apache Hive, and Apache Pig have for the first time made Big Data accessible to those with traditional analytics backgrounds. With the launch of Data Analyst Training, Cloudera is helping the world’s analysts prove there’s nothing traditional about data analytics and BI on Hadoop.
The Democratization of Big Data
Cloudera Impala has many exciting features, but one of the most impressive is the ability to analyze data in multiple formats, with no ETL needed, in HDFS and Apache HBase. Furthermore, you can use multiple frameworks, such as MapReduce and Impala, to analyze that same data. Consequently, Impala will often run side-by-side with MapReduce on the same physical hardware, with both supporting business-critical workloads. For such multi-tenant clusters, Impala and MapReduce both need to perform well despite potentially conflicting demands for cluster resources.
In this post, we’ll share our experiences configuring Impala and MapReduce for optimal multi-tenant performance. Our goal is to help users understand how to tune their multi-tenant clusters to meet production service level objectives (SLOs), and to contribute to the community some test methods and performance models that can be helpful beyond Cloudera.
Defining Realistic Test Scenarios
Cloudera’s broad and diverse customer base makes it a top concern to do testing for real-world scenarios. Realistic tests based on common use cases offer meaningful guidance, whereas guidance based on contrived, unrealistic testing often fails to translate to real-life deployments.
Our thanks to Brian Dirking, Director of Product Marketing for Alteryx, for the guest post below:
At Alteryx we are excited about the release of Cloudera Impala. For Big Data analytics, the ability to run real-time queries on Apache Hadoop means faster access to data and faster results. That matters to our customers: business users who access data, run analytics, and then follow up with new questions. Insight doesn’t happen all at once. The ability to query and refine quickly is ultimately what will lead business users to insight.
As business users need faster access to data, Alteryx provides a user-friendly way to access new solutions like Impala. With Impala support in Alteryx Strategic Analytics, business users can get faster access, and can refine data queries and the corresponding analytics to get the answers they need. They can combine these results with other datasets to provide the context necessary to make the right decision, and they can do it without having to go through months of training to master programming and query languages.
“Are data warehouses becoming victims of their own success?”, Tony Baer asks in a recent blog post:
Our thanks to Ted Wasserman, product manager for Tableau, for the guest post below:
Many of our customers are turning to Apache Hadoop as they grapple with their big data challenges. Hadoop offers many benefits, such as scalability, economics, and versatility. Even so, adoption to date has largely centered on applications with “batch”-oriented workloads because of the latency imposed by the MapReduce framework. To increase Hadoop’s usefulness and adoption in the business intelligence space, where users need fast, interactive response times when they ask a question, a new approach was needed.
Cloudera Impala technology moves the ball forward for doing ad hoc visual analytics on Hadoop. In particular, we like Impala for several reasons:
Our thanks to Yves de Montcheuil, Vice President of Marketing for Talend, for the guest post below:
According to Wikipedia, the impala is a medium-sized African antelope; its name comes from the Zulu language meaning “gazelle”. Like elephants, it is found in savannas, and this may be the link with Hadoop. Impala is also the name of Cloudera’s SQL-on-Apache Hadoop project, launched in beta at Strata last October and just released in version 1.0.
SQL-on-Hadoop... wait a minute, isn’t that what Apache Hive is for? Well, yes and no. HiveQL certainly brings a set of SQL-like commands to Hadoop data. The big issue with Hive: it’s very slow. More precisely, it’s not interactive. Queries take a long time to be “parsed” and distributed across the cluster. Response times can reach a minute, which is highly impractical for interactive use. It works fine for batch use (response times actually don’t vary much based on the dataset size), but when users want to mine Hadoop data, perform interactive queries or drill-downs, profile data, etc., they end up spending lots of time glaring at their screen (or fetching more coffee than they should).