Our thanks to Manuel Spezzani, Indyco Technical Leader, and Edward William Gnudi, Indyco’s Chief of Customer Happiness, for the guest post below about using Indyco alongside Apache Impala.
In this post, you will learn how to automatically design a complete data warehouse solution on top of Impala using Indyco, a tool for designing, exploring, and understanding your business model (recently named a Cloudera Certificated Partner for the Impala platform).
The following post was originally published in the Ibis project blog. (Ibis is a data analysis framework incubating in Cloudera Labs that brings Apache Hadoop scale to Python development.)
The new Apache Kudu (incubating) columnar storage engine, together with the Apache Impala (incubating) interactive SQL engine, enables a new, fully open source big data architecture for data that is arriving and changing very quickly. By integrating Kudu and Impala with Ibis,
Fixes in CDH 5.5 make writing Parquet data for Apache Impala (incubating) much easier.
Over the last few months, several Cloudera customers have told us that Parquet is too hard to configure, with the main problem being finding the right layout for great performance in Impala. For that reason, CDH 5.5 contains new features that make those configuration problems go away.
Auto-Detection of HDFS Block Size
The new support for complex types in Impala makes running analytic workloads considerably simpler.
Impala 2.3 (shipping starting in Cloudera Enterprise 5.5) contains support for querying complex types in Apache Parquet tables, specifically ARRAY, MAP, and STRUCT. This capability enables users to query naturally nested data sets without having to perform ETL to flatten them. This feature provides a few major benefits, including:
- It removes additional ETL and data modeling work to flatten data sets.
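To make the flattening point concrete, here is a minimal Python sketch of the ETL step that nested-type support can make unnecessary. The `customers` records, the `orders` field, and the `flatten` helper are all hypothetical names chosen for illustration; the query string shows roughly how Impala's join-on-nested-collection syntax would express the same result against a table with that shape, assuming such a table exists.

```python
# Hypothetical nested records: each customer carries an ARRAY of STRUCTs.
customers = [
    {"name": "alice", "orders": [{"item": "book", "qty": 2},
                                 {"item": "pen", "qty": 5}]},
    {"name": "bob", "orders": []},
]

# Impala-style query over the nested column. The table and column names
# are assumptions made for this sketch, not taken from the post.
NESTED_QUERY = """
SELECT c.name, o.item, o.qty
FROM customers c, c.orders o
"""

def flatten(records):
    """Emit one flat row per (customer, order) pair, as a flattening ETL job would."""
    rows = []
    for rec in records:
        for order in rec["orders"]:
            rows.append((rec["name"], order["item"], order["qty"]))
    return rows

flat_rows = flatten(customers)
```

With complex types, the nested table can be queried directly, so a separate job like `flatten` and the flattened copy of the data it produces are no longer required. Note that, like an inner join, the sketch drops customers whose `orders` array is empty.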
Cloudera Navigator Optimizer, a new (beta) component of Cloudera Enterprise, helps optimize inefficient query workloads for best results on Apache Hadoop.
With the proliferation of Apache Hadoop deployments, more and more customers are looking to reduce operational overheads in their enterprise data warehouse (EDW) installations by exploiting low-cost, highly scalable, open source SQL-on-Hadoop frameworks such as Impala and Apache Hive. Processing portions of SQL workloads better suited to Hadoop on these frameworks,