Analytics and BI on Amazon S3 with Apache Impala (Incubating)


Thanks to new optimizations for running Impala on Amazon S3, doubling cluster size on AWS doubles multi-user performance while keeping total workload cost roughly the same.

With public-cloud deployments becoming increasingly popular, Cloudera is continuing to build out the capabilities of its platform to best take advantage of the cost-effective and flexible nature of the cloud. The current release of Cloudera’s platform (5.8) includes a major step forward in that area with Impala 2.6 able to store and query data directly from the Amazon S3 object store. By decoupling data and compute, Impala enables high-performance analytics across heterogeneous data stores at significantly reduced cost.

While traditional, monolithic databases including Amazon’s own cloud database offerings require you to prepare data elsewhere and then move it into the database—causing costly delays and volume limitations—Cloudera’s analytic database solution, with Impala at its core, provides direct, low-latency access to data wherever it may live. Today, potential data sources include S3, HDFS, Apache Kudu (beta), EMC Isilon, or Apache HBase; in the future, they will include other object stores, as well as combinations of storage engines and the tools for preparing and analyzing that data, all within the same platform.

Furthermore, with data stored on S3, Impala can enable elastic cloud use cases, including spinning up transient clusters and elastically scaling compute resources (again, benefits that are not possible with monolithic database architectures). This approach is possible because with Impala running on S3, performance scales linearly with compute: in our testing, doubling Impala cluster size (from 32 to 64 nodes) on AWS for a multi-user workload doubled performance while keeping total workload cost roughly the same. This powerful cost-optimization advantage lets your ops team allocate just the right amount of compute to match latency requirements, and then simply add more compute resources as needed for better performance.

With Cloudera Director 2.1, time to insight gets even shorter: you can simply deploy and manage production-ready Impala clusters on AWS, taking the pain and excessive cost out of setting up Apache Hadoop in the cloud. (See the demo video at the end of this post for more details about how to do that.)

In the remainder of this post, we’ll describe the new S3 functionality in Impala, the new optimizations that were needed to enable that support, and the results of internal scalability testing.

Functionality Details

As mentioned previously, Impala 2.6 introduces a range of new S3-oriented functionality, including:

  • Direct querying of data from Amazon S3: Data made available on S3 can be queried without having to move it to an HDFS cluster. This feature eliminates a painfully slow data-movement process before you can get insights from your data. As Apache Hive also supports S3, you can prepare and analyze data directly, all in one place.
  • Elastic scaling of compute without downtime: Storing data on S3 allows provisioning of compute resources to match requirements and scale nodes without incurring downtime. In contrast, other analytic databases, such as Amazon Redshift, cannot add nodes to a cluster and instead require you to copy all your data into a new cluster on every resize.
  • Data portability and flexibility: Since data stored on S3 can be read by Impala, Apache Spark, Hive, or generic non-Hadoop application clusters, data is not bound to a single application. This flexibility enables use of the same data across multiple ecosystem components like Spark or MapReduce, without needing to move it into an HDFS cluster.
  • Seamless data access across multiple filesystems: Impala can read and write data on S3 as if it were stored on an HDFS cluster—enabling new use cases such as storing hot data on HDFS and moving cold data into S3, all on the same Impala cluster. Furthermore, table partitions can be split across filesystems, letting you keep more recent partitions on HDFS and move older ones to S3 (see the SQL sketch following this list).
  • Spin-up and spin-down of transient clusters: With data residing on S3, operators can spin up and down clusters for changing needs and save on compute hosting costs without time-consuming data-load operations. (Cloudera Director makes spinning up and down clusters even easier; see demo to follow.)
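To make the first and fourth points concrete, here is a minimal Impala SQL sketch. The bucket name, paths, and the sales schema are hypothetical, but CREATE EXTERNAL TABLE ... LOCATION and ALTER TABLE ... SET LOCATION are the standard Impala mechanisms for pointing tables and partitions at S3:

```sql
-- Query data in place on S3: no load into HDFS required.
-- (Bucket, paths, and schema below are made up for illustration.)
CREATE EXTERNAL TABLE sales (
  id     BIGINT,
  amount DECIMAL(10,2)
)
PARTITIONED BY (sale_month STRING)
STORED AS PARQUET
LOCATION 's3a://example-bucket/warehouse/sales/';

-- Partitions of one table can live on different filesystems:
-- keep recent data on HDFS...
ALTER TABLE sales ADD PARTITION (sale_month='2016-06')
  LOCATION 'hdfs://namenode/warehouse/sales/sale_month=2016-06';

-- ...and, after copying the older files to S3, repoint the
-- cold partition at the object store.
ALTER TABLE sales PARTITION (sale_month='2015-06')
  SET LOCATION 's3a://example-bucket/archive/sales/sale_month=2015-06';

-- A single query then spans both filesystems transparently.
SELECT sale_month, SUM(amount) FROM sales GROUP BY sale_month;
```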

Next, we’ll explain the optimizations that make this functionality possible.

Cloud-Oriented Optimizations in Impala 2.6

As a modern MPP database, Impala was designed from the outset to work with multiple data sources. However, S3 has several functional differences from HDFS, introducing a new set of technical challenges to be addressed by the Impala team.

Unlike HDFS, S3 offers no read locality, has no notion of blocks, exhibits higher latency, and throttles responses to any host that makes too many requests. Furthermore, while “rename” is just a metadata operation in HDFS, it is a full copy-and-delete in S3.


To make Impala work effectively on S3, the following improvements, among others, were made to Impala’s read and write paths.

Synthesizing Blocks for Apache Parquet Files on S3

When reading a table from HDFS, each Impala daemon is assigned a set of scan ranges, computed based on the subdivision of files into blocks. For optimal scan performance, Parquet files are usually written so that each row group fits within a single block and does not cross block boundaries. Otherwise, a node assigned a block to scan would have to perform remote reads on other blocks to finish its row group(s), and the resulting imbalance (some nodes reading many more bytes than others) would skew the cluster and hurt scan performance. However, since S3 has no notion of blocks or block metadata, Impala synthesizes “artificial” blocks by splitting each file into fixed-size chunks, based on what it expects the row group size of the Parquet files to be. As a result, each Impala daemon has all the data it needs to perform its scan within a single block.
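A practical corollary for writers: if you control how the Parquet files are produced, sizing them predictably helps the synthesized chunks line up with row groups. A hedged sketch (the table names are hypothetical; PARQUET_FILE_SIZE is a standard Impala query option):

```sql
-- Write Parquet files at a predictable size so that row groups
-- align with the fixed-size chunks Impala synthesizes on S3.
SET PARQUET_FILE_SIZE=256m;

INSERT OVERWRITE TABLE sales_s3
SELECT * FROM sales_staging;
```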

Virtual Disk Queue for S3 Reads

Impala has a sophisticated IO manager that provides high-bandwidth disk reads. It was originally designed for the HDFS model, where one node has many attached disks. Impala uses metadata provided by HDFS to determine on which disk a particular range is located, and the IO manager maintains a queue per attached physical disk, which allows reads to be assigned directly to their “local” disk and each disk to process its workload in parallel. Since S3 is not disk-based, there was no appropriate queue to which S3 requests could be issued: reusing an existing physical-disk queue would have let the typically higher-latency S3 requests block any queued HDFS requests for that disk. Instead, Impala now maintains a single, separate queue for S3 reads, treating S3 as a “virtual” disk. As a result, S3 reads and HDFS reads can execute in parallel.
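One visible consequence is that a single query touching both filesystems no longer serializes its reads. A hypothetical example (both table names are made up):

```sql
-- The HDFS scan and the S3 scan below are serviced from
-- separate IO-manager queues, so they proceed in parallel.
SELECT h.customer_id, SUM(h.amount) AS recent_total
FROM   recent_sales_hdfs h
JOIN   customer_archive_s3 c ON c.id = h.customer_id
GROUP BY h.customer_id;
```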

Option to Skip Staging Data on the INSERT Path

When Impala executes an INSERT query, each Impala daemon writes a portion of the output data to a temporary staging location that is not yet part of the table. When all those writes are completed, the coordinator moves all the new data files to their final table location. With HDFS, these move operations are performed using rename(), which is a metadata-only operation and is relatively fast.

However, the S3A implementation of the rename() primitive physically copies all the data from its source location to its destination, resulting in a much longer delay before data is visible to the user. To avoid this delay, Impala supports an S3-specific query option called S3_SKIP_INSERT_STAGING. When enabled, each Impala daemon writes table data directly to its final location in the table, and the coordinator skips the rename() phase.

This approach significantly improves the end-to-end latency of writes to S3. However, enabling S3_SKIP_INSERT_STAGING also trades some fault tolerance for that performance: because table writes are not transactional, if a node fails during the write phase, Impala will not remove any partially written data from the table, and clients could read it. Without S3_SKIP_INSERT_STAGING, the coordinator can react to a node failure and cancel the query without ever making the partially written data visible from its temporary location. As a result, S3_SKIP_INSERT_STAGING is best suited to repeatable INSERTs, such as INSERT OVERWRITE, that can simply be re-run after a failure to produce the same result.
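A hedged usage sketch (the table names are hypothetical; S3_SKIP_INSERT_STAGING is the query option introduced in Impala 2.6):

```sql
-- Trade staging-based fault tolerance for faster S3 writes:
-- daemons write directly to the final location, skipping rename().
SET S3_SKIP_INSERT_STAGING=true;

-- Best paired with repeatable writes: if a node fails mid-write,
-- re-running this INSERT OVERWRITE produces a clean result.
INSERT OVERWRITE TABLE sales_s3
SELECT * FROM sales_staging WHERE sale_month = '2016-06';
```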

Scalability Testing

As previously described, Impala’s decoupled architecture allows you to grow or shrink compute without worrying about the data stored on S3. In our single-user benchmark, doubling the cluster size from 32 to 64 nodes nearly doubled (1.8x) performance for just 12% additional cost.

[Figure: single-user benchmark results, 32 vs. 64 nodes]

Multi-user testing showed even more exciting scaling behavior for Impala on S3: when we doubled the cluster size (again, from 32 to 64 nodes), performance exactly doubled while the end-to-end workload cost dropped by 4%.

[Figure: multi-user benchmark results, 32 vs. 64 nodes]

This result is great news for users; it means you can improve cluster performance just by adding additional nodes. (In this case, doubling the cluster size slightly reduced the total cost.)

Getting Started

The Impala documentation includes a great step-by-step guide to help you get started with Impala on S3.

We have also put together a step-by-step instructional video showing how to deploy an Impala cluster on S3 using Cloudera Director 2.1:

Conclusion

Support for read/write on S3 is a major step for Impala in the cloud, as it provides functionality and cloud elasticity beyond what you get with monolithic databases, even those shipped by Amazon.

While Impala on S3 opens up new cloud-native architectures, the performance of remote S3 understandably doesn’t match the performance of local EBS storage. In a future post, we’ll more thoroughly explore the performance differences between S3 and EBS and how/when you might use each (or both) of them for different portions of your data.

If you would like to contribute to Impala, please do get in touch!

Devadutta Ghat is a Senior Product Manager at Cloudera.

Sailesh Mukil is a Software Engineer at Cloudera, working on the Impala team.

Mostafa Mokhtar is a Software Engineer at Cloudera, working on the Impala team.

Henry Robinson is a Software Engineer at Cloudera, and a PMC Member of Impala.

Marcel Kornacker is a Tech Lead at Cloudera and the founder of Impala.

