Apache Impala (Incubating) on Amazon: Performance and Cost Considerations for S3 vs. EBS


The benchmark testing results detailed below can help you make an informed decision about AWS storage options for Impala.

In a recent post, you learned how Impala 2.6 on S3 delivers cloud-native features unmatched by other analytic databases in the cloud. With support for reading and writing data in Amazon S3, Impala provides cloud capabilities such as direct querying of data from S3, elastic scaling of compute, and seamless data portability and flexibility not found on other cloud-based analytic databases, such as Amazon Redshift.

Although S3 provides simple storage at relatively affordable prices, it’s important to understand the performance and cost considerations involved when storing data on S3 vs. Amazon Elastic Block Store (EBS) vs. attached storage in an on-premises deployment. In this post, we’ll explore some of those performance considerations in detail so you can make informed decisions.

Background

EBS is attached to the AWS compute node as a fully functional filesystem (similar to an attached SSD on an on-premises node), and Impala makes use of several filesystem features to deliver higher throughput and lower latency. These features include:

  • HDFS short-circuit reads to bypass HDFS and read files directly from the filesystem
  • OS buffer cache to read frequently accessed files directly from the cache instead of fetching them again
  • Fixed-cost file renames through metadata operations

In contrast, S3 is an object store that is accessed over the network. S3 throughput is nonetheless better than that of simple network-attached storage because of its dedicated, high-performance networks: in Cloudera’s internal benchmark testing (detailed below), we saw a consistent throughput of about 100MB/s on an r3.2xlarge. There is, however, currently no S3 equivalent to HDFS short-circuit reads. In addition, a move/rename operation on data stored in S3 is a copy followed by a delete, whereas a file move on HDFS is a metadata operation. The S3 behavior is usually problematic for ETL workloads, which create a large number of small files that are typically moved.
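To make the rename difference concrete, here is a minimal toy cost model in Python. It is only a sketch of the behavior described above, not how Impala, HDFS, or S3 are implemented: the names `RenameCost`, `hdfs_rename_cost`, and `s3_rename_cost` are hypothetical, and the model assumes one copy request and one delete request per object.

```python
from dataclasses import dataclass


@dataclass
class RenameCost:
    operations: int    # requests or metadata updates issued
    bytes_copied: int  # data bytes that have to be rewritten


def hdfs_rename_cost(file_sizes):
    """HDFS-style move: one metadata operation per file, no data movement."""
    return RenameCost(operations=len(file_sizes), bytes_copied=0)


def s3_rename_cost(file_sizes):
    """S3-style move: copy each object to its new key, then delete the original.

    Assumption: exactly one copy and one delete request per object.
    """
    return RenameCost(operations=2 * len(file_sizes),
                      bytes_copied=sum(file_sizes))


# Hypothetical ETL output: 10,000 small files of 1 MiB each.
etl_files = [1024 * 1024] * 10_000
print("HDFS:", hdfs_rename_cost(etl_files))
print("S3:  ", s3_rename_cost(etl_files))
```

In this model the HDFS cost depends only on the number of files, while the S3 cost also grows with the total number of bytes moved, which is why a move-heavy ETL job behaves so differently on the two.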

With respect to use cases, typically Impala on S3 is used for transient clusters that query data at rest and for reporting on large volumes of data at rest. Impala on EBS or attached storage, however, is typically used for long-running Impala clusters and for interactive reporting on hot data.

Keeping these differences in mind, let’s look at some benchmarks.

Benchmarks

Recently, we ran Impala on both S3 and EBS (GP2) with a TPC-DS-derived benchmark on a 3TB dataset, using the 70 out of 99 queries that run without any modifications. The schema, as defined by TPC-DS, comprised fact tables partitioned on the date column.

On a 32-node r3.2xlarge cluster, Impala on EBS was 2.4x faster than on S3:

tpcds-s3-ebs

Running Impala on the same number of nodes on EBS was also slightly cheaper than on S3. (Storage cost for S3 and EBS is calculated as the price for storing the dataset for the duration of the test [data set size x $/GB for test duration].)

multi-testcost
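The storage cost formula above (data set size x $/GB for test duration) can be sketched as follows. This is an illustration under stated assumptions rather than the exact calculation used for the benchmark: it assumes storage is priced per GB-month and prorated by the hour over a 730-hour month, and the prices in the example are placeholders, not actual AWS prices.

```python
HOURS_PER_MONTH = 730  # assumption: average month length used for proration


def storage_cost_for_test(dataset_gb, price_per_gb_month, test_hours):
    """Cost of keeping the dataset stored only for the duration of the test."""
    return dataset_gb * price_per_gb_month * (test_hours / HOURS_PER_MONTH)


# Placeholder inputs: a 3,000 GB dataset, a 2-hour test run, and two
# made-up per-GB-month prices standing in for two storage options.
for label, price in [("option A", 0.03), ("option B", 0.10)]:
    cost = storage_cost_for_test(3000, price, test_hours=2.0)
    print(f"{label}: ${cost:.2f}")
```

Because the storage term is prorated by test duration, a faster run lowers the storage cost attributed to the test as well as the compute cost.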

Scalability

To recap our previous performance testing results, doubling Impala cluster size on AWS doubles performance for roughly the same cost.

Impala’s decoupled architecture allows you to grow or shrink your compute independently of the amount of data stored in S3. In the case of our single-user benchmark, doubling cluster size (from 32 to 64 nodes) almost doubled (1.8x) performance for just an additional 12% cost.

impala-s3-f1

Our benchmark yielded even more exciting results for multiple users: when we doubled the cluster size (again from 32 to 64 nodes), performance exactly doubled and the end-to-end workload cost fell by 4%.

impala-s3-f3

The above result is great news for users because it demonstrates that cluster performance improves just by adding nodes. In this case, doubling the cluster size actually reduced end-to-end cost slightly.
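A back-of-the-envelope check shows why the scaling results above are plausible. The sketch below assumes compute cost is proportional to node count multiplied by runtime, and that runtime shrinks in proportion to the speedup; `relative_compute_cost` is a hypothetical helper, and storage cost is deliberately left out.

```python
def relative_compute_cost(node_factor, speedup):
    """Compute cost of the scaled cluster relative to the baseline.

    Assumes cost is proportional to nodes x runtime and runtime is
    proportional to 1 / speedup.
    """
    return node_factor / speedup


# Single-user case: 2x the nodes for a 1.8x speedup.
single_user = relative_compute_cost(2, 1.8)
# Multi-user case: 2x the nodes for a 2x speedup.
multi_user = relative_compute_cost(2, 2.0)

print(f"single user: {single_user:.2f}x baseline compute cost")
print(f"multi user:  {multi_user:.2f}x baseline compute cost")
```

Under these assumptions the single-user case costs roughly 11% more, in line with the reported 12%, and the multi-user case breaks even on compute. One possible contributor to the reported 4% saving is that storage is charged only for the test duration, which halves when performance doubles.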

(Note: Apart from the bandwidth difference between S3 and EBS, we noticed considerable variance in S3 response times: only the fastest 10% of requests completed within 43 ms, the median (50th percentile) was 166 ms, and the 99th percentile reached 4052 ms.)
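If you want to measure this kind of latency spread on your own workload, percentiles are straightforward to compute from a list of response times. The sketch below uses the nearest-rank definition of a percentile; the sample values are made up for illustration (chosen so that the output matches the percentiles quoted above) and are not the measured benchmark data.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of
    all samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


# Made-up response times in milliseconds, for illustration only.
latencies_ms = [43, 90, 120, 150, 166, 300, 700, 1500, 2500, 4052]

for p in (10, 50, 99):
    print(f"p{p}: {percentile(latencies_ms, p)} ms")
```

Looking at the tail (99th percentile) as well as the median matters here, because a query that issues many S3 requests is likely to hit at least one slow response.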

When to Use S3 vs. EBS

While EBS looks like the clear winner based on the above, the right storage choice for you will depend on your workload. The table below should help you make that calculation.

s3-ebs-tab

Conclusion

It seems clear from the above that Impala opens up new cloud-native architectures involving S3. (It is important to understand, though, that remote S3 storage doesn’t match the performance of local EBS storage.)

In upcoming releases, you can expect to see even better performance on S3 through optimizations to avoid S3 throttling and to perform better listing and caching. We will also be adding support for other object stores so you can use Impala to get insights directly from your object store of choice.

If you would like to contribute to Apache Impala (Incubating), please do get in touch with us!

Devadutta Ghat is a Senior Product Manager at Cloudera.

Mostafa Mokhtar is a Software Engineer at Cloudera, working on the Impala team.

Henry Robinson is a Software Engineer at Cloudera, and a member of the Apache Impala PMC.

Sailesh Mukil is a Software Engineer at Cloudera, and a member of the Apache Impala PMC.

 


One response on “Apache Impala (Incubating) on Amazon: Performance and Cost Considerations for S3 vs. EBS”

  1. Pettax

    Thank you for this blog post! To make it complete I think you should also include figures from using ephemeral disks (instance store volumes) such as when d2.8xlarge instances are used as worker nodes. This would be the third storage option to use on AWS and presumably the most performant. Or?
