Category Archives: Performance

Analytics and BI on Amazon S3 with Apache Impala (Incubating)

Categories: Cloud Impala Ops and DevOps Performance

Thanks to new optimizations for running Impala on Amazon S3, doubling cluster size on AWS doubles multi-user performance while keeping total workload cost roughly the same.

With public-cloud deployments becoming increasingly popular, Cloudera is continuing to build out the capabilities of its platform to best take advantage of the cost-effective and flexible nature of the cloud. The current release of Cloudera’s platform (5.8) includes a major step forward in that area with Impala 2.6 able to store and query data directly from the Amazon S3 object store.

Read more

New Study: Evaluating Apache HBase Performance on Modern Storage Media

Categories: Guest Hardware HBase Performance

For the first time, this new study by Intel software engineers analyzes the performance impact of using Apache HBase on various modern storage technologies.

As more “fast” storage technologies (such as SSD and NVMe SSD) emerge, organizations with big data use cases want to make better use of them to achieve better throughput and latency. But to this point, there have been no detailed analyses published about the true significance of that performance boost,

Read more

How-to: Improve Apache HBase Performance via Data Serialization with Apache Avro

Categories: Avro HBase Performance

Taking a thoughtful approach to data serialization can achieve significant performance improvements for HBase deployments.

The question of using tall versus wide tables in Apache HBase is a commonly discussed design pattern (see reference here and here). However, there are more considerations here than making that simple choice. Because HBase stores each column of a table as an independent row in the underlying HFiles, significant storage overhead can occur when storing small pieces of information.

Read more

Apache Impala (incubating) in CDH 5.7: 4x Faster for BI Workloads on Apache Hadoop

Categories: CDH Impala Performance

Impala 2.5, now shipping in CDH 5.7, brings significant performance improvements and some highly requested features.

Impala has proven to be a high-performance analytics query engine since the beginning. Even as an initial production release in 2013, it demonstrated performance 2x faster than a traditional DBMS, and each subsequent release has continued to demonstrate the wide performance gap between Impala’s analytic-database architecture and SQL-on-Apache Hadoop alternatives.

Read more

Benchmarking Apache Parquet: The Allstate Experience

Categories: Avro Parquet Performance

Our thanks to Don Drake (@dondrake), an independent technology consultant who is currently working at Allstate Insurance, for the guest post below about his experiences comparing use of the Apache Avro and Apache Parquet file formats with Apache Spark.

Over the last few months, numerous hallway conversations, informal discussions, and meetings have occurred at Allstate about the relative merits of different file formats for data stored in Apache Hadoop—including CSV,

Read more