Before CDH 5.10, every CDH cluster had to have its own Apache Hive Metastore (HMS) backend database. This model is ideal for clusters where each cluster contains the data locally along with the metadata. In the cloud, however, many CDH clusters run directly on a shared object store (like Amazon S3), making it possible for the data to live across multiple clusters and beyond any cluster’s lifespan. In this scenario clusters need to regenerate and coordinate metadata for the underlying shared data individually.
Over the past year (and through several releases), Apache Impala (incubating) has added numerous new features and performance enhancements better enabling high-performance SQL analytics over big data. Thus, it is time again for an update to the Impala cookbook, which contains best practices for these new features, updated guidelines, and more detailed examples.
Note: This cookbook does not yet capture best practices for the major new advancements available with the recent GA of Kudu.
After the GA of Apache Kudu in Cloudera CDH 5.10, we take a look at the Apache Spark on Kudu integration, share code snippets, and explain how to get up and running quickly, as Kudu is already a first-class citizen in Spark’s ecosystem.
As the Apache Kudu development team celebrates the initial 1.0 release launched on September 19, and the most recent 1.2.0 version now GA as part of Cloudera’s CDH 5.10 release,
Apache Impala (incubating) includes several features that allow you to restrict or allocate resources so as to maximize stability and performance for your Impala workloads. You can limit both CPU and memory resources used by Impala to manage and prioritize jobs on CDH clusters. This blog post describes the techniques a typical Impala deployment can use to manage its resources.
Static Service Pools
Static service pools isolate services from one another, so that a high load on one service has limited impact on other services.
Impala users can expect new performance and usability benefits via improved integration with Kudu.
It’s been nearly one year since the public beta announcement of Kudu (now a top-level Apache project) and a noteworthy milestone has been reached: its 1.0 release. This is particularly exciting as Kudu extends the use cases that can be supported on the Apache Hadoop platform, whether it be on-premises or in the cloud,