Tag Archives: sizing

Latest Impala Cookbook

Categories: Impala

Over the past year, and across several releases, Apache Impala (incubating) has added numerous new features and performance enhancements that better enable high-performance SQL analytics over big data. It is therefore time for another update to the Impala cookbook, which contains best practices for these new features, updated guidelines, and more detailed examples.

Note: This cookbook does not yet capture best practices for the major new advancements available with the recent GA of Kudu.


How-to: Tune Your Apache Spark Jobs (Part 2)

Categories: How-to Spark

In the conclusion to this series, learn how resource tuning, parallelism, and data representation affect Spark job performance.

In this post, we’ll finish what we started in “How to Tune Your Apache Spark Jobs (Part 1)”. I’ll try to cover pretty much everything you could care to know about making a Spark program run fast. In particular, you’ll learn about resource tuning, or configuring Spark to take advantage of everything the cluster has to offer.


How-to: Deploy and Configure Apache Kafka in Cloudera Enterprise

Categories: How-to Kafka

With Kafka now formally integrated with, and supported as part of, Cloudera Enterprise, what’s the best way to deploy and configure it?

Earlier today, Cloudera announced that, following an incubation period in Cloudera Labs, Apache Kafka is now fully integrated into Cloudera’s Big Data platform, Cloudera Enterprise (CDH + Cloudera Manager). Our customers have expressed strong interest in Kafka, and some are already running Kafka in production.


The Impala Cookbook

Categories: Impala

Bookmark this new living document to ensure use of current and proper configuration, sizing, management, and measurement practices.

Impala, the open source MPP analytic database for Apache Hadoop, is now firmly entrenched in the Big Data mainstream. How do we know this? For one, Impala is now the standard against which alternatives measure themselves, based on a proliferation of new benchmark testing. Furthermore, Impala has been adopted by multiple vendors as their solution for letting customers do exploratory analysis on Big Data,


Hadoop/HBase Capacity Planning

Categories: Hadoop HBase HDFS MapReduce ZooKeeper

Apache Hadoop and Apache HBase are gaining popularity due to their flexibility and the tremendous work that has been done to simplify their installation and use. This post provides guidance on sizing your first Hadoop/HBase cluster. First, there are significant differences between Hadoop and HBase usage. Hadoop MapReduce is primarily an analytic tool for running analytic and data extraction queries over all of your data, or at least a significant portion of them (data is a plural of datum).
