Cloudera Engineering Blog · Hardware Posts

Using Apache Hadoop and Impala with MySQL for Data Analysis

Thanks to Alexander Rubin of Percona for allowing us to re-publish the post below!

Apache Hadoop is commonly used for data analysis. It is fast for data loads and scalable. In a previous post I showed how to integrate MySQL with Hadoop. In this post I will show how to export a table from  MySQL to Hadoop, load the data to Cloudera Impala (columnar format), and run reporting on top of that. For the examples below, I will use the “ontime flight performance” data from my previous post.

The Truth About MapReduce Performance on SSDs

Cost-per-performance, not cost-per-capacity, turns out to be the better metric for evaluating the true value of SSDs.

In the Big Data ecosystem, solid-state drives (SSDs) are increasingly considered a viable, higher-performance alternative to rotational hard-disk drives (HDDs). However, few results from actual testing are available to the public.

How-to: Select the Right Hardware for Your New Hadoop Cluster

One of the first questions Cloudera customers raise when getting started with Apache Hadoop is how to select appropriate hardware for their new Hadoop clusters.

Although Hadoop is designed to run on industry-standard hardware, recommending an ideal cluster configuration is not as easy as delivering a list of hardware specifications. Selecting hardware that provides the best balance of performance and economy for a given workload requires testing and validation. (For example, users with IO-intensive workloads will invest in more spindles per core.)

Configuring Impala and MapReduce for Multi-tenant Performance

Cloudera Impala has many exciting features, but one of the most impressive is the ability to analyze data in multiple formats, with no ETL needed, in HDFS and Apache HBase. Furthermore, you can use multiple frameworks, such as MapReduce and Impala, to analyze that same data. Consequently, Impala will often run side-by-side with MapReduce on the same physical hardware, with both supporting business-critical workloads. For such multi-tenant clusters, Impala and MapReduce both need to perform well despite potentially conflicting demands for cluster resources.

In this post, we’ll share our experiences configuring Impala and MapReduce for optimal multi-tenant performance. Our goal is to help users understand how to tune their multi-tenant clusters to meet production service level objectives (SLOs), and to contribute to the community some test methods and performance models that can be helpful beyond Cloudera.

Defining Realistic Test Scenarios