Category Archives: Tools

Quality Assurance at Cloudera: Highly-Controlled Disk Injection

Categories: CDH Testing Tools

Recently installed fault-injection techniques are making quality assurance processes yet more rigorous.

In a previous installment of our series about quality assurance inside Cloudera, we described the fault-injection frameworks (AgenTEST and Sapper) that Cloudera Engineering has devised. The fault-injection framework starts and stops injections, to determine when and how they should occur, respectively.

On that occasion, we presented a number of disk-related injections implemented in AgenTEST, including:

  • BurnIO: Runs disk-intensive processes,

Read More

Announcing hs2client, A Fast New C++ / Python Thrift Client for Impala and Hive

Categories: Data Science Hive Impala Tools

This new (alpha) C++ client library for Apache Impala (incubating) and Apache Hive provides high-performance data access from Python.

Earlier this year, members of the Python data tools and Impala teams at Cloudera began collaborating to create a new C++ library to eventually become a faster, more memory-efficient replacement for impyla, PyHive, and other (largely pure Python) client libraries for talking to Hive and Impala.

Read More

Quality Assurance at Cloudera: Distributed Unit Testing

Categories: Kudu Testing Tools

Cloudera Engineering has developed (and recently open sourced) a distributed unit testing framework that cuts testing time from multiple hours to just 10 minutes.

Upstream unit tests are Cloudera’s first line of defense for finding and fixing software bugs, as part of a multidimensional process that also includes static/dynamic code analysis, fault injection, integration/scale/endurance testing, and validation on real workloads. However, running a full unit test suite for Apache Hadoop ecosystem components can take hours,

Read More

How-to: Create and Use a Custom Formatter in the Apache HBase Shell

Categories: Avro HBase How-to Tools

Learn how improve Apache HBase usability by creating a custom formatter for viewing binary data types in the HBase shell.

Cloudera customers are looking to store complex data types in Apache HBase to provide fast retrieval of complex information such as banking transactions, web analytics records, and related metadata associated with those records. Serialization formats such as Apache Avro, Thrift, and Protocol Buffers greatly assist in meeting this goal,

Read More

DistCp Performance Improvements in Apache Hadoop

Categories: CDH Hadoop HDFS Performance Tools

Recent improvements to Apache Hadoop’s native backup utility, which are now shipping in CDH, make that process much faster.

DistCp is a popular tool in Apache Hadoop for periodically backing up data across and within clusters. (Each run of DistCp in the backup process is referred to as a backup cycle.) Its popularity has grown in popularity despite relatively slow performance.

In this post, we’ll provide a quick introduction to DistCp.

Read More