Category Archives: Tools

implyr: R Interface for Apache Impala

Categories: CDH Data Science HBase HDFS Impala Kudu Tools

New R package implyr enables R users to query Impala using dplyr.

Apache Impala (incubating) enables low-latency interactive SQL queries on data stored in HDFS, Amazon S3, Apache Kudu, and Apache HBase. With the availability of the R package implyr on CRAN and GitHub, it’s now possible to query Impala from R using the popular package dplyr.

dplyr provides a grammar of data manipulation,

Read more

Quality Assurance at Cloudera: Highly-Controlled Disk Injection

Categories: CDH Testing Tools

Recently installed fault-injection techniques are making quality assurance processes yet more rigorous.

In a previous installment of our series about quality assurance inside Cloudera, we described the fault-injection frameworks (AgenTEST and Sapper) that Cloudera Engineering has devised. The fault-injection framework starts and stops injections, to determine when and how they should occur, respectively.

On that occasion, we presented a number of disk-related injections implemented in AgenTEST, including:

  • BurnIO: Runs disk-intensive processes,

Read more

Announcing hs2client, A Fast New C++ / Python Thrift Client for Impala and Hive

Categories: Data Science Hive Impala Tools

This new (alpha) C++ client library for Apache Impala (incubating) and Apache Hive provides high-performance data access from Python.

Earlier this year, members of the Python data tools and Impala teams at Cloudera began collaborating to create a new C++ library to eventually become a faster, more memory-efficient replacement for impyla, PyHive, and other (largely pure Python) client libraries for talking to Hive and Impala.

We are excited to release this effort,

Read more

Quality Assurance at Cloudera: Distributed Unit Testing

Categories: Kudu Testing Tools

Cloudera Engineering has developed (and recently open sourced) a distributed unit testing framework that cuts testing time from multiple hours to just 10 minutes.

Upstream unit tests are Cloudera’s first line of defense for finding and fixing software bugs, as part of a multidimensional process that also includes static/dynamic code analysis, fault injection, integration/scale/endurance testing, and validation on real workloads. However, running a full unit test suite for Apache Hadoop ecosystem components can take hours,

Read more

How-to: Create and Use a Custom Formatter in the Apache HBase Shell

Categories: Avro HBase How-to Tools

Learn how improve Apache HBase usability by creating a custom formatter for viewing binary data types in the HBase shell.

Cloudera customers are looking to store complex data types in Apache HBase to provide fast retrieval of complex information such as banking transactions, web analytics records, and related metadata associated with those records. Serialization formats such as Apache Avro, Thrift, and Protocol Buffers greatly assist in meeting this goal,

Read more