Introducing Cloudera Labs: An Open Look into Cloudera Engineering R&D


Cloudera Labs contains ecosystem innovations that one day may bring developers more functionality or productivity in CDH.

Since its inception, one of the defining characteristics of Apache Hadoop has been its ability to evolve and reinvent itself while continuing to thrive. For example, two years ago, nobody could have predicted that the formative MapReduce engine, one of the cornerstones of “original” Hadoop, would be marginalized or even replaced. Yet today, that appears to be happening via Apache Spark, with Hadoop becoming the stronger for it. Similarly, we’ve seen other relatively new components, like Impala, Apache Parquet (incubating), and Apache Sentry (also incubating), become widely adopted in relatively short order.

This unique characteristic requires Cloudera to be highly sensitive to new activity at the “edges” of the ecosystem — in other words, to be vigilant for the abrupt arrival of new developer requirements, and new components or features that meet them. (In fact, Cloudera employees are often the creators of such solutions.) When there is sufficient market interest and customer success with them seems assured, these new components often join the Cloudera platform as shipping product.

Today, we are announcing a new program that externalizes this thought process: Cloudera Labs. Cloudera Labs is a virtual container for innovations being incubated within Cloudera Engineering, with the goal of bringing more use cases, productivity, or other types of value to developers by constantly exploring new solutions for their problems. Although Labs initiatives are not supported or intended for production use, you may find them interesting for experimentation or personal projects, and we encourage your feedback about their usefulness to you. (Note that inclusion in Cloudera Labs is not a precondition for productization, either.)

Apache Kafka is among the “charter members” of this program. It originated just a couple of years ago as proprietary LinkedIn infrastructure for highly scalable and resilient real-time data transport, and it is now one of the hottest projects associated with Hadoop. To stimulate feedback about Kafka’s role in enterprise data hubs, today we are making a Kafka-Cloudera Labs parcel (unsupported) available for installation.

Other initial Labs projects include:

  • Exhibit
    Exhibit is a library of Apache Hive UDFs that let you treat array fields within a Hive row as if they were “mini-tables” and then execute SQL statements against them for deeper analysis.
  • Hive-on-Spark Integration
    A broad community effort is underway to bring Apache Spark-based data processing to Apache Hive, reducing query latency considerably and allowing IT to further standardize on Spark for data processing.
  • Impyla
    Impyla is a Python (2.6 and 2.7) client for Impala, the open source MPP query engine for Hadoop. It communicates with Impala using the same standard protocol as ODBC/JDBC drivers.
  • Oryx
    Oryx, a project jointly spearheaded by Cloudera Engineering and Intel, provides simple, real-time infrastructure for large-scale machine learning/predictive analytics applications.
  • RecordBreaker
    RecordBreaker, a project jointly developed by Hadoop co-founder Mike Cafarella and Cloudera, automatically turns your text-formatted data into structured Avro data, dramatically reducing data prep time.
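
To give a flavor of what experimenting with one of these projects might look like, here is a minimal sketch of querying through Impyla. It assumes Impyla's DB-API 2.0 style interface (`impala.dbapi.connect`) and the usual HiveServer2 port for Impala, 21050. The host name and the helper functions `run_query` and `connect_to_impala` are hypothetical names made up for illustration. Because the query helper relies only on standard DB-API cursor calls, the demo uses Python's built-in sqlite3 as a stand-in so it can run without a cluster.

```python
import sqlite3


def run_query(conn, sql):
    """Run one SQL statement over any DB-API 2.0 connection and return all rows."""
    cursor = conn.cursor()
    try:
        cursor.execute(sql)
        return cursor.fetchall()
    finally:
        cursor.close()


def connect_to_impala(host, port=21050):
    """Open an Impyla connection (needs the impyla package and a reachable daemon)."""
    # Imported lazily so the rest of this sketch runs without impyla installed.
    from impala.dbapi import connect
    return connect(host=host, port=port)


if __name__ == "__main__":
    # Stand-in for a real cluster: sqlite3 exposes the same DB-API cursor calls.
    demo = sqlite3.connect(":memory:")
    demo.execute("CREATE TABLE events (id INTEGER, name TEXT)")
    demo.execute("INSERT INTO events VALUES (1, 'click')")
    demo.execute("INSERT INTO events VALUES (2, 'view')")
    print(run_query(demo, "SELECT id, name FROM events ORDER BY id"))
    # Against Impala (hypothetical host name):
    # conn = connect_to_impala("impalad.example.com")
    # print(run_query(conn, "SHOW TABLES"))
```

Keeping the query helper independent of the driver means the same code can be pointed at an Impala daemon later by swapping in `connect_to_impala`.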

As time goes on, and some of the projects potentially graduate into CDH components (or otherwise remain as Labs projects), more names will join the list. And of course, we’re always interested in hearing your suggestions for new Labs projects.