Cloudera Engineering Blog

Big Data best practices, how-to's, and internals from Cloudera Engineering and the community


Big Data Benchmarks: Toward Real-Life Use Cases

The Transaction Processing Council (TPC), working with Cloudera, recently announced the new TPCx-HS benchmark, a good first step toward providing a Big Data benchmark.

In this interview by Roberto Zicari with Francois Raab, the original author of the TPC-C Benchmark, and Yanpei Chen, a Performance Engineer at Cloudera, the interviewees share their thoughts on the next step for benchmarks that reflect real-world use cases.

Running CDH 5 on GlusterFS 3.3

The following post was written by Jay Vyas (@jayunit100) and originally published in the Gluster.org Community.

I have recently spent some time getting Cloudera’s CDH 5 distribution of Apache Hadoop to work on GlusterFS 3.3 using Distributed Replicated 2 Volumes. This is made possible by the fact that Apache Hadoop has a pluggable filesystem architecture that allows the computational components within the CDH 5 distribution to be configured to use alternative filesystems to HDFS. In this case, one can configure CDH 5 to use the Hadoop FileSystem plugin for GlusterFS (glusterfs-hadoop), which allows it to run on GlusterFS 3.3. I’ve provided a diagram below that illustrates the CDH 5 core processes and how they interact with GlusterFS.

How-to: Count Events Like a Data Scientist

The ability to quickly and accurately count complex events is a legitimate business advantage.

In our work as data scientists, we spend most of our time counting things. It is the foundational skill that is used in data cleansing, reporting, feature engineering, and simple-but-effective machine learning models like Naive Bayes classifiers. Hilary Mason has a quote about the benefits of counting that I love:

Apache Hadoop 2.5.0 is Released

The Apache Hadoop community has voted to release Apache Hadoop 2.5.0.

Apache Hadoop 2.5.0 is a minor release in the 2.x release line and includes some major features and improvements, including:

How-to: Use IPython Notebook with Apache Spark

IPython Notebook and Spark’s Python API are a powerful combination for data science.

The developers of Apache Spark have given thoughtful consideration to Python as a language of choice for data analysis. They have developed the PySpark API for working with RDDs in Python, and further support using the powerful IPythonshell instead of the builtin Python REPL.

New in CDH 5.1: HDFS Read Caching

Applications using HDFS, such as Impala, will be able to read data up to 59x faster thanks to this new feature.

Server memory capacity and bandwidth have increased dramatically over the last few years. Beefier servers make in-memory computation quite attractive, since a lot of interesting data sets can fit into cluster memory, and memory is orders of magnitude faster than disk.

This Month in the Ecosystem (July 2014)

Welcome to our 11th edition of “This Month in the Ecosystem,” a digest of highlights from July 2014 (never intended to be comprehensive; for that, see the excellent Hadoop Weekly).

Progress Report: Cloudera Community Forums After One Year

Cloudera Community forums are proving their value as an important contributor to a rich user experience.

It’s been almost exactly one year since the debut of the Cloudera Community forums. In addition to doing the birthday shout-out, I thought it would be interesting to bring you up to date about adoption and usage patterns.

Meet the Engineer: Sravya Tirukkovalur

Meet Sravya Tirukkovalur (@sravsatuluri), a Software Engineer working on Apache Hadoop security at Cloudera.

What do you do at Cloudera, and in which Apache projects are you involved?

New in CDH 5.1: Hue’s Improved Search App

An improved Search app in Hue 3.6 makes the Hadoop user experience even better.

Hue 3.6 (now packaged in CDH 5.1) has brought the second version of the Search App up to even higher standards. The user experience has been greatly improved, as the app now provides a very easy way to build custom dashboards and visualizations.

Newer Posts Older Posts