Cloudera Engineering Blog
Big Data best practices, how-to's, and internals from Cloudera Engineering and the community
The Transaction Processing Council (TPC), working with Cloudera, recently announced the new TPCx-HS benchmark, a good first step toward providing a Big Data benchmark.
In this interview by Roberto Zicari with Francois Raab, the original author of the TPC-C Benchmark, and Yanpei Chen, a Performance Engineer at Cloudera, the interviewees share their thoughts on the next step for benchmarks that reflect real-world use cases.
The following post was written by Jay Vyas (@jayunit100) and originally published in the Gluster.org Community.
I have recently spent some time getting Cloudera’s CDH 5 distribution of Apache Hadoop to work on GlusterFS 3.3 using Distributed Replicated 2 Volumes. This is made possible by the fact that Apache Hadoop has a pluggable filesystem architecture that allows the computational components within the CDH 5 distribution to be configured to use alternative filesystems to HDFS. In this case, one can configure CDH 5 to use the Hadoop FileSystem plugin for GlusterFS (glusterfs-hadoop), which allows it to run on GlusterFS 3.3. I’ve provided a diagram below that illustrates the CDH 5 core processes and how they interact with GlusterFS.
The ability to quickly and accurately count complex events is a legitimate business advantage.
In our work as data scientists, we spend most of our time counting things. It is the foundational skill that is used in data cleansing, reporting, feature engineering, and simple-but-effective machine learning models like Naive Bayes classifiers. Hilary Mason has a quote about the benefits of counting that I love:
The Apache Hadoop community has voted to release Apache Hadoop 2.5.0.
Apache Hadoop 2.5.0 is a minor release in the 2.x release line and includes some major features and improvements, including:
IPython Notebook and Spark’s Python API are a powerful combination for data science.
The developers of Apache Spark have given thoughtful consideration to Python as a language of choice for data analysis. They have developed the PySpark API for working with RDDs in Python, and further support using the powerful IPythonshell instead of the builtin Python REPL.
Applications using HDFS, such as Impala, will be able to read data up to 59x faster thanks to this new feature.
Server memory capacity and bandwidth have increased dramatically over the last few years. Beefier servers make in-memory computation quite attractive, since a lot of interesting data sets can fit into cluster memory, and memory is orders of magnitude faster than disk.
Cloudera Community forums are proving their value as an important contributor to a rich user experience.
It’s been almost exactly one year since the debut of the Cloudera Community forums. In addition to doing the birthday shout-out, I thought it would be interesting to bring you up to date about adoption and usage patterns.
Meet Sravya Tirukkovalur (@sravsatuluri), a Software Engineer working on Apache Hadoop security at Cloudera.
What do you do at Cloudera, and in which Apache projects are you involved?
An improved Search app in Hue 3.6 makes the Hadoop user experience even better.
Hue 3.6 (now packaged in CDH 5.1) has brought the second version of the Search App up to even higher standards. The user experience has been greatly improved, as the app now provides a very easy way to build custom dashboards and visualizations.