Cloudera Engineering Blog
Big Data best practices, how-to's, and internals from Cloudera Engineering and the community
Our thanks to Mayur Rustagi (@mayur_rustagi), CTO at Sigmoid Analytics, for allowing us to re-publish his post about the Spork (Pig-on-Spark) project below. (Related: Read about the ongoing upstream to bring Spark-based data processing to Hive here.)
Analysts can talk about data insights all day (and night), but the reality is that 70% of all data analyst time goes into data processing and not analysis. At Sigmoid Analytics, we want to streamline this data processing pipeline so that analysts can truly focus on value generation and not data preparation.
The key to getting the most out of Spark is to understand the differences between its RDD API and the original Mapper and Reducer API.
Venerable MapReduce has been Apache Hadoop‘s work-horse computation paradigm since its inception. It is ideal for the kinds of work for which Hadoop was originally designed: large-scale log processing, and batch-oriented ETL (extract-transform-load) operations.
The versatility of Apache Spark’s API for both batch/ETL and streaming workloads brings the promise of lambda architecture to the real world.
Few things help you concentrate like a last-minute change to a major project.
Markov Chain Monte Carlo methods are another example of useful statistical computation for Big Data that is capably enabled by Apache Spark.
During my internship at Cloudera, I have been working on integrating PyMC with Apache Spark. PyMC is an open source Python package that allows users to easily apply Bayesian machine learning methods to their data, while Spark is a new, general framework for distributed computing on Hadoop. Together, they provide a scalable framework for scalable Markov Chain Monte Carlo (MCMC) methods. In this blog post, I am going to describe my work on distributing large-scale graphical models and MCMC computation.
Markov Chain Monte Carlo Methods
Impala 2.0 will add much more complete SQL functionality to what is already the fastest SQL-on-Hadoop solution available.
In September 2013, we provided a roadmap for Impala — the open source MPP SQL query engine for Apache Hadoop, which was on release 1.1 at the time — that documented planned functionality through release 2.0 and beyond.
Our thanks to Rakesh Rao of Quaero, for allowing us to re-publish the post below about Quaero’s experiences using partitioning in Apache Hive.
In this post, we will talk about how we can use the partitioning features available in Hive to improve performance of Hive queries.
Congratulations to Hari Shreedharan, Cloudera software engineer and Apache Flume committer/PMC member, for the early release of his new O’Reilly Media book, Using Flume: Stream Data into HDFS and HBase. It’s the seventh Hadoop ecosystem book so far that was authored by a current or former Cloudera employee (but who’s counting?).
Why did you decide to write this book?
The Transaction Processing Council (TPC), working with Cloudera, recently announced the new TPCx-HS benchmark, a good first step toward providing a Big Data benchmark.
In this interview by Roberto Zicari with Francois Raab, the original author of the TPC-C Benchmark, and Yanpei Chen, a Performance Engineer at Cloudera, the interviewees share their thoughts on the next step for benchmarks that reflect real-world use cases.
The following post was written by Jay Vyas (@jayunit100) and originally published in the Gluster.org Community.
I have recently spent some time getting Cloudera’s CDH 5 distribution of Apache Hadoop to work on GlusterFS 3.3 using Distributed Replicated 2 Volumes. This is made possible by the fact that Apache Hadoop has a pluggable filesystem architecture that allows the computational components within the CDH 5 distribution to be configured to use alternative filesystems to HDFS. In this case, one can configure CDH 5 to use the Hadoop FileSystem plugin for GlusterFS (glusterfs-hadoop), which allows it to run on GlusterFS 3.3. I’ve provided a diagram below that illustrates the CDH 5 core processes and how they interact with GlusterFS.
The ability to quickly and accurately count complex events is a legitimate business advantage.
In our work as data scientists, we spend most of our time counting things. It is the foundational skill that is used in data cleansing, reporting, feature engineering, and simple-but-effective machine learning models like Naive Bayes classifiers. Hilary Mason has a quote about the benefits of counting that I love: