Cloudera Engineering Blog
Big Data best practices, how-to's, and internals from Cloudera Engineering and the community
Hadoop Security is the latest book from Cloudera engineers in the Hadoop ecosystem books canon.
We are thrilled to announce the availability of the early release of Hadoop Security, a new book about security in the Apache Hadoop ecosystem published by O’Reilly Media. The early release contains two chapters on System Architecture and Securing Data Ingest and is available in O’Reilly’s catalog and in Safari Books.
Our thanks to Mayur Rustagi (@mayur_rustagi), CTO at Sigmoid Analytics, for allowing us to re-publish his post about the Spork (Pig-on-Spark) project below. (Related: Read about the ongoing upstream to bring Spark-based data processing to Hive here.)
Analysts can talk about data insights all day (and night), but the reality is that 70% of all data analyst time goes into data processing and not analysis. At Sigmoid Analytics, we want to streamline this data processing pipeline so that analysts can truly focus on value generation and not data preparation.
The key to getting the most out of Spark is to understand the differences between its RDD API and the original Mapper and Reducer API.
Venerable MapReduce has been Apache Hadoop‘s work-horse computation paradigm since its inception. It is ideal for the kinds of work for which Hadoop was originally designed: large-scale log processing, and batch-oriented ETL (extract-transform-load) operations.
The versatility of Apache Spark’s API for both batch/ETL and streaming workloads brings the promise of lambda architecture to the real world.
Few things help you concentrate like a last-minute change to a major project.
Markov Chain Monte Carlo methods are another example of useful statistical computation for Big Data that is capably enabled by Apache Spark.
During my internship at Cloudera, I have been working on integrating PyMC with Apache Spark. PyMC is an open source Python package that allows users to easily apply Bayesian machine learning methods to their data, while Spark is a new, general framework for distributed computing on Hadoop. Together, they provide a scalable framework for scalable Markov Chain Monte Carlo (MCMC) methods. In this blog post, I am going to describe my work on distributing large-scale graphical models and MCMC computation.
Markov Chain Monte Carlo Methods
Impala 2.0 will add much more complete SQL functionality to what is already the fastest SQL-on-Hadoop solution available.
In September 2013, we provided a roadmap for Impala — the open source MPP SQL query engine for Apache Hadoop, which was on release 1.1 at the time — that documented planned functionality through release 2.0 and beyond.
Our thanks to Rakesh Rao of Quaero, for allowing us to re-publish the post below about Quaero’s experiences using partitioning in Apache Hive.
In this post, we will talk about how we can use the partitioning features available in Hive to improve performance of Hive queries.
Congratulations to Hari Shreedharan, Cloudera software engineer and Apache Flume committer/PMC member, for the early release of his new O’Reilly Media book, Using Flume: Stream Data into HDFS and HBase. It’s the seventh Hadoop ecosystem book so far that was authored by a current or former Cloudera employee (but who’s counting?).
Why did you decide to write this book?
The Transaction Processing Council (TPC), working with Cloudera, recently announced the new TPCx-HS benchmark, a good first step toward providing a Big Data benchmark.
In this interview by Roberto Zicari with Francois Raab, the original author of the TPC-C Benchmark, and Yanpei Chen, a Performance Engineer at Cloudera, the interviewees share their thoughts on the next step for benchmarks that reflect real-world use cases.