Cloudera Engineering Blog · General Posts

BigBench: Toward An Industry-Standard Benchmark for Big Data Analytics

Learn about BigBench, the new industrywide effort to create a sorely needed Big Data benchmark.

Benchmarking Big Data systems is an open problem. To address this concern, numerous hardware and software vendors are working together to create a comprehensive end-to-end Big Data benchmark suite called BigBench. BigBench builds upon and borrows elements from existing benchmarking efforts in the Big Data space (such as YCSB, TPCx-HS, GridMix, PigMix, HiBench, Big Data Benchmark, and TPC-DS). Intel and Cloudera, along with other industry partners, are working to define and implement extensions to BigBench 1.0. (A TPC proposal for BigBench 2.0 is in the works.)

Using Impala, Amazon EMR, and Tableau to Analyze and Visualize Data

Our thanks to AWS Solutions Architect Rahul Bhartia for allowing us to republish his post below.

Apache Hadoop provides a great ecosystem of tools for extracting value from data in various formats and sizes. Originally focused on large-batch processing with tools like MapReduce, Apache Pig, and Apache Hive, Hadoop now provides many tools for running interactive queries on your data, such as Impala, Drill, and Presto. This post shows you how to use Amazon Elastic MapReduce (Amazon EMR) to analyze a data set available on Amazon Simple Storage Service (Amazon S3) and then use Tableau with Impala to visualize the data.

The Definitive "Getting Started" Tutorial for Apache Hadoop + Your Own Demo Cluster

Using this new tutorial alongside Cloudera Live is now the fastest, easiest, and most hands-on way to get started with Hadoop.

At Cloudera, developer enablement is one of our most important objectives. One only has to look at history (Java or SQL, for example) to know that knowledge fuels the ecosystem. That objective is what drives initiatives such as our community forums, the Cloudera QuickStart VM, and this blog itself.

Community Meetups during Strata + Hadoop World 2014

The meetup opportunities during the conference week are more expansive than ever — spanning Impala, Spark, HBase, Kafka, and more.

Strata + Hadoop World 2014 is a kaleidoscope of experiences for attendees, and those experiences aren’t contained within the conference center’s walls. For example, the meetups that occur during the conference week (which is concurrent with NYC DataWeek) are a virtual track for developers, and with Strata + Hadoop World being bigger than ever, so is the scope of that track.

How-to: Install CDH on Mac OSX 10.9 Mavericks

This overview will cover the basic tarball setup for your Mac.

If you’re an engineer building applications on CDH and getting familiar with its rich features for designing the next big solution, a native Mac OSX install becomes essential. Sure, you may argue that your MBP, with its four-core, hyper-threaded i7, SSD, and 16GB of DDR3 memory, is sufficient for spinning up a VM, and in most instances, such as using a VM for a quick demo, you’re right. However, when experimenting with a heavier, more resource-intensive workload, you’ll want to explore a native install.

Bayesian Machine Learning on Apache Spark

Markov Chain Monte Carlo methods are another example of useful statistical computation for Big Data that is capably enabled by Apache Spark.

During my internship at Cloudera, I have been working on integrating PyMC with Apache Spark. PyMC is an open source Python package that allows users to easily apply Bayesian machine learning methods to their data, while Spark is a new, general framework for distributed computing on Hadoop. Together, they provide a scalable framework for Markov Chain Monte Carlo (MCMC) methods. In this blog post, I am going to describe my work on distributing large-scale graphical models and MCMC computation.

Cloudera Live: The Instant Apache Hadoop Experience

Get started with Apache Hadoop and use-case examples online in just seconds.

Today, we announced Cloudera Live, a new online service for developers and analysts (currently in public beta) that makes it easy to learn, explore, and try out CDH, Cloudera’s open source software distribution containing Apache Hadoop and related projects. No downloads, no installations, no waiting — just point-and-play!

How-to: Implement Role-based Security in Impala using Apache Sentry

This quick demo illustrates how easy it is to implement role-based access control in Impala using Sentry.

Apache Sentry (incubating) is the Apache Hadoop ecosystem tool for role-based access control (RBAC). In this how-to, I will demonstrate how to implement Sentry for RBAC in Impala. I feel this introduction is best motivated by a use case.

Apache Spark: A Delight for Developers

Sure, Spark is fast, but it also gives developers a positive experience they won’t soon forget.

Apache Spark is well known today for its performance benefits over MapReduce, as well as its versatility. However, another important benefit – the elegance of the development experience – gets less mainstream attention.

How-to: Index and Search Multilingual Documents in Hadoop

Learn how to use Cloudera Search along with RBL-JE to search and index documents in multiple languages.

Our thanks to Basis Technology for providing the how-to below!
