Cloudera Engineering Blog · Hadoop Posts

The Truth About MapReduce Performance on SSDs

Cost-per-performance, not cost-per-capacity, turns out to be the better metric for evaluating the true value of SSDs.

In the Big Data ecosystem, solid-state drives (SSDs) are increasingly considered a viable, higher-performance alternative to rotational hard-disk drives (HDDs). However, few results from actual testing are available to the public.

This Month in the Ecosystem (February 2014)

Welcome to our sixth edition of “This Month in the Ecosystem,” a digest of highlights from February 2014 (never intended to be comprehensive; for completeness, see the excellent Hadoop Weekly).

February being a short month, the list is relatively short — but never confuse quantity with quality!

A Guide to Checkpointing in Hadoop

Understanding how checkpointing works in HDFS can make the difference between a healthy cluster and a failing one.

Checkpointing is an essential part of maintaining and persisting filesystem metadata in HDFS. It’s crucial for efficient NameNode recovery and restart, and is an important indicator of overall cluster health. However, checkpointing can also be a source of confusion for operators of Apache Hadoop clusters.
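For readers who want to see where checkpointing behavior is tuned, the relevant knobs live in `hdfs-site.xml`. The following is a minimal sketch (values shown are illustrative, not recommendations; consult your Hadoop version's `hdfs-default.xml` for defaults):

```xml
<!-- hdfs-site.xml: checkpoint tuning (illustrative values) -->
<configuration>
  <!-- Trigger a checkpoint at least every hour (seconds) -->
  <property>
    <name>dfs.namenode.checkpoint.period</name>
    <value>3600</value>
  </property>
  <!-- ...or sooner, once this many uncheckpointed transactions accumulate -->
  <property>
    <name>dfs.namenode.checkpoint.txns</name>
    <value>1000000</value>
  </property>
</configuration>
```

Whichever threshold is hit first (elapsed time or transaction count) triggers the next checkpoint.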

Apache Hadoop 2.3.0 is Released (HDFS Caching FTW!)

Hadoop 2.3.0 includes hundreds of new fixes and features, but none more important than HDFS caching.

The Apache Hadoop community has voted to release Hadoop 2.3.0, which includes (among many other things):

How-to: Make Hadoop Accessible via LDAP

Integrating Hue with LDAP can help make your secure Hadoop apps as widely consumed as possible.

Hue, the open source Web UI that makes Apache Hadoop easier to use, easily integrates with your corporation’s existing identity management systems and provides authentication mechanisms for SSO providers. So, by changing a few configuration parameters, your employees can start analyzing Big Data in their own browsers under an existing security policy.
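As a taste of what "changing a few configuration parameters" looks like, here is a minimal sketch of an LDAP setup in `hue.ini`. The server URL, base DN, and bind credentials below are placeholders for illustration; the exact options available depend on your Hue version and directory layout:

```ini
# hue.ini: authenticate Hue users against LDAP (placeholder values)
[desktop]
  [[auth]]
  # Switch from the default database backend to LDAP
  backend=desktop.auth.backend.LdapBackend

  [[ldap]]
  ldap_url=ldap://ldap.example.com
  base_dn="dc=example,dc=com"
  # Service account used to search for users (hypothetical DN)
  bind_dn="cn=hue-service,ou=accounts,dc=example,dc=com"
  bind_password=secret
```

After a restart, Hue authenticates logins against the directory instead of its own user database.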

Getting MapReduce 2 Up to Speed

Thanks to the improvements described here, CDH 5 will ship with a version of MapReduce 2 that is as fast as (or faster than) MapReduce 1.

Performance fixes are tiny, easy, and boring, once you know what the problem is. The hard work is in putting your finger on that problem: narrowing, drilling down, and measuring, measuring, measuring.

Cloudera Enterprise 5 Beta 2 is Available: More New Features and Components

Cloudera has released the Beta 2 version of Cloudera Enterprise 5 (comprising CDH 5.0.0 and Cloudera Manager 5.0.0).

This release (download) contains a number of new features and component versions including the ones below:

This Month in the Ecosystem (January 2014)

Welcome to our fifth edition of “This Month in the Ecosystem,” a digest of highlights from January 2014 (never intended to be comprehensive; for completeness, see the excellent Hadoop Weekly).

With the close of 2013, we also thought it appropriate to include some high points from across the year (not listed in any particular order):

How-to: Write and Run Apache Giraph Jobs on Apache Hadoop

Create a test environment for writing and testing Giraph jobs, or just for playing around with Giraph and small sample datasets.

Apache Giraph is a scalable, fault-tolerant implementation of graph-processing algorithms in Apache Hadoop clusters of up to thousands of computing nodes. Giraph is in use at companies like Facebook and PayPal, for example, to help represent and analyze the billions (or even trillions) of connections across massive datasets. Giraph was inspired by Google’s Pregel framework and integrates well with Apache Accumulo, Apache HBase, Apache Hive, and Cloudera Impala.

How-to: Create a Simple Hadoop Cluster with VirtualBox

Set up a CDH-based Hadoop cluster in less than an hour using VirtualBox and Cloudera Manager.

Thanks to Christian Javet for his permission to republish his blog post below!
