Author Archives: Tom White

Spark Dataflow Joins Google’s Dataflow SDK

Categories: Cloud Cloudera Labs General Spark

Spark Dataflow from Cloudera Labs is now part of Google’s new Dataflow SDK, which will be proposed to the Apache Incubator.

Spark Dataflow is an experimental implementation of Google’s Dataflow programming model that runs on Apache Spark. The initial implementation was written by Josh Wills and entered Cloudera Labs exactly a year ago. Since then, we’ve seen a number of contributions to the project, culminating in the recent addition of a streaming implementation (running on Spark Streaming) by Amit Sela from PayPal.


Apache Hadoop 2.0.3-alpha Released

Categories: General Hadoop HDFS MapReduce YARN

Last week the Apache Hadoop PMC voted to release Apache Hadoop 2.0.3-alpha, the latest in the Hadoop 2 release series. This release fixes over 500 issues (covering the Common, HDFS, MapReduce and YARN sub-projects) since the 2.0.2-alpha release in October last year. In addition to bug fixes and general improvements, the more noteworthy changes include:

  • HDFS High Availability (HA) can now use a Quorum Journal Manager (QJM) for sharing namenode edit logs (HDFS-3077).


Apache Hadoop 2.0.2-alpha Released

Categories: Community Hadoop MapReduce

Earlier this month the Apache Hadoop PMC released Apache Hadoop 2.0.2-alpha, which fixes over 600 issues since the previous release in the 2.0 series, 2.0.1-alpha, back in July. This is a tremendous rate of development, of which all contributors to the project should feel proud.

Some of the more noteworthy changes in this release include:

  • HDFS HA supports automatic failover using ZooKeeper (HDFS-3042).


Apache Hadoop 0.23.0 has been released

Categories: General Hadoop

The Apache Hadoop PMC has voted to release Apache Hadoop 0.23.0. This release is significant because it is the first major release of Hadoop in over a year and incorporates many new features and improvements over the 0.20 release series. The biggest new features are HDFS federation and a new MapReduce framework. There is also a new build system (Maven), Kerberos HTTP SPNEGO support, and some significant performance improvements, which we’ll be covering in future posts.


Snappy and Hadoop

Categories: Community General Hadoop

Snappy is a compression library developed at Google, and, like many technologies that come from Google, Snappy was designed to be fast. The trade-off is that its compression ratio is not as high as that of other compression libraries. From the Snappy homepage:

… compared to the fastest mode of zlib, Snappy is an order of magnitude faster for most inputs, but the resulting compressed files are anywhere from 20% to 100% bigger.
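
To give a feel for the kind of speed-versus-size trade-off the quote describes, here is a minimal Python sketch. It does not use Snappy at all (Snappy is not in the Python standard library); instead it compares zlib's fastest mode (level 1), which is the baseline the quote mentions, with zlib's highest-ratio mode (level 9). The `measure` helper and the sample data are hypothetical, made up for illustration, and the timings will vary by machine and input.

```python
import time
import zlib


def measure(data, level):
    """Return (compressed size in bytes, seconds taken) for one zlib level."""
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    # Sanity check (outside the timed region): compression must be lossless.
    assert zlib.decompress(compressed) == data
    return len(compressed), elapsed


# Hypothetical sample input: repetitive, log-like text.
sample = b"".join(b"INFO block %d replicated\n" % i for i in range(50000))

fast_size, fast_time = measure(sample, 1)  # zlib's fastest mode
best_size, best_time = measure(sample, 9)  # zlib's highest-ratio mode

print(f"input:   {len(sample)} bytes")
print(f"level 1: {fast_size} bytes in {fast_time:.4f}s")
print(f"level 9: {best_size} bytes in {best_time:.4f}s")
# Relative size of the fast output compared with the high-ratio output.
diff = 100 * (fast_size - best_size) / best_size
print(f"level 1 size relative to level 9: {diff:+.0f}%")
```

Snappy sits further along the same curve: per the quote, it gives up more compression ratio than zlib's fastest mode in exchange for considerably more speed.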
