Category Archives: Hadoop

Converting Apache Avro Data to Parquet Format in Apache Hadoop

Categories: Avro Guest Hadoop Parquet

Thanks to Big Data Solutions Architect Matthieu Lieber for allowing us to republish the post below.

A customer of mine wants to take advantage of both worlds: work with his existing Apache Avro data, with all of the advantages that it confers, but take advantage of the predicate push-down features that Parquet provides. How to reconcile the two?

For more information about combining these formats,

Read More

Understanding HDFS Recovery Processes (Part 2)

Categories: Hadoop HDFS

Having a good grasp of HDFS recovery processes is important when running or moving toward production-ready Apache Hadoop. In the conclusion to this two-part post, pipeline recovery is explained.

An important design requirement of HDFS is to ensure continuous and correct operations that support production deployments. For that reason, it’s important for operators to understand how HDFS recovery processes work. In Part 1 of this post, we looked at lease recovery and block recovery.

Read More

Understanding HDFS Recovery Processes (Part 1)

Categories: Hadoop HDFS

Having a good grasp of HDFS recovery processes is important when running or moving toward production-ready Apache Hadoop.

An important design requirement of HDFS is to ensure continuous and correct operations to support production deployments. One particularly complex area is ensuring correctness of writes to HDFS in the presence of network and node failures, where the lease recovery, block recovery, and pipeline recovery processes come into play. Understanding when and why these recovery processes are called,

Read More

Couchdoop: Couchbase Meets Apache Hadoop

Categories: Guest Hadoop

Thanks to Călin-Andrei Burloiu, Big Data Engineer at antivirus company Avira, and Radu Pastia, Senior Software Developer in the Big Data Team at Orange, for the guest post below about the Couchdoop connector for bringing Couchbase data into Hadoop.

Couchdoop is a Couchbase connector for Apache Hadoop, developed by Avira on CDH, that allows for easy, parallel data transfer between Couchbase and Hadoop storage engines. It includes a command-line tool,

Read More

This Month in the Ecosystem (January 2015)

Categories: Community Hadoop

Welcome to our 16th edition of “This Month in the Ecosystem,” a digest of highlights from January 2015 (never intended to be comprehensive; for that, see the excellent Hadoop Weekly). 

You may have noticed that this report went on hiatus for December 2014 due to a lack of critical news mass (plus, we realize that most of you are out of the loop until mid-January).

Read More