I recently had a chat with Benjamin Bengfort, a data scientist finishing his PhD at the University of Maryland, and Jenny Kim, a software engineer at Cloudera, about their forthcoming O’Reilly Media book (now in Early Access), Data Analytics with Hadoop: An Introduction for Data Scientists.
Why did you decide to write this book?
Ben: The content was originally part of a class that Jenny and I were teaching together.
Get an update on the progress of the effort to bring erasure coding to HDFS, including a report about fresh performance benchmark testing results.
About a year ago, the Apache Hadoop community began the HDFS-EC project to build native erasure coding support inside HDFS (currently targeted for the 2.9/3.0 release). Since then, we have designed and implemented basic functionalities in the first phase of the project under HDFS-7285,
Which topics interested attracted the most community interest in 2015? Find out below.
It’s our annual custom to bring you a list of this blog’s most popular posts of the year. (See the 2013 and 2014 versions.) Why? Because this list reflects interests across the ecosystem; it’s one of the best passive surveys we have, actually.
As usual, when drawing conclusions, be sure to account for data skew.
Recent improvements to Apache Hadoop’s native backup utility, which are now shipping in CDH, make that process much faster.
DistCp is a popular tool in Apache Hadoop for periodically backing up data across and within clusters. (Each run of DistCp in the backup process is referred to as a backup cycle.) Its popularity has grown in popularity despite relatively slow performance.
In this post, we’ll provide a quick introduction to DistCp.
Impala is designed to deliver insight on data in Apache Hadoop in real time. As data often lands in Hadoop continuously in certain use cases (such as time-series analysis, real-time fraud detection, real-time risk detection, and so on), it’s desirable for Impala to query this new “fast” data with minimal delay and without interrupting running queries.
In this blog post, you will learn an approach for continuous loading of data into Impala via HDFS,