HDFS erasure coding (EC), a major feature delivered in Apache Hadoop 3.0, is also available in CDH 6.1 for use in certain applications like Spark, Hive, and MapReduce. The development of EC has been a long collaborative effort across the wider Hadoop community. Including EC with CDH 6.1 helps customers adopt this new feature by […]
In this multipart series, fully explore the tangled ball of thread that is YARN. YARN (Yet Another Resource Negotiator) is the resource management layer for the Apache Hadoop ecosystem. YARN has been available for several releases, but many users still have fundamental questions about what YARN is, what it’s for, and how it works. This […]
In Part 1 of this series about Apache HBase snapshots, you learned how to use the new Snapshots feature and a bit of theory behind the implementation. Now, it’s time to dive into the technical details a bit more deeply. What is a Table? An HBase table comprises a set of metadata information and a set […]
“My library is in the classpath but I still get a Class Not Found exception in a MapReduce job” – If you have this problem this blog is for you. Java requires third-party and user-defined classes to be on the command line’s “–classpath” option when the JVM is launched. The `hadoop` wrapper shell script does […]
Small files are a big problem in Hadoop — or, at least, they are if the number of questions on the user list on this topic is anything to go by. In this post I’ll look at the problem, and examine some common solutions. Problems with small files and HDFS A small file is one […]