Erasure coding, a new feature in HDFS, can reduce storage overhead by approximately 50% compared to replication while maintaining the same durability guarantees. This post explains how it works.
By default, HDFS replicates each block three times. Replication provides a simple and robust form of redundancy that shields against most failure scenarios. It also makes compute tasks easier to schedule on locally stored data blocks, because there are multiple replicas of each block to choose from.
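To make the storage comparison concrete, here is a minimal sketch of the arithmetic. It assumes a Reed-Solomon layout with 6 data units and 3 parity units, which is an illustrative choice and not something stated in the excerpt above. The function names are hypothetical helpers, not part of any HDFS API.

```python
def replication_footprint(replicas: int) -> float:
    """Raw bytes stored per byte of user data under N-way replication."""
    return float(replicas)


def erasure_coding_footprint(data_units: int, parity_units: int) -> float:
    """Raw bytes stored per byte of user data under a (data, parity) code."""
    return (data_units + parity_units) / data_units


# Three-way replication, as described in the text.
replication = replication_footprint(3)

# Assumed layout for illustration: 6 data units + 3 parity units.
erasure_coded = erasure_coding_footprint(6, 3)

# Fraction of raw storage saved by switching from replication to the code.
savings = 1 - erasure_coded / replication

# Simultaneous losses each scheme can survive without losing data:
# replication survives losing all but one copy; the code survives
# losing as many units as it has parity units.
replication_tolerance = 3 - 1
erasure_coded_tolerance = 3

print(f"replication:    {replication:.1f}x raw storage, tolerates {replication_tolerance} losses")
print(f"erasure coding: {erasure_coded:.1f}x raw storage, tolerates {erasure_coded_tolerance} losses")
print(f"storage saved:  {savings:.0%}")
```

Under these assumptions the coded layout stores 1.5 bytes per byte of user data instead of 3, which is where a figure of roughly 50% comes from. Other data and parity combinations would give different numbers.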
In this multipart series, fully explore the tangled ball of thread that is YARN.
YARN (Yet Another Resource Negotiator) is the resource management layer for the Apache Hadoop ecosystem. YARN has been available for several releases, but many users still have fundamental questions about what YARN is, what it’s for, and how it works. This new series of blog posts is designed with the following goals in mind:
- Provide a basic understanding of the components that make up YARN
- Illustrate how a MapReduce job fits into the YARN model of computation
Learn about the new functionality coming aboard Cloudera Navigator, the trail-blazing solution for metadata management and lineage in Apache Hadoop.
More than two years ago, Cloudera introduced Cloudera Navigator 1.0, the first offering to unify auditing across enterprise Apache Hadoop deployments. About a year later, Cloudera released Cloudera Navigator 2.0, which introduced another first for Hadoop: comprehensive metadata management and lineage. Today, more than 200 customers across numerous industries use Cloudera Navigator in production to bring trust and visibility to their Hadoop deployments.
The Strata + Hadoop World NYC 2015 (Sept. 29-Oct. 3) agenda was published in the last few days. Congratulations to all accepted presenters!
In this post, I just want to provide a concise digest of the tutorials and sessions that will involve Cloudera or Intel engineers and/or interesting use cases. There are many worthy sessions from which to choose, so we hope this list will influence your decisions about where to spend your time during the week!
Evaluating which streaming architectural pattern is the best match to your use case is a precondition for a successful production deployment.
The Apache Hadoop ecosystem has become a preferred platform for enterprises seeking to process and understand large-scale data in real time. Technologies like Apache Kafka, Apache Flume, Apache Spark, Apache Storm, and Apache Samza are increasingly pushing the envelope on what is possible. It is often tempting to bucket large-scale streaming use cases together, but in reality they tend to break down into a few different architectural patterns,