Tag Archives: Support

Announcing Cloudera’s Distribution for Apache Hadoop

Categories: Community General Hadoop

One of the recurring themes we have heard while working with our customers and the community is that Apache Hadoop configuration and deployment is a pain. Often, Hadoop is the first truly distributed system that administrators encounter, and the problem is made worse by the lack of standardized packages and deployment tools. And some releases are buggy. And upgrades are hard. And the list goes on.

In order for Hadoop to truly disrupt the enterprise,

Read more

The Small Files Problem

Categories: General Hadoop

Small files are a big problem in Hadoop, or at least they are if the number of questions on the user list on this topic is anything to go by. In this post I’ll look at the problem and examine some common solutions.

Problems with small files and HDFS

A small file is one that is significantly smaller than the HDFS block size (default 64 MB). If you’re storing small files,

Read more
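
As a rough illustration of the definition in the excerpt above, here is a minimal sketch that flags files well below the default 64 MB HDFS block size. The excerpt says only "significantly smaller" and gives no number, so the 10% cutoff and the function names are assumptions made for this example; none of it comes from Hadoop.

```python
# Default HDFS block size quoted in the post (64 MB), in bytes.
DEFAULT_BLOCK_SIZE = 64 * 1024 * 1024

# Hypothetical cutoff: the post does not quantify "significantly smaller",
# so 10% of the block size is an assumption chosen for illustration.
SMALL_FILE_FRACTION = 0.1


def is_small_file(size_bytes, block_size=DEFAULT_BLOCK_SIZE,
                  fraction=SMALL_FILE_FRACTION):
    """Return True if the file size is below the chosen fraction of a block."""
    return size_bytes < block_size * fraction


def count_small_files(sizes, block_size=DEFAULT_BLOCK_SIZE,
                      fraction=SMALL_FILE_FRACTION):
    """Count how many of the given file sizes (in bytes) qualify as small."""
    return sum(1 for s in sizes if is_small_file(s, block_size, fraction))
```

Given a list of file sizes in bytes (collected, for example, from a directory listing), `count_small_files(sizes)` gives a quick sense of how much of a dataset falls under the cutoff.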

Job Scheduling in Apache Hadoop

Categories: Hadoop MapReduce

(guest blog post by Matei Zaharia)

When Apache Hadoop started out, it was designed mainly for running large batch jobs such as web indexing and log mining. Users submitted jobs to a queue, and the cluster ran them in order. However, as organizations placed more data in their Hadoop clusters and developed more computations they wanted to run, another use case became attractive: sharing a MapReduce cluster between multiple users.

Read more