The Small Files Problem

Categories: General Hadoop

Small files are a big problem in Hadoop — or, at least, they are if the number of questions on the user list on this topic is anything to go by. In this post I’ll look at the problem, and examine some common solutions.

Problems with small files and HDFS

A small file is one which is significantly smaller than the HDFS block size (default 64MB). If you're storing small files, …

Read more
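
The excerpt above defines a small file relative to the HDFS block size. As a rough way to gauge how many such files a directory holds, here is a minimal sketch using Hadoop's FileSystem API; the SmallFileCounter class name and the half-block threshold are illustrative assumptions, not anything taken from the post itself.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical helper: counts files in an HDFS directory that are
// "small" relative to their block size.
public class SmallFileCounter {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path dir = new Path(args[0]);

    long small = 0, total = 0;
    for (FileStatus status : fs.listStatus(dir)) {
      if (status.isDir()) {
        continue; // only counts plain files at the top level
      }
      total++;
      // Illustrative threshold: call a file "small" if it occupies
      // less than half of its HDFS block.
      if (status.getLen() < status.getBlockSize() / 2) {
        small++;
      }
    }
    System.out.printf("%d of %d files are smaller than half a block%n",
        small, total);
  }
}
```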

Job Scheduling in Apache Hadoop

Categories: Hadoop MapReduce

(guest blog post by Matei Zaharia)

When Apache Hadoop started out, it was designed mainly for running large batch jobs such as web indexing and log mining. Users submitted jobs to a queue, and the cluster ran them in order. However, as organizations placed more data in their Hadoop clusters and developed more computations they wanted to run, another use case became attractive: sharing a MapReduce cluster between multiple users.

Read more
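
The excerpt stops before naming any alternatives to FIFO order. For illustration only: in classic (pre-YARN) Hadoop MapReduce, the JobTracker's scheduler was pluggable, and sharing a cluster between users typically meant swapping in a different scheduler via mapred-site.xml. The snippet below is a minimal sketch of enabling the Fair Scheduler this way; it assumes the fair scheduler jar is on the JobTracker's classpath and is not necessarily what this particular post covers.

```xml
<!-- mapred-site.xml: replace the default FIFO scheduler
     with the Fair Scheduler on the JobTracker -->
<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>
```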