In my first few weeks here at Cloudera, I’ve been tasked with helping out with the Apache ZooKeeper system, part of the umbrella Hadoop project. ZooKeeper is a system for coordinating distributed processes. In a distributed environment, getting processes to act in any kind of synchrony is an extremely hard problem. For example, simply having a set of processes wait until they’ve all reached the same point in their execution –
This piece is based on the talk Practical MapReduce that I gave at Hadoop User Group UK on April 14.
1. Use an appropriate MapReduce language
There are many languages and frameworks that sit on top of MapReduce, so its worth thinking up-front which one to use for a particular problem. There is no one-size-fits-all language; each has different strengths and weaknesses.
- Java: Good for: speed;
Last Tuesday – on my second day of work at Cloudera – I went to London to check out the second UK Hadoop User Group meetup, kindly hosted by Sun in a nice meeting room not far from the river Thames. We saw a day of talks from people heavily involved with Hadoop, both on the development and usage side and more often than not a bit of both. It was a great opportunity to put a selection of people all interested in Hadoop technology in the same room and find out what the current status and future directions of the project are.
Small files are a big problem in Hadoop — or, at least, they are if the number of questions on the user list on this topic is anything to go by. In this post I’ll look at the problem, and examine some common solutions.
Problems with small files and HDFS
A small file is one which is significantly smaller than the HDFS block size (default 64MB). If you’re storing small files,
It’s a new year, the time when we take a moment to look back at the previous one, and forward to what might be coming next. In the world of Hadoop a lot happened in 2008.
At the beginning of the year, Hadoop was a sub-project of Lucene. In January, Hadoop became a Top Level Project at Apache, in recognition of its success and diversity of community. This allowed sub-projects to be added,