As part of our series of announcements at the recent Hadoop Summit, Cloudera released two of its previously internal projects into open source. One of those was the HUE user interface environment, which we’ll be saying a bit more about later this week. The other was our data movement platform Flume. We’ve been working on Flume for many months, and it’s really exciting to be able to share the details of what we’ve been doing.
The ‘CAP’ theorem is a hot topic in the design of distributed data storage systems. However, it’s often widely misused. In this post I hope to highlight why the common ‘consistency, availability and partition tolerance: pick two’ formulation is inadequate for distributed systems. In fact, the lesson of the theorem is that the choice is almost always between sequential consistency and high availability.
It’s very common to invoke the ‘CAP theorem’
In my first few weeks here at Cloudera, I’ve been tasked with helping out with the Apache ZooKeeper system, part of the umbrella Hadoop project. ZooKeeper is a system for coordinating distributed processes. In a distributed environment, getting processes to act in any kind of synchrony is an extremely hard problem. For example, simply having a set of processes wait until they’ve all reached the same point in their execution –
Last Tuesday – on my second day of work at Cloudera – I went to London to check out the second UK Hadoop User Group meetup, kindly hosted by Sun in a nice meeting room not far from the river Thames. We saw a day of talks from people heavily involved with Hadoop, both on the development and usage side and more often than not a bit of both. It was a great opportunity to put a selection of people all interested in Hadoop technology in the same room and find out what the current status and future directions of the project are.