For the last few months, we’ve been working with the TVA to help them manage hundreds of TB of data from America’s power grids. As the Obama administration investigates ways to improve our energy infrastructure, the TVA is doing everything they can to keep up with the volumes of data generated by the “smart grid.” But as you know, storing that data is only half the battle. In this guest blog post, the TVA’s Josh Patterson goes into detail about how Hadoop enables them to conduct deeper analysis over larger data sets at considerably lower costs than existing solutions. Read more
In my first few weeks here at Cloudera, I’ve been tasked with helping out with the Apache ZooKeeper system, part of the umbrella Hadoop project. ZooKeeper is a system for coordinating distributed processes. In a distributed environment, getting processes to act in any kind of synchrony is an extremely hard problem. For example, simply having a set of processes wait until they’ve all reached the same point in their execution –
Last Tuesday – on my second day of work at Cloudera – I went to London to check out the second UK Hadoop User Group meetup, kindly hosted by Sun in a nice meeting room not far from the river Thames. We saw a day of talks from people heavily involved with Hadoop, both on the development and usage side and more often than not a bit of both. It was a great opportunity to put a selection of people all interested in Hadoop technology in the same room and find out what the current status and future directions of the project are.
Update (added 5/15/2013): The information below is dated; see this post for current instructions about configuring Eclipse for Hadoop contributions.
One of the perks of using Java is the availability of functional, cross-platform IDEs. I use vim for my daily editing needs, but when it comes to navigating, debugging, and coding large Java projects, I fire up Eclipse.
Typically, when you’re developing Map-Reduce applications,
Small files are a big problem in Hadoop — or, at least, they are if the number of questions on the user list on this topic is anything to go by. In this post I’ll look at the problem, and examine some common solutions.
Problems with small files and HDFS
A small file is one which is significantly smaller than the HDFS block size (default 64MB). If you’re storing small files,