Last Tuesday – on my second day of work at Cloudera – I went to London to check out the second UK Hadoop User Group meetup, kindly hosted by Sun in a nice meeting room not far from the river Thames. We saw a day of talks from people heavily involved with Hadoop, both on the development and usage side and more often than not a bit of both. It was a great opportunity to put a selection of people all interested in Hadoop technology in the same room and find out what the current status and future directions of the project are.
There were around 55 attendees from a variety of organisations, both academic and professional. Tom White and I were there representing Cloudera, and there were attendees from Microsoft, HP, the Apache Software Foundation and the incredibly fashionable guys from Last.fm.
The slides and talks have been made available by the organisers here – they’re well worth checking out if you want to get a cross-section of some current activity around Hadoop. I’ve written up some notes on the talks which you can find below.
Practical MapReduce

If you’re actively building applications with Hadoop, then Tom’s Practical MapReduce slide deck should be required reading. Tom gave ten invaluable tips for building high-quality MapReduce applications. For example, it’s no longer as necessary to defray job start-up cost by squeezing all your processing into a single job. Improvements to the schedulers (see Matei Zaharia’s posts on the new schedulers and upcoming improvements) have made it more practical to structure your computation as manageable, job-sized units, and planned work on reducing per-job overhead should improve the situation further. Another tip is to implement the Tool interface. A job that implements Tool can offload a number of tedious configuration tasks to the framework, such as the handling of generic command-line parameters. This frees up developer time for more important work, as well as ensuring uniform treatment of common functionality.
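In Hadoop itself, a driver class typically extends Configured, implements Tool’s run() method, and is launched via ToolRunner.run(), which peels off generic options such as -D key=value before your job sees its arguments. That division of labour can be sketched in Python – split_generic_args below is a hypothetical helper for illustration, not part of any Hadoop API:

```python
def split_generic_args(argv):
    """Separate framework-level '-D key=value' options from job-specific
    arguments – roughly the favour Hadoop's generic options parsing does
    for a job that implements Tool."""
    conf, job_args = {}, []
    args = iter(argv)
    for arg in args:
        if arg == "-D":
            # The next token is a key=value pair destined for the job config.
            key, _, value = next(args).partition("=")
            conf[key] = value
        else:
            job_args.append(arg)
    return conf, job_args

conf, args = split_generic_args(["-D", "mapred.reduce.tasks=8", "input", "output"])
```

ToolRunner handles other generic options such as -files and -libjars the same way, so every Tool-based job accepts them uniformly without any per-job code.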
Learning From Data – Apache Mahout
Isabel Drost from the ASF spoke about Apache Mahout, a toolbox of machine learning techniques and algorithms built on Hadoop. Machine learning is becoming hugely important to technology, and drives some of the coolest applications around, from statistical machine translation to automatic recommendation. The machine learning community has long understood that one of the most effective ways to tailor generic learning algorithms is to train them on huge amounts of data. That training can generally be done in parallel, so Hadoop is clearly a great fit. Mahout – a Hindi word roughly meaning ‘elephant driver’ – implements well-known algorithms for clustering, classification and evolutionary programming.
Mahout had a 0.1 release only recently, and has been making great strides in the year since the project was started. One of the strengths of Mahout is that there is an attendant community of experts in machine learning who are willing to help you get started building applications making use of these powerful techniques. Isabel mentioned that they’re looking for contributors and volunteers – if you want to help out, head over to the Apache Mahout project page and see how you can get involved.
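To give a flavour of what those algorithms do, here is a deliberately tiny k-means clusterer – k-means being one of the clustering algorithms Mahout provides – written as plain single-machine Python for illustration rather than with Mahout’s own (Java) API; the comments note which steps a distributed implementation runs as map and reduce phases:

```python
import random

def kmeans(points, k, iterations=10, seed=0):
    """Toy k-means on 1-D points. A Hadoop implementation distributes
    the assignment step (a map) and the centroid update (a reduce)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        clusters = {i: [] for i in range(k)}
        # 'Map' phase: assign each point to its nearest centroid.
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # 'Reduce' phase: recompute each centroid as its cluster's mean.
        for i, members in clusters.items():
            if members:
                centroids[i] = sum(members) / len(members)
    return sorted(centroids)

centroids = kmeans([1.0, 1.1, 0.9, 10.0, 10.2, 9.8], k=2)
```

With two well-separated groups of points, the two centroids settle near 1.0 and 10.0; the point of running this on Hadoop is that the per-point assignment work parallelises trivially across huge datasets.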
Indexing Massive Datasets
Craig McDonald from the University of Glasgow spoke about the Terrier project. Terrier is a platform that supports research into information retrieval, which includes areas like search and natural-language question answering. In particular, Craig told us about his experiences using Hadoop to build an indexing platform that can handle very large data sets – the latest corpus they intend to process runs to 25 TB and over a billion documents. Terrier sees linear speedup in indexing time as the number of nodes in the cluster increases – a good example of how Hadoop scales to large amounts of data as required. To realise this speedup, Craig told us they had to correct an earlier mistake: using only a single reducer. That added a significant serial component to the computation, and as students of Amdahl’s law will know, the speedup a computation can gain from more parallel resources is limited by the fraction of the computation that remains serial. Thankfully, the number of reducers isn’t a fundamental limitation of the Hadoop framework, and a little reconfiguration to use multiple reducers produced some impressive speedup figures.
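Amdahl’s observation is easy to make concrete: if a fraction s of a job is serial, the best possible speedup on n workers is 1/(s + (1 − s)/n). A quick sketch, with made-up serial fractions standing in for a lone reducer:

```python
def amdahl_speedup(serial_fraction, workers):
    """Maximum speedup when only the parallel part benefits from more workers."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / workers)

# A hypothetical job whose single reducer accounts for 20% of the runtime:
capped = amdahl_speedup(0.2, 100)      # ≈ 4.8x, however many nodes you add
# Parallelising the reduce so only 2% remains serial changes the picture:
improved = amdahl_speedup(0.02, 100)   # ≈ 33.6x on the same 100 nodes
```

The numbers are illustrative, but the shape of the curve is why replacing one reducer with many had such a dramatic effect on Terrier’s indexing times.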
MapReduce and Matrix Maths
In the afternoon, Paolo Castagna of HP Labs walked us through an implementation of the PageRank algorithm in MapReduce. PageRank is the algorithm used by Google to determine how ‘important’ web pages are relative to one another. Essentially, it simulates an itinerant web surfer who follows links at random forever. This gives rise to a probability distribution over all web pages, and the more ‘likely’ pages are considered more important. Computing this distribution requires storing the entire graph of web page links as a massive, sparse matrix and then updating the rank of every page in parallel. Again, this is the sort of thing that Hadoop is incredibly good at, and Paolo showed us how simple an efficient implementation of a seemingly complex algorithm can be once MapReduce is doing the heavy lifting for you.
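To make that structure concrete, here is a toy single-machine rendition of the iteration (not Paolo’s code): each pass corresponds to one MapReduce job, with the map phase emitting rank contributions along outlinks and the reduce phase summing them per page. For simplicity it assumes every page has at least one outlink, sidestepping the dangling-node handling a real implementation needs:

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration over a link graph {page: [outlinks]}.
    Each iteration is one MapReduce pass: the map emits
    rank/out-degree per outlink, the reduce sums per page."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        contribs = {p: 0.0 for p in pages}
        # 'Map' phase: each page shares its rank equally among its outlinks.
        for page, outlinks in links.items():
            for target in outlinks:
                contribs[target] += rank[page] / len(outlinks)
        # 'Reduce' phase: sum contributions and apply the damping factor.
        rank = {p: (1 - damping) / n + damping * contribs[p] for p in pages}
    return rank

ranks = pagerank({"a": ["b"], "b": ["c"], "c": ["a", "b"]})
```

The ranks always sum to one, and pages with more (or better-connected) inbound links end up with higher scores – here ‘b’, which is linked from both ‘a’ and ‘c’, outranks ‘a’.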
HBase

Michael Stack from Powerset rounded off the day with an overview of HBase, an open-source implementation of Google’s BigTable. Michael’s slides give a good introduction to the project, and some indication of where things are headed over the next couple of releases. Performance is becoming very important to HBase, as there are users who want to serve large web sites directly from an HBase store. This is one of the focus points for 0.20, along with moving control-plane functionality into ZooKeeper, the reliable coordination service. Multi-row transactions are coming too, which will help those who currently have to pack all the data that must be updated atomically into a single row.
The question-and-answer sessions, as well as the informal chats between talks, gave rise to some lively and interesting debate. It’s clear that some confusion remains over the relative merits of RDBMSes versus MapReduce-based technologies. At Cloudera we’ve put a lot of effort into training courses that help elucidate the different problem domains each approach is suited to, and the onus is on the community to continue to articulate that distinction. It’s evident from the presentations, and the buzz around the meetup, that Hadoop is successfully being used for a lot of serious data processing in real projects, and that people are excited about where the technology goes from here.
Thanks to Johan for organising the day, and to Sun for hosting us and (eventually) finding the pizza for lunch :)