In anticipation of Hadoop World 2010 in New York on October 12th, we continue our Q&A series with Hadoop World presenters to provide a taste of what attendees can expect. We're excited about the 36 presentations that are planned (see agenda), including talks from eBay, Twitter, GE, Facebook, Digg, HP and more. Tim O'Reilly, founder of O'Reilly Media, is keynoting, which should be inspiring as well as thought-provoking. Everyone who registers for Hadoop World will receive a free copy of the second edition of Tom White's Hadoop: The Definitive Guide.
Hadoop World 2010 presenter Saptarshi Guha works in the Department of Statistics at Purdue University. His presentation for Hadoop World is titled "Using R and Hadoop to Analyze VoIP Network Data for QoS." Guha has been developing with Hadoop and R for over a year.
Q: What can attendees expect to learn about Hadoop from your presentation at Hadoop World?
The quality of VoIP calls is susceptible to the queuing effects introduced by the network gateways. The jitter between two consecutive packets is the deviation of the real inter-arrival time from the theoretical inter-arrival time.
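That definition of jitter can be sketched in a few lines. This is a hypothetical illustration (written in Python for brevity rather than the R used in the talk), with made-up arrival times; it is not part of the original analysis.

```python
def jitter(arrivals, theoretical_gap):
    """Per-pair jitter: observed inter-arrival time minus the
    theoretical (offered) inter-arrival time, per the definition above."""
    return [
        (arrivals[i] - arrivals[i - 1]) - theoretical_gap
        for i in range(1, len(arrivals))
    ]

# Hypothetical example: VoIP packets offered every 20 ms,
# observed arrival times in milliseconds.
arrivals = [0.0, 20.5, 39.8, 60.1, 80.0]
print(jitter(arrivals, 20.0))  # small deviations around zero
```

When the network introduces little queuing delay, these per-pair deviations cluster near zero, which is the "negligible jitter" property the statistical analysis checks for.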
We use the R environment for the statistical analysis of data to show that jitter follows the desired properties and is negligible, which demonstrates that the measured traffic is close to the offered traffic. Data sets used to study the departure from offered load can be massive and require detailed study of several complex data structures. Using an environment that integrates R and Hadoop, we hope to demonstrate the effectiveness of R and Hadoop for the comprehensive statistical analysis of massive data sets.
Q: Describe use cases for Hadoop at Purdue.
Our team works with large amounts of network traffic data collected for VoIP and network security projects. Our language of analysis is almost exclusively R, and we need a way to store the 190 gigabytes of VoIP-related data, create data structures for analysis, and compute across them. The R and Hadoop combination allows us to do all of this in a manner that scales with the size of the data and returns results within acceptable time frames. Despite not having HBase installed, we use Hadoop map files and R to query, within seconds, data structures from a database of 14 million objects spanning 21GB.
Q: What benefits do you see from Hadoop?
The biggest wins are the reduction in computing time, the ease of programming in the R and Hadoop environment, and the Hadoop Distributed Filesystem. We have stopped worrying about disk space and freely store as many databases of objects as required. It must be mentioned that Hadoop DFS and MapReduce are both very easy to set up and return very impressive results. For our approach to analysis, the Hadoop MapReduce paradigm fits very well. We partition the data into many subsets (usually by the levels of categorical variables), compute across these, and recombine the results. We also visualize a subset of these, recombining the results into multi-panel, multi-page displays that are viewed on large 30″ monitors.
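The partition/compute/recombine workflow described above maps naturally onto MapReduce: the map phase splits records by the levels of a categorical variable, and the reduce phase computes a per-subset summary that is then recombined. The toy sketch below (in Python, with invented field names such as "gateway") only illustrates the pattern, not the authors' actual R and Hadoop code.

```python
from collections import defaultdict

def divide(records, key):
    """Map phase analogue: partition records by the levels of a
    categorical variable named by `key`."""
    subsets = defaultdict(list)
    for rec in records:
        subsets[rec[key]].append(rec)
    return subsets

def recombine(subsets, summarize):
    """Reduce phase analogue: apply a summary to each subset and
    recombine the per-level results."""
    return {level: summarize(recs) for level, recs in subsets.items()}

# Hypothetical records: per-packet jitter tagged by gateway.
records = [
    {"gateway": "gw1", "jitter": 0.4},
    {"gateway": "gw1", "jitter": 0.6},
    {"gateway": "gw2", "jitter": 1.0},
]
mean_jitter = recombine(
    divide(records, "gateway"),
    lambda recs: sum(r["jitter"] for r in recs) / len(recs),
)
print(mean_jitter)
```

The same shape carries over whether the per-subset computation is a summary statistic or a plot for one panel of a multi-panel display.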
Q: What did you use before Hadoop?
Some of the things we have done would have been impossible without Hadoop. Before this, we used a tree hierarchy of directories of flat files containing R objects, with index files to locate objects within these flat files. Distributing computation across our cluster was a laborious, manual, and very project-specific affair, but now, using the R and Hadoop system, we have sufficiently abstracted the workflow to span a multitude of data sets.
Q: How has Hadoop improved your work at Purdue?
Hadoop has certainly improved our workflow, allowing the researchers to think about studying the data rather than about how to distribute code and data, how to maintain a cluster, or how to tackle tedious but vital things such as computer failure. Because the time to compute is substantially less, the researchers have the flexibility to implement their ideas and interactively analyze the data. We hope to increase our cluster size and bring more people into the fold.
Q: What are you hoping to get out of your time at Hadoop World?
To demonstrate that it is indeed possible to comprehensively analyze gigabytes of data with a level of detail that was only possible with small data sets and to learn of new Hadoop related technologies that might benefit our workflow.