Around the world, individuals contribute to Hadoop and build community around the technology. This kind of collaboration is at the heart of open source software, and here at Cloudera, we feel privileged to be a part of the Apache Hadoop community.
Getting together in person is a great way to build community. On global projects, though, sharing information from those gatherings with people who are far away is a big challenge. Recently, Isabel Drost from Berlin approached us about helping out with their local get-together, and we were more than happy to pitch in to sponsor the video production, and to help get the word out on our blog. If you run a local Hadoop meetup and would like to share your event with the world-wide community, please let us know! Following is Isabel’s write-up of the event, along with the videos. – Christophe
Recently, the Apache Hadoop Get-Together took place in Berlin Mitte. For the sixth time, developers interested in large-scale information processing met at newthinking store. Over the course of one year, the event grew from a small, spontaneous gathering of Hadoop enthusiasts to a meetup of 40 people from Germany, France and Denmark. Developers from various companies like Nokia Gate5, StudiVZ and others participated, as well as several freelancers specializing in support for Hadoop and Lucene. Students and researchers from local universities also joined the event.
The meetup was kindly hosted by newthinking store, an event management / IT services company which provides rooms for free software meetings at no cost. The Get-Together was sponsored by Cloudera (video recordings) and O’Reilly (books). Thanks to all three of you.
The Get-Together started in the late afternoon at 5 p.m. with a talk on solving puzzles with MapReduce by Thorsten Schütt. After that, Thilo Götz gave an introduction to JAQL. Finally, Uwe Schindler explained the improvements that come with Lucene 2.9. After the official part we moved over to a bar close by for some food, drinks and (non-free) beer.
A brief summary of each talk can be found below. The slides of the talks have been put online already. Videos will be available early next week. Future meetups and related events in Germany are announced on a public mailing list. Feel free to subscribe to stay up to date on Hadoop Berlin meetups.
Solving puzzles with Map Reduce
Thorsten Schütt gave a presentation on solving sliding puzzles with a MapReduce implementation. Thorsten is a researcher at Zuse Institute Berlin (ZIB). Zuse Institute has been using high-performance compute clusters and working on scalable algorithms for decades.
The talk did not focus on Hadoop in particular, but on applying MapReduce (the paradigm) to solve sliding puzzles. For HPC clusters Hadoop is not the best choice to implement distributed algorithms – developing for these clusters usually involves writing software in Fortran or C/C++, with MPI providing the parallelization framework for distributed programs.
In his presentation Thorsten explained how he had implemented and optimized a breadth-first search algorithm to solve a 4×4 sliding puzzle in a reasonable amount of time. If you are interested in all the details, proofs and concepts, have a look at his HPCS paper “Out-of-Core Parallel Heuristic Search with MapReduce.”
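To give a rough idea of what breadth-first search looks like when phrased as MapReduce rounds, here is a minimal single-process sketch in Python. It is not Thorsten's implementation (that one targets HPC clusters and is described in the paper); the function names and the in-memory "shuffle" are assumptions made purely for illustration. Each round maps every state in the current frontier to its successors, then reduces by collapsing duplicates and dropping states that were already visited.

```python
from itertools import groupby


def neighbors(state, width):
    """Yield every state reachable by sliding one tile into the blank (0)."""
    height = len(state) // width
    blank = state.index(0)
    row, col = divmod(blank, width)
    for d_row, d_col in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        r, c = row + d_row, col + d_col
        if 0 <= r < height and 0 <= c < width:
            target = r * width + c
            nxt = list(state)
            nxt[blank], nxt[target] = nxt[target], nxt[blank]
            yield tuple(nxt)


def map_phase(frontier, width):
    """Map: emit all successors of all frontier states (duplicates included)."""
    for state in frontier:
        yield from neighbors(state, width)


def reduce_phase(emitted, visited):
    """Reduce: sorting stands in for the shuffle; keep each new state once."""
    for state, _ in groupby(sorted(emitted)):
        if state not in visited:
            yield state


def solve(start, goal, width):
    """Return the minimum number of moves, or None if the goal is unreachable."""
    if start == goal:
        return 0
    visited = {start}
    frontier = [start]
    depth = 0
    while frontier:
        depth += 1
        frontier = list(reduce_phase(map_phase(frontier, width), visited))
        if goal in frontier:
            return depth
        visited.update(frontier)
    return None
```

For example, on a 3×3 board `solve((1, 2, 3, 4, 0, 6, 7, 5, 8), (1, 2, 3, 4, 5, 6, 7, 8, 0), 3)` returns 2. A 4×4 puzzle has far too many states for the in-memory `visited` set used here, which is exactly the kind of problem an out-of-core, distributed approach has to deal with.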
I met Thorsten at the “Lange Nacht der Wissenschaften”, a recurring event where Berlin’s universities open up for one night and present their fields of study to everybody. On these evenings, you can join presentations, take guided tours, and meet researchers. If you would like to take a look at the ZIB datacenters yourself, you might want to join this special night next summer and take the tour through the ZIB basement.
An introduction to JAQL
Thilo Götz introduced JAQL, a higher-level query language for JSON documents that was developed at IBM’s Almaden research center. JAQL supports several operations generally known from SQL: grouping results, joining arrays on a common attribute, sorting and expansion. It also has built-in support for loops, conditionals and recursion.
The language can be easily extended by custom Java methods. And the great thing about JAQL: The resulting scripts can be compiled to Hadoop MapReduce jobs. That way developers do not need to know all the gory details of Hadoop MapReduce, but can still get at those if the need arises.
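To make the idea of turning a declarative grouping query into a MapReduce job a bit more concrete, here is a small Python sketch. It is not JAQL syntax and says nothing about how the JAQL compiler actually works; the records, field names and the `run_job` helper are all made up for illustration. A "group by country and count" query is split into a map function that emits a key per record and a reduce function that aggregates per key.

```python
import json
from collections import defaultdict

# Hypothetical JSON input; names and fields are invented for this example.
ATTENDEES_JSON = """
[
  {"name": "attendee-1", "country": "Germany"},
  {"name": "attendee-2", "country": "France"},
  {"name": "attendee-3", "country": "Germany"},
  {"name": "attendee-4", "country": "Denmark"}
]
"""


def map_fn(record):
    """Map: emit (grouping key, 1) for every JSON record."""
    yield record["country"], 1


def reduce_fn(country, counts):
    """Reduce: aggregate all values that share one grouping key."""
    return {"country": country, "attendees": sum(counts)}


def run_job(records, mapper, reducer):
    """Tiny in-memory stand-in for the map, shuffle and reduce phases."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [reducer(key, groups[key]) for key in sorted(groups)]


records = json.loads(ATTENDEES_JSON)
result = run_job(records, map_fn, reduce_fn)
print(json.dumps(result, indent=2))
```

The appeal of a query language on top is that the user only writes the grouping expression; splitting it into the two functions above is left to the compiler.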
JAQL supports various I/O options: JSON data can be read from local disk, HDFS and HBase tables. If that does not fit your needs, there are easy interfaces to implement your own I/O adapters.
Lucene 2.9 developments
The last talk was given by Uwe Schindler. He gave an overview of the optimizations and new features that come with the recently released Lucene 2.9.
Lucene 2.9 comes with a highly optimized implementation of range queries and filters. Uwe gave a demonstration of its performance by doing a range search in the geographical search engine Pangaea.
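The summary above does not say where the speedup comes from. As far as I understand it, numeric range queries in Lucene 2.9 (NumericRangeQuery) rely on indexing each number at several precisions, trie-style, so that a wide range can be matched with a handful of coarse prefix terms plus a few exact terms at the edges. The following Python sketch illustrates that range-splitting idea for small non-negative integers. It is an illustration only, not Lucene code: the function name, the default step of 2 bits and the 8-bit value size are assumptions chosen to keep the example readable.

```python
def split_range(lo, hi, step=2, bits=8):
    """Split [lo, hi] into sub-ranges that are aligned to trie levels.

    Returns a list of (shift, start, end) tuples. A tuple with shift s
    stands for terms in which the lowest s bits have been dropped, and
    covers all values from start to end inclusive.
    Requires 0 <= lo <= hi < 2**bits.
    """
    ranges = []
    shift = 0
    while True:
        fill = (1 << shift) - 1            # bits already dropped at this level
        diff = 1 << (shift + step)
        mask = ((1 << step) - 1) << shift  # bits decided at this level
        has_lower = (lo & mask) != 0       # lo is not aligned to the next level
        has_upper = (hi & mask) != mask    # hi is not aligned to the next level
        next_lo = ((lo + diff) if has_lower else lo) & ~mask
        next_hi = ((hi - diff) if has_upper else hi) & ~mask
        if shift + step >= bits or next_lo > next_hi:
            # Nothing left for a coarser level: emit the rest and stop.
            ranges.append((shift, lo, hi | fill))
            return ranges
        if has_lower:
            ranges.append((shift, lo, lo | mask | fill))
        if has_upper:
            ranges.append((shift, hi & ~mask, hi | fill))
        lo, hi = next_lo, next_hi
        shift += step
```

For instance, `split_range(5, 26)` yields `[(0, 5, 7), (0, 24, 26), (2, 8, 23)]`: the edges 5..7 and 24..26 are matched at full precision, while the middle part 8..23 is matched by only four terms at the coarser level instead of sixteen exact ones.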
A second large improvement is per-segment searching. Lucene indexes are split into segments that are written incrementally and merged during optimization. The index searcher now works directly on segments, and results are merged by collectors. As a result, FieldCaches also work on segments. That way only the caches for changed segments need to be invalidated and recreated.
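The benefit of per-segment caches is easiest to see in a toy model. The Python sketch below is not Lucene's FieldCache API; the class, the segment names and the `loads` counter are hypothetical and exist only to show the effect. Cache entries are keyed by segment, so after new documents arrive in a fresh segment only that segment has to be loaded, while the entries for unchanged segments are reused.

```python
import heapq


class SegmentFieldCache:
    """Toy field cache keyed by (segment name, field)."""

    def __init__(self):
        self._entries = {}
        self.loads = 0  # counts how often a segment had to be (re)loaded

    def get(self, segment_name, docs, field):
        key = (segment_name, field)
        if key not in self._entries:
            self.loads += 1
            self._entries[key] = [doc[field] for doc in docs]
        return self._entries[key]


def search_sorted(segments, field, cache):
    """Collect the field values per segment, then merge the sorted results."""
    per_segment = [
        sorted(cache.get(name, docs, field)) for name, docs in segments.items()
    ]
    return list(heapq.merge(*per_segment))


# Hypothetical index consisting of two segments.
index = {
    "seg_0": [{"year": 2007}, {"year": 2009}],
    "seg_1": [{"year": 2008}],
}
cache = SegmentFieldCache()
first = search_sorted(index, "year", cache)
loads_after_first = cache.loads

# New documents arrive in a new segment; old segments are untouched.
index["seg_2"] = [{"year": 2006}]
second = search_sorted(index, "year", cache)
loads_after_second = cache.loads
```

After the first search two segments have been loaded; after the second search the counter is at three, not four or five, because `seg_0` and `seg_1` were served from the cache. With one cache for the whole index, every reopen would have meant rebuilding everything.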
Lucene now comes with near real-time search, which permits low latency between indexing documents and being able to retrieve them through searches. The refactored TokenStream API supports adding attributes to terms. That way arbitrary information can be added to tokens at indexing time. A use case for this feature is adding part-of-speech (POS) tags to tokens, which can then be used in later analysis steps.
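The attribute idea can be sketched in a few lines of Python. Note that this only mimics the concept: Lucene's real API is Java and looks different, and the `Token` class, the filter function and the tiny tag lexicon below are invented for this example. A tokenizer produces tokens, and a later stage in the chain attaches extra information (here a POS tag) that downstream stages can read.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    """A term plus an open-ended bag of attributes."""
    term: str
    attributes: dict = field(default_factory=dict)


def whitespace_tokenizer(text):
    """Split on whitespace and record each token's position as an attribute."""
    for position, word in enumerate(text.split()):
        yield Token(word, {"position": position})


def pos_tag_filter(tokens, lexicon):
    """Attach a POS tag from a (hypothetical) lookup table to each token."""
    for token in tokens:
        token.attributes["pos"] = lexicon.get(token.term.lower(), "UNK")
        yield token


def nouns_only(tokens):
    """A later analysis step that relies on the attribute added upstream."""
    for token in tokens:
        if token.attributes.get("pos", "").startswith("NN"):
            yield token


# Made-up lexicon; a real tagger would be far more involved.
LEXICON = {"lucene": "NNP", "indexes": "VBZ", "documents": "NNS"}

stream = pos_tag_filter(whitespace_tokenizer("Lucene indexes documents fast"), LEXICON)
tagged = list(stream)
nouns = [token.term for token in nouns_only(tagged)]
```

Here `nouns` ends up as `["Lucene", "documents"]`. The point is that the tokenizer does not need to know about POS tags at all; any stage can add attributes, and any later stage can consume them.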
The general feedback from attendees was very positive: Getting developers and current and future users together at an informal meetup clearly fosters exchanging experience and ideas. Judging from the presentations and discussions at the Get-Together, people are starting to use Hadoop for a variety of processing and data mining tasks.
The next Get-Together is scheduled to take place on December 16th. The date was set by the first presenter at the December meetup, Jörg Möllenkamp from Sun. Late Tuesday evening, nurago from Hannover offered to submit a talk on their experiences with Hadoop. In addition, StudiVZ offered to support the event by sponsoring video production. If you would like to submit a talk yourself or sponsor free beer for all attendees, please contact me at firstname.lastname@example.org
If you just cannot wait until December, the first NoSQL Meetup in Germany is scheduled for mid-October and will be hosted by newthinking store. For those of you who need an excuse to travel to California, ApacheCon US features trainings, meetups and a lot of presentations on Lucene, Solr and Hadoop. Looking forward to seeing you in Oakland!