Cloudera Blog · Hadoop Posts

Understanding MapReduce via Boggle, Part 2: Performance Optimization

In Part 1 of this series, you learned about MapReduce’s ability to process graphs via the example of Boggle*. The project’s full source code can be found on my GitHub account.

The example comprised a 4×4 matrix of letters, which doesn’t come close to the number of relationships in a large graph. To calculate the number of possible combinations, we turned off the Bloom filter with “-D bloom=false”. This enters a brute-force mode in which all possible combinations in the graph are traversed. A 4×4 (16-letter) matrix has 11,686,456 combinations, and a 5×5 (25-letter) matrix has 9,810,468,798.
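To get a feel for how quickly brute-force traversal grows, here is a minimal sketch in Python (not the project's own code) that counts every path through an n×n board without revisiting a cell. It assumes the usual Boggle adjacency of eight neighbors per cell; the names `neighbors`, `count_paths`, and `min_len` are hypothetical. The totals quoted above come from the actual MapReduce job and may follow different counting rules (for example, a minimum word length), so this sketch is not guaranteed to reproduce them.

```python
from itertools import product


def neighbors(n):
    """Map each cell (row, col) of an n x n board to its adjacent cells.

    Assumption: cells touch horizontally, vertically, and diagonally.
    """
    return {
        (r, c): [
            (r + dr, c + dc)
            for dr, dc in product((-1, 0, 1), repeat=2)
            if (dr, dc) != (0, 0) and 0 <= r + dr < n and 0 <= c + dc < n
        ]
        for r, c in product(range(n), repeat=2)
    }


def count_paths(n, min_len=1):
    """Count paths of at least min_len cells that never reuse a cell."""
    adj = neighbors(n)

    def extend(cell, visited, length):
        # The path ending at `cell` counts once it is long enough.
        total = 1 if length >= min_len else 0
        for nxt in adj[cell]:
            if nxt not in visited:
                visited.add(nxt)
                total += extend(nxt, visited, length + 1)
                visited.remove(nxt)
        return total

    # Start one depth-first traversal from every cell on the board.
    return sum(extend(cell, {cell}, 1) for cell in adj)
```

On a 2×2 board every cell touches every other, so `count_paths(2)` returns 64 (4 + 12 + 24 + 24). Larger boards grow very quickly, which is why pruning dead-end paths matters.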

As previously discussed, increasing matrix sizes is an important part of scaling up. We also want to use the cluster effectively when processing the graph. In this post, I’ll describe some of the optimizations I used to improve performance and scalability.

Webinar: Introduction to Hadoop Developer Training (Jan. 31)

Are you new to Apache Hadoop and need to start processing data fast and effectively? Have you been playing with CDH and are ready to move on to development supporting a technical or business use case? Are you prepared to unlock the full potential of all your data by building and deploying powerful Hadoop-based applications?

Save 15% on Multi-Course Public Training Enrollments in January and February

Cloudera University is the world leader in Apache Hadoop training and certification. Our full suite of live courses and online materials is the best resource to get started with your Hadoop cluster in development or advance it towards production.  We offer deep industry insight into the skills and expertise required to establish yourself as a leading Developer or Administrator managing and processing Big Data in this fast-growing field.

But did you know Cloudera training can also help you plan for the advanced stages and progress of your Hadoop cluster? In addition to core training for Developers and Administrators, we also offer the best (and, in some cases, only) opportunity to get up to speed on lifecycle projects within the Hadoop ecosystem in a classroom setting. Cloudera University’s course offerings go beyond the basics to include Training for Apache HBase, Training for Apache Hive and Pig, and Introduction to Data Science: Building Recommender Systems. Depending on your Big Data agenda, Cloudera training can help you increase the accessibility and queryability of your data, push your data performance towards real-time, conduct business-critical analyses using familiar scripting languages, build new applications and customer-facing products, and conduct data experiments to improve your overall productivity and profitability.

For a limited time, Cloudera University is offering a 15% discount when you register for two or more Hadoop training courses to help you build out and realize your Big Data plan. Cover the basics with Developer or Administrator training, move beyond the HDFS and MapReduce core by pairing Developer and HBase training, work towards machine learning with Hive and Pig training and Introduction to Data Science, or customize your own learning path.  Just use discount code 15off2 when you register for multiple public training classes from Cloudera University. This offer is only available for new enrollments and is only valid for classes delivered by Cloudera and scheduled to begin before March 1, 2013.

Meet the Instructor: Jesse Anderson

The Hadoop Community is an invariably fascinating world. After all, as Clouderan ATM put it in a past blog post, the user group meetups are adorably called “HUGs.” Just as the Cloudera blog has introduced you to some of the engineers, projects, and applications that serve as the head, heart, and hands of the Hadoop Community, we’re proud to add the circulatory system (to extend the metaphor), made up of Cloudera’s expert trainers and curriculum developers who bring Hadoop to new practitioners around the world every week.

Welcome to the first installment of our “Meet the Instructor” series, in which we briefly introduce you to some of the individuals endeavoring to teach Hadoop far and wide. Today, we speak to Jesse Anderson (@jessetanderson)! 

What is your role at Cloudera?
I joined Cloudera about a year ago as a curriculum developer and instructor. I get the best of both worlds in educational services: I create and improve existing curriculum, such as the Cloudera Manager series, and I travel to teach the courses.

Understanding MapReduce via Boggle

Graph theory is a growing part of Big Data. Using graph theory, we can find relationships in networks. 

MapReduce is a great platform for traversing graphs, so you can leverage the power of an Apache Hadoop cluster to run a graph algorithm efficiently.

One such graph problem is playing Boggle*. Boggle is played by rolling a group of 16 dice. Each player’s job is to find the most words spelled out by the dice. Each die is six-sided and lands with a single letter facing up:

Apache Hadoop in 2013: The State of the Platform

For several good reasons, 2013 is a Happy New Year for Apache Hadoop enthusiasts.

In 2012, we saw continued progress on developing the next generation of the MapReduce processing framework (MRv2), work that will bear fruit this year. HDFS experienced major progress toward becoming a lights-out, fully enterprise-ready distributed filesystem with the addition of high availability features and increased performance. And a hint of the future of the Hadoop platform was provided with the Beta release of Cloudera Impala, a real-time query engine for analytics across HDFS and Apache HBase data.

Let’s look at the highlights of the 2012 developments around projects supported by Cloudera.

Apache Hadoop Releases

Data Hacking Day with Cloudera (Feb. 25, Palo Alto)

(Update 2/6/2013 – Sorry, this event is sold out!)

With Strata Conference 2013 coming to town (Feb. 26-28, in Santa Clara, Calif.), we thought it would be a great opportunity to open our Palo Alto office’s doors for a pre-conference “Data Hacking Day” on Monday, Feb. 25!

Participants will use Cloudera Impala, the open-source, real-time query engine for Apache Hadoop, to hack on a rich public data set. After forming teams, you’ll compete to see whose project will earn enough votes to win the data-hacking trophy for the day. All members of the winning team will get free hard copies of Eric Sammer’s coveted O’Reilly book, Hadoop Operations.

Get a Free Hadoop Operations Ebook with Administrator Training

Start the year off with bigger questions by taking advantage of Cloudera University’s special offer for aspiring Hadoop administrators. All participants who complete a Cloudera Administrator Training for Apache Hadoop public course by the end of March 2013 will receive a free digital copy of Hadoop Operations by Eric Sammer. If you’ve been asked to maintain large and complex Hadoop clusters, this book is a must. In addition to providing practical guidance from an expert, Hadoop Operations is also a terrific companion reference to the full Cloudera Administrator course.

Cloudera’s three-day course provides administrators a comprehensive understanding of all the steps necessary to operate and manage Hadoop clusters. From installation and configuration through load balancing and tuning your cluster, Cloudera’s administration course has you covered. This course is appropriate for system administrators and others who will be setting up or maintaining a Hadoop cluster. Basic Linux experience is a prerequisite, but prior knowledge of Hadoop is not required.

Upon completion of the course, attendees also receive a voucher for a Cloudera Certified Administrator for Apache Hadoop (CCAH) exam. Certification is a great differentiator; it helps establish individuals as leaders in their field, providing customers with tangible evidence of skills and expertise.

A Guide to Python Frameworks for Hadoop

I recently joined Cloudera after working in computational biology/genomics for close to a decade. My analytical work is primarily performed in Python, along with its fantastic scientific stack. It was quite jarring to find out that the Apache Hadoop ecosystem is primarily written in/for Java. So my first order of business was to investigate some of the options that exist for working with Hadoop from Python.

In this post, I will provide an unscientific, ad hoc review of my experiences with some of the Python frameworks that exist for working with Hadoop, including:

What’s Next for Cloudera Impala?

It’s been an exciting month and a half since the beta launch of Cloudera Impala, the new open source distributed query engine for Apache Hadoop, and we thought it’d be a great time to provide an update about what’s next for the project, including our product roadmap, release schedule, and open-source plan.

First of all, we’d like to thank you for your enthusiasm and valuable beta feedback. We’re actively listening and have already fixed many of the bugs reported, captured feature requests for the roadmap, and updated the Cloudera Impala FAQ based on user input.

GA Roadmap

Our primary focus between now and general availability (GA) is making Impala enterprise-ready for your production Hadoop clusters. This means continued investments in product stability as well as product functionality, including:
