Cloudera Developer Blog · MapReduce Posts
The following guest post comes from Alejandro Caceres, president and CTO of Hyperion Gray LLC – a small research and development shop focusing on open-source software for cyber security.
Imagine this: You’re an informed citizen, active in local politics, and you decide you want to support your favorite local political candidate. You go to his or her new website and make a donation, providing your bank account information, name, address, and telephone number. Later, you find out that the website was hacked and your bank account and personal information stolen. You’re angry that your information wasn’t better protected — but at whom should your anger be directed?
Who is responsible for the generally weak condition of website security, today? It can’t be website operators, because there’s no prerequisite to know about blind SQL injection attacks or validation filters before spinning up a website. It can’t be website developers either — we definitely don’t equip them to evaluate website security for themselves. It’s a pretty small community that focuses on web development and web security, and that community is pretty opaque.
The following guest post is provided by Aaron Kimball, CTO of WibiData.
The Kiji ecosystem has grown with the addition of a new module, KijiMR. The Kiji framework is a collection of components that offer developers a handle on building Big Data Applications. In addition to the first release, KijiSchema, we are now proud to announce the availability of a second component: KijiMR. KijiMR allows KijiSchema users to use MapReduce techniques including machine-learning algorithms and complex analytics to develop many kinds of applications using data in KijiSchema. Read on to learn more about the major features included in KijiMR and how you can use them.
KijiMR offers developers a set of new processing primitives explicitly designed for interacting with complex table-oriented data. The low-level batch interfaces available in MapReduce include basic InputFormat and OutputFormat implementations. The raw APIs are designed for processing key-value pairs stored in flat files in HDFS. Integrating MapReduce with HBase via InputFormat and OutputFormat APIs is hard to do from scratch in every algorithm. In KijiMR, we have extended the available MapReduce APIs to include:
Last week the Apache Hadoop PMC voted to release Apache Hadoop 2.0.3-alpha, the latest in the Hadoop 2 release series. This release fixes over 500 issues (covering the Common, HDFS, MapReduce and YARN sub-projects) since the 2.0.2-alpha release in October last year. In addition to bug fixes and general improvements the more noteworthy changes include:
In my previous post, you learned how to write a basic MapReduce job and run it on Apache Hadoop. In this post, we’ll delve deeper into MapReduce programming and cover some of the framework’s more advanced features. In particular, we’ll explore:
The following is a series of stories from people who in the recent past worked as Engineering Interns at Cloudera. These experiences concretely illustrate how collaboration between commercial companies like Cloudera and academia, such as in the form of these internships, helps promote big data research at universities. (These experiences were previously published in the ACM student journal, XRDS.)
Yanpei Chen (Intern 2011)
I Interned with Cloudera during my last summer of grad school. My dissertation was on “Workload Driven Design and Evaluation of Large-Scale Data-Centric Systems”, and I already had collaborations with Facebook and NetApp, two other big data companies. The goal of my work was to develop and demonstrate a set of empirical, workload-driven design and evaluation methods that complemented the traditional, subjective approach of designing by intuition and experience. It was very important that these methods generalized across many types of customer workloads. Hence, when Cloudera offered me an internship, I leapt at the unique opportunity to collect insights from customers in traditional industries who were still dealing with big data.
The example comprised a 4×4 matrix of letters, which doesn’t come close to the number of relationships in a large graph. To calculate the number of possible combinations, we turned off the Bloom Filter with “
-D bloom=false“. This enters a brute-force mode where all possible combinations in the graph are traversed. In a 4×4 or 16-letter matrix, there are 11,686,456 combinations, and a 5×5 or 25-letter matrix has 9,810,468,798 combinations.
As previously discussed, increasing matrix sizes is an important part of scaling up. We also want to effectively use the cluster when processing the graph. In this post, I’ll describe some of the performance optimizations I used to improve performance and scalability.
Graph theory is a growing part of Big Data. Using graph theory, we can find relationships in networks.
MapReduce is a great platform for traversing graphs. Therefore, one can leverage the power of an Apache Hadoop cluster to efficiently run an algorithm on the graph.
One such graph problem is playing Boggle*. Boggle is played by rolling a group of 16 dice. Each players’ job is find the most number of words spelled out by the dice. These dice are six-sided with a single letter that faces up:
For several good reasons, 2013 is a Happy New Year for Apache Hadoop enthusiasts.
In 2012, we saw continued progress on developing the next generation of the MapReduce processing framework (MRv2), work that will bear fruit this year. HDFS experienced major progress toward becoming a lights-out, fully enterprise-ready distributed filesystem with the addition of high availability features and increased performance. And a hint of the future of the Hadoop platform was provided with the Beta release of Cloudera Impala, a real-time query engine for analytics across HDFS and Apache HBase data.
Let’s look at the highlights of the 2012 developments around projects supported by Cloudera.
Apache Hadoop Releases
As Apache Oozie, the workflow engine for Apache Hadoop, continues to receive wider adoption from our customers and the community, we’re seeing patterns with respect to the biggest challenges for users. One such point of difficulty is setting up and using Oozie’s ShareLib for allowing JARs to be shared by different workflows. This blog post is intended to help you with those tasks.
A missing or improperly installed ShareLib will cause some action types (DistCp, Streaming, Pig, Sqoop, and Hive) to fail. In this case, you’ll typically see any of the following exceptions in the Oozie and JobTracker logs:
java.lang.ClassNotFoundException: org.apache.hadoop.tools.DistCp java.lang.NoClassDefFoundError: org/apache/pig/Main java.lang.ClassNotFoundException: org.apache.pig.Main java.lang.NoClassDefFoundError: org/apache/sqoop/Sqoop java.lang.ClassNotFoundException: org.apache.sqoop.Sqoop java.lang.NoClassDefFoundError: org/apache/hadoop/hive/cli/CliDriver java.lang.ClassNotFoundException: org.apache.hadoop.hive.cli.CliDriver java.lang.ClassNotFoundException: Class org.apache.hadoop.streaming.PipeMapRunner not found
This is the first post in series that will get you going on how to write, compile, and run a simple MapReduce job on Apache Hadoop. The full code, along with tests, is available at http://github.com/cloudera/mapreduce-tutorial. The program will run on either MR1 or MR2.
We’ll assume that you have a running Hadoop installation, either locally or on a cluster, and your environment is set up correctly so that typing “hadoop” into your command line gives you some notes on usage. Detailed instructions for installing CDH, Cloudera’s open-source, enterprise-ready distro of Hadoop and related projects, are available here: https://ccp.cloudera.com/display/CDH4DOC/CDH4+Installation. We’ll also assume you have Maven installed on your system, as this will make compiling your code easier. Note that Maven is not a strict dependency; we could also compile using Java on the command line or with an IDE like Eclipse.
The Use Case
There’s been a lot of brawling on our pirate ship recently. Not so rarely, one of the mates will punch another one in the mouth, knocking a tooth out onto the deck. Our poor sailors will wake up the next day with an empty bottle of rum, wondering who’s responsible for the gap between their teeth. All this violence has gotten out of hand, so as a deterrent, we’d like to provide everyone with a list of everyone that’s ever left them with a gap. Luckily, we’ve been able to set up a Flume source so that every time someone punches someone else, it gets written out as a line in a big log file in Hadoop. To turn this data into these lists, we need a MapReduce job that can 1) invert the mapping from attacker to their victim, 2) group by victims, and 3) eliminate duplicates.