Cloudera Blog · Data Collection Posts
(guest blog post by Dmitriy Ryaboy)
A number of organizations donate server space and bandwidth to the Apache Foundation; when you download Apache Hadoop, Tomcat, Maven, CouchDB, or any of the other great Apache projects, the bits are sent to you from a large list of mirrors. One of the ways in which Cloudera supports the open source community is to host such a mirror.
In this blog post, we will use Pig to examine the download logs recorded on our server, demonstrating several features that are often glossed over in introductory Pig tutorials—parameter substitution in PigLatin scripts, Pig Streaming, and the use of custom loaders and user-defined functions (UDFs). It’s worth mentioning here that, as of last week, the Cloudera Distribution for Hadoop includes a package for Pig version 0.2 for both Red Hat and Ubuntu, as promised in an earlier post. It’s as simple as
apt-get install pig or
yum install hadoop-pig.
In addition to providing you with a dependable release of Hadoop that is easy to configure, at Cloudera we also focus on developing tools to extend Hadoop’s usability, and make Hadoop a more central component of your data infrastructure. In this vein, we’re proud to announce the availability of Sqoop, a tool designed to easily import information from SQL databases into your Hadoop cluster.
Sqoop (“SQL-to-Hadoop”) is a straightforward command-line tool with the following capabilities:
Update (added 5/15/2013): The information below is a bit dated; see this post for current instructions about configuring Eclipse for Hadoop contributions.
One of the perks of using Java is the availability of functional, cross-platform IDEs. I use
vim for my daily editing needs, but when it comes to navigating, debugging, and coding large Java projects, I fire up Eclipse.
Typically, when you’re developing Map-Reduce applications, you simply point Eclipse at the Apache Hadoop
jar file, and you’re good to go. (Cloudera’s Hadoop training VM has a fully-configured example.) However, when you want to dig deeper to explore—and modify—Hadoop’s internals themselves, you’ll want to configure Eclipse to build Hadoop. Because there’s generated code and a complicated
build.xml file, this takes some tinkering. Now that I have the full Hadoop Eclipse experience going (it took me a few tries), I’ve prepared a screencast that will help guide you through it, from downloading Eclipse to debugging one of its unit tests. You’ll also want to reference the EclipseEnvironment Hadoop wiki page, which has more details.
We’ve been talking to enterprise users of Hadoop about existing and new projects, and lots of them are asking questions about reliability and data integrity. So we wrote up a short paper entitled HDFS Reliability to summarize the state of the art and provide advice. We’d like to get your feedback, too, so please leave a comment.
As promised in my post about installing Scribe for log collection, I’m going to cover how to configure and use Scribe for the purpose of collecting Hadoop logs. In this post I’ll describe how to create the Scribe Thrift client for use in Java, add a new log4j Appender to Hadoop, configure Scribe, and collect logs from each node in a Hadoop cluster. At the end of the post, I will link to all source and configuration files mentioned in this guide.
What’s the advantage of collecting Hadoop’s logs?
Collecting Hadoop’s logs in to one single log file is very convenient for debugging and monitoring purposes. Essentially, what Scribe lets you do is tail -f your entire cluster, which is really cool to see :). Having one single log file on one, or even just a few, machines makes log analysis insanely simpler. Making log analysis simpler means the creation of monitoring and reporting tools just got a lot easier as well. Let’s get started …
Create the Scribe Thrift Client for Java
In order to use Scribe with Java, you need to use Thrift to create a Scribe client stub, which is a collection of generated Java files to be used by our custom log4j Appender, which will be covered later. For now, cd in to your Scribe distribution directory (I used trunk), and then cd in to the ‘if’ directory. Now you’re in $SCRIBE_DIST/if. Edit scribe.thrift by changing the ‘include’ line that includes fb303.thrift to the following:
Scribe is a newly released log collection tool that dumps log files from various nodes in a cluster to Scribe servers, where the logs are stored for further use. Facebook describes their usage of Scribe by saying, “[Scribe] runs on thousands of machines and reliably delivers tens of billions of messages a day.” It turns out that Scribe is rather difficult to install, so the hope of this post is to help those of you attempting to install Scribe. The first step is to get dependencies installed.
Scribe has many dependencies that must be installed in order for Scribe to be built properly. They are listed here: