Cloudera Engineering Blog · Distribution Posts
In addition to providing you with a dependable release of Hadoop that is easy to configure, at Cloudera we also focus on developing tools that extend Hadoop’s usability and make Hadoop a more central component of your data infrastructure. In this vein, we’re proud to announce the availability of Sqoop, a tool designed to easily import information from SQL databases into your Hadoop cluster.
Sqoop (“SQL-to-Hadoop”) is a straightforward command-line tool with the following capabilities:
A few months ago we announced the Cloudera Distribution for Hadoop. We’re happy to report that lots of people have started using our distribution, and our GetSatisfaction product (which is essentially a message board about our products) has seen lots of good Hadoop questions and answers. We thought it would be worthwhile to share some of the interesting questions and requests we’ve seen from our users.
Question: How do I back up my name node metadata?
(Editor’s note: The information in this section pertains to CDH3 only. For CDH4, refer to the CDH4 High Availability Guide.) The name node (NN) stores all of the HDFS metadata, which includes file names, directory structures, and block locations. This metadata is stored in memory for fast lookup, but the NN also maintains two on-disk data structures to ensure that metadata is persisted. The first structure stored is a snapshot of the in-memory metadata, and the second structure stored is an edit log of changes that have been made since the snapshot was last taken. The secondary name node (2NN) is in charge of fetching the snapshot and edit log from the NN and merging the two into a new snapshot, which is then sent back to the NN. Once the NN gets the new snapshot, it clears its edit log, and the process repeats. Take a look at our other blog post about multi-host secondary name nodes for more information about configuring the 2NN.
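The checkpoint cycle described above can be sketched as a toy model. This is only an illustration of the snapshot-plus-edit-log idea, not HDFS code: the class `ToyNameNode`, the functions `merge` and `checkpoint`, and the paths and block IDs are all hypothetical names invented for this example.

```python
from dataclasses import dataclass, field


@dataclass
class ToyNameNode:
    """Toy stand-in for the NN's two on-disk structures (hypothetical, not HDFS code)."""
    snapshot: dict = field(default_factory=dict)  # last persisted metadata image
    edit_log: list = field(default_factory=list)  # changes made since that image

    def create(self, path, blocks):
        # Record the change in the edit log instead of rewriting the snapshot.
        self.edit_log.append(("create", path, list(blocks)))

    def delete(self, path):
        self.edit_log.append(("delete", path, None))

    def install_snapshot(self, new_snapshot, merged_edits):
        # Accept the merged snapshot and drop the edits it already contains.
        self.snapshot = new_snapshot
        self.edit_log = self.edit_log[merged_edits:]


def merge(snapshot, edit_log):
    """Replay the edit log on top of a copy of the snapshot."""
    merged = dict(snapshot)
    for op, path, blocks in edit_log:
        if op == "create":
            merged[path] = blocks
        elif op == "delete":
            merged.pop(path, None)
    return merged


def checkpoint(nn):
    """Play the 2NN's role: fetch both structures, merge them, send the result back."""
    fetched_snapshot = dict(nn.snapshot)
    fetched_edits = list(nn.edit_log)
    new_snapshot = merge(fetched_snapshot, fetched_edits)
    nn.install_snapshot(new_snapshot, len(fetched_edits))


nn = ToyNameNode()
nn.create("/data/a.txt", ["blk_1", "blk_2"])
nn.create("/data/b.txt", ["blk_3"])
nn.delete("/data/a.txt")
checkpoint(nn)
print(nn.snapshot)       # only /data/b.txt remains
print(len(nn.edit_log))  # the merged edits have been cleared
```

The sketch keeps the merge outside the name node class to mirror the division of labor in the text: the NN only appends to its edit log, while the merge work happens elsewhere and the NN simply installs the result.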
When we announced Cloudera’s Distribution for Apache Hadoop last month, we asked the community to give us feedback on what features they liked best and what new development was most important to them. Almost immediately, Debian and Ubuntu packages for Hadoop emerged as the most popular request. A lot of customers prefer Debian derivatives over Red Hat, and installing RPMs on top of Debian, while possible with tools like alien, is a pain, to say the least.
After some weeks of development and testing, we are happy to announce the Cloudera APT Repository. APT is the standard package distribution mechanism for Ubuntu and Debian, and by simply pointing your machines at our repository, you can have Hadoop installed within minutes.
Update (added 5/15/2013): The information below is dated; see this post for current instructions about configuring Eclipse for Hadoop contributions.
One of the perks of using Java is the availability of functional, cross-platform IDEs. I use vim for my daily editing needs, but when it comes to navigating, debugging, and coding large Java projects, I fire up Eclipse.
A few weeks ago we announced Cloudera’s Distribution for Apache Hadoop, and I want to spend some time showing how our distribution makes a sysadmin’s job a little easier.
Perhaps the most useful features in our distribution, at least for sysadmins, are RPM packages and init scripts. RPMs are the standard way of installing software on Red Hat-based Linux distributions (RHEL, Fedora Core, CentOS). They give sysadmins a one-command install, and they put libraries, binaries, init scripts, log files, man pages, and configuration files in the places where Linux users expect them, typically /usr/lib, /usr/bin, /etc/init.d, /var/log, /usr/share/man, and /etc, respectively. RPMs are also very easy to uninstall and upgrade.