Cloudera’s Distribution for Apache Hadoop: Making Hadoop Easier for a Sysadmin
A few weeks ago we announced Cloudera’s Distribution for Apache Hadoop, and I want to spend some time showing how our distribution makes a sysadmin’s job a little easier.
Perhaps the most useful features in our distribution, at least for sysadmins, are RPM packages and init scripts. RPMs are the standard way of installing software on Red Hat-based Linux distributions (RHEL, Fedora Core, CentOS). They give sysadmins a one-command install, and they put libraries, binaries, init scripts, log files, man pages, and configuration files where Linux users expect them, typically /usr/lib, /usr/bin, /etc/init.d, /var/log, /usr/share/man, and /etc, respectively. RPMs are also very easy to uninstall and upgrade.
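To make the one-command install, upgrade, and uninstall concrete, here is a minimal Python sketch that builds the corresponding rpm command lines. The -i, -U, and -e flags are standard rpm options; the helper name rpm_command and the names "hadoop.rpm" and "hadoop" are placeholders I made up for illustration, not the actual names shipped in the distribution.

```python
# Map each lifecycle action to the rpm flag that performs it.
RPM_FLAGS = {
    "install": "-i",    # install a package file
    "upgrade": "-U",    # upgrade (or install) from a package file
    "uninstall": "-e",  # erase an installed package by name
}


def rpm_command(action, target):
    """Return the argv list for an rpm install, upgrade, or uninstall.

    `target` is a package file for install/upgrade and a package name
    for uninstall. This only builds the command; it does not run it.
    """
    if action not in RPM_FLAGS:
        raise ValueError("unsupported action: %s" % action)
    return ["sudo", "rpm", RPM_FLAGS[action], target]


if __name__ == "__main__":
    # Placeholder names (assumptions), not the real package names.
    examples = [
        ("install", "hadoop.rpm"),
        ("upgrade", "hadoop.rpm"),
        ("uninstall", "hadoop"),
    ]
    for action, target in examples:
        print(" ".join(rpm_command(action, target)))
```

Each action is a single command, which is the property that makes RPMs easy to script.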
Init scripts are the standard way to start, stop, and restart daemon processes on a Linux system. They allow sysadmins to start and stop daemons with the /sbin/service script, and they use a standard parameter interface, namely start, stop, or restart (e.g., sudo /sbin/service hadoop-datanode start). Init scripts also make sure that the daemon runs as the correct user, which in Hadoop’s case is the hadoop user. Lastly, init scripts are used to start daemons at boot time, allowing daemons to survive reboots.
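The uniform start, stop, and restart interface is what makes init scripts easy to drive from other tools. As a rough sketch, the Python below builds and optionally runs the same /sbin/service command shown above. The function names service_command and run_service are hypothetical helpers of my own, and only hadoop-datanode is a daemon name taken from this post.

```python
import subprocess

# The standard parameter interface described above.
VALID_ACTIONS = ("start", "stop", "restart")


def service_command(daemon, action):
    """Build the argv list for `sudo /sbin/service <daemon> <action>`."""
    if action not in VALID_ACTIONS:
        raise ValueError("action must be one of %s" % (VALID_ACTIONS,))
    return ["sudo", "/sbin/service", daemon, action]


def run_service(daemon, action):
    """Run the service command and return its exit status."""
    return subprocess.call(service_command(daemon, action))


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch is safe
    # to try on a machine without Hadoop installed.
    print(" ".join(service_command("hadoop-datanode", "restart")))
```

Because every daemon accepts the same three actions, one small wrapper like this covers all of them.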
Perhaps I’ve convinced you that our distribution for Hadoop is easier to install. Simplifying the installation process for a single machine doesn’t have a huge impact. However, when you’re deploying Hadoop on many machines with a configuration management tool such as Puppet, Bcfg2, Chef, or Cfengine, RPMs and init scripts make a sysadmin’s work a little easier. I’ve written two Puppet implementations to demonstrate this point: the first installs Hadoop from a tarball; the second installs from an RPM. If you’re familiar with Puppet, you’ll notice that the RPM installation is more Puppet-friendly and generally less complicated. If you’d like to learn a little more about Puppet, the type reference and getting started docs are good starting points.
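To give a feel for why packages and init scripts fit configuration management well, here is a small Python sketch that renders a minimal Puppet manifest using Puppet's built-in package and service resource types. This is not either of the two implementations mentioned above, only an illustration; the function name render_manifest and the package name "hadoop" are assumptions, while hadoop-datanode is the service name used earlier in this post.

```python
def render_manifest(package, service):
    """Render a minimal Puppet manifest as a string.

    The manifest installs `package`, then keeps `service` running and
    enabled at boot. Doubled braces in the f-strings produce the
    literal braces that Puppet's resource syntax needs.
    """
    return (
        f"package {{ '{package}':\n"
        f"  ensure => installed,\n"
        f"}}\n"
        f"\n"
        f"service {{ '{service}':\n"
        f"  ensure  => running,\n"
        f"  enable  => true,\n"
        f"  require => Package['{package}'],\n"
        f"}}\n"
    )


if __name__ == "__main__":
    # "hadoop" is an assumed package name used only for illustration.
    print(render_manifest("hadoop", "hadoop-datanode"))
```

With an RPM and an init script available, the whole deployment reduces to two declarative resources. A tarball install would instead need separate steps for fetching, unpacking, creating the user, and starting the daemon.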
RPMs only work on Red Hat-based Linux distributions such as Fedora Core, CentOS, and RHEL. We’re currently working on providing DEBs, the equivalent package format for Debian-based distributions such as Debian and Ubuntu.
If you have any questions about our distribution for Hadoop, or if you’d like to see packages for other systems (FreeBSD or Mac OS X anyone?), then drop us a question on our Get Satisfaction forums.