At the beginning of September, we announced the first release of CDH2, our current testing repository. Packages in our testing repository are recommended for people who want more features and are willing to upgrade as bugs are worked out. Our testing packages pass unit and functional tests but will not have the same soak time as our stable packages. A testing release represents a work in progress that will eventually be promoted to stable.
One of the more common requests we receive from the community is to package Apache HBase with Cloudera’s Distribution for Apache Hadoop. Lately, I’ve been doing a lot of work on making Cloudera’s packages easy to use, and recently, the HBase team has pitched in to help us deliver compatible HBase packages. We’re pretty excited about this, and we’re looking forward to your feedback. A big thanks to Andrew Purtell, a Senior Architect at TrendMicro and HBase Contributor,
(guest blog post by Pete Skomoroch)
In a previous post, I outlined how to build a basic trend tracking site called trendingtopics.org with Cloudera’s Distribution for Hadoop and Hive. TrendingTopics uses Hadoop to identify the top articles trending on Wikipedia and displays related news stories and charts. The data powering the site was pulled from an Amazon EBS Wikipedia Public Dataset containing 8 months of hourly pageview logfiles.
Apache Hadoop’s jobtracker, namenode, secondary namenode, datanode, and tasktracker all generate logs. That includes logs from each of the daemons under normal operation, as well as configuration logs, statistics, standard error, standard out, and internal diagnostic information. Many users aren’t entirely sure what the differences are among these logs, how to analyze them, or even how to handle simple administrative tasks like log rotation. This blog post describes each category of log, and then details where they can be found for each Hadoop component.
In March of this year, we released our distribution for Apache Hadoop. Our initial focus was on stability and making Hadoop easy to install. This original distribution, now named CDH1, was based on the most stable version of Apache Hadoop at the time: 0.18.3. We packaged up Apache Hadoop, Pig and Hive into RPMs and Debian packages to make managing Hadoop installations easier. For the first time ever, Hadoop cluster managers were able to bring up a deployment by running one of the following commands, depending on their Linux distribution:
# yum install hadoop
# apt-get install hadoop
As proof of this,