Cloudera Blog · General Posts
Start the year off with bigger questions by taking advantage of Cloudera University’s special offer for aspiring Hadoop administrators. All participants who complete a Cloudera Administrator Training for Apache Hadoop public course by the end of March 2013 will receive a free digital copy of Hadoop Operations by Eric Sammer. If you’ve been asked to maintain large and complex Hadoop clusters, this book is a must. In addition to providing practical guidance from an expert, Hadoop Operations is also a terrific companion reference to the full Cloudera Administrator course.
Cloudera’s three-day course provides administrators with a comprehensive understanding of all the steps necessary to operate and manage Hadoop clusters. From installation and configuration through load balancing and tuning, Cloudera’s administration course has you covered. This course is appropriate for system administrators and others who will be setting up or maintaining a Hadoop cluster. Basic Linux experience is a prerequisite, but prior knowledge of Hadoop is not required.
Upon completion of the course, attendees also receive a voucher for a Cloudera Certified Administrator for Apache Hadoop (CCAH) exam. Certification is a great differentiator; it helps establish individuals as leaders in their field, providing customers with tangible evidence of skills and expertise.
With the availability of this new demo VM containing Cloudera Manager Free Edition and CDH4.1.2 on CentOS 6.2, quick hands-on experience with a freeze-dried single-node Apache Hadoop cluster is just a download and a few minutes away.
This new addition to our growing Demo VM menagerie is available, as usual, in VMware, VirtualBox, and KVM flavors. A 64-bit host OS is required.
A few quick notes from the doc:
The following is a guest post from Nils Kübler, the creator of the Hannibal project. He is a software engineer at Sentric, a Swiss big data specialist providing consultancy, development, and training.
Hannibal helps Apache HBase administrators monitor region distribution across the cluster, and serves primarily as a decision-making aid for manual region splitting. It extends HBase’s monitoring capabilities with several views and interactive graphs of the cluster. Because Hannibal is a web-based tool, it fits smoothly into your existing Hadoop/HBase ecosystem.
Hannibal is open source (MIT License) and implemented in Scala. The current version supports HBase 0.90; support for versions > 0.90 is planned and will be added soon.
The Joy of Splitting
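Hannibal itself is a monitoring UI; the manual splits it helps you decide on are issued through standard HBase tooling. As a rough illustration (the table name and row key below are made up, not from the post), a manual split can be triggered from the HBase shell:

```
# Split every region of a (hypothetical) table at its midpoint:
hbase> split 'mytable'

# In later HBase releases, an explicit split point can be supplied:
hbase> split 'mytable', 'row-50000'
```

Choosing a good explicit split point is exactly where a region-distribution view helps: you want a row key that divides a hot region’s keyspace evenly rather than lopsidedly.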
Have you helped your company ask bigger questions? Our mission at Cloudera University is to equip Hadoop professionals with the skills to manage, process, analyze, and monetize more data than they ever thought possible.
Over the past three years, we’ve heard many great stories from our training participants about faster cluster deployments, complex data workflows made simple, and superhero troubleshooting moments. And we’ve heard from executives in all types of businesses that staffing Cloudera Certified professionals gives them confidence that their Hadoop teams have the skills to turn data into breakthrough insights.
Now, it’s your turn to tell us your bigger questions story! Cloudera University is seeking tales of Apache Hadoop success originating with training and certification. How has an investment in your education paid dividends for your company, team, customer, or career?
The 2012 Strata + Hadoop World conference took place the week before last in New York City. Cloudera co-presented the conference with O’Reilly Media this year, and we were really pleased with how the event turned out. Of course we launched Cloudera Impala, but there was a ton of news from companies across the Apache Hadoop ecosystem. Andrew Brust over at ZDNet wins the prize for comprehensive coverage of all the announcements. I also liked Tony Baer’s excellent roll-up of all the SQL news on the OnStrategies blog.
One piece of coverage crossed my inbox this past week that is not generally available. Peter Goldmacher is a Managing Director and Senior Research Analyst for The Cowen Group, a financial services company headquartered in Manhattan. Cowen helps its clients invest wisely, and Peter’s job is to research and report on industry trends that could shape that investment. Peter and his colleague Joe del Callar wrote up an excellent analysis of the Big Data market after attending Strata + Hadoop World. Because their report is published primarily for Cowen’s clients, it’s not easy to link to. Peter has, however, graciously given me permission to excerpt it here. Thank you, Peter!
The title of the report is Quick Take: Hadoop World Observations: One Step Closer to Mainstream Adoption. It makes the argument that Hadoop isn’t yet mainstream, but that customer adoption and vendor investment have accelerated in the last twelve months, and that an inflection point is looming. Peter and Joe say:
Last week at Strata + Hadoop World 2012, we announced a new data science training and certification program. I am very excited to have been part of the team that put the program together, and I would like to answer some of the most frequently asked questions about the course and the certification that we will be offering.
Why is Cloudera offering data science training?
The primary bottleneck on the success of Hadoop is the number of people who are capable of using it effectively to solve business problems. Addressing that bottleneck with training has always been a very large part of our mission here at Cloudera, and we are very fortunate to have one of the best training teams anywhere. So far, we have trained over 15,000 Hadoop developers and administrators, and our courses and certification exams are available all over the world.
Right now, one of the biggest barriers to the widespread adoption of Hadoop is the supply of data scientists, the peculiar blend of software engineer and statistician that is capable of turning data into awesome. We’ve started to see data science courses develop at universities like Columbia, The University of Washington, and UC Berkeley (taught by Cloudera co-founder Jeff Hammerbacher). While these courses provide excellent instruction to a new generation of data scientists, the instruction they provide is necessarily limited to the students who are enrolled in those institutions, and the need for data science training is much broader and much more immediate.
A few weeks back, Cloudera announced CDH 4.1, the latest update release to Cloudera’s Distribution including Apache Hadoop. This is the first release to introduce truly standalone High Availability for the HDFS NameNode, with no dependence on special hardware or external software. This post explains the inner workings of this new feature from a developer’s standpoint. If, instead, you are seeking information on configuring and operating this feature, please refer to the CDH4 High Availability Guide.
Since the beginning of the project, HDFS has been designed around a very simple architecture: a master daemon, called the NameNode, stores filesystem metadata, while slave daemons, called DataNodes, store the filesystem data. The NameNode is highly reliable and efficient, and the simple architecture is what has allowed HDFS to reliably store petabytes of production-critical data in thousands of clusters for many years; however, for quite some time, the NameNode was also a single point of failure (SPOF) for an HDFS cluster. Since the first beta release of CDH4 in February, this issue has been addressed by the introduction of a Standby NameNode, which provides automatic hot failover capability to a backup. For a detailed discussion of the design of the HA NameNode, please refer to the earlier post by my colleague Aaron Myers.
Limitations of NameNode HA in Previous Versions
As described in the March blog post, NameNode High Availability relies on shared storage: in particular, it requires a place to store the HDFS edit log that can be written by the Active NameNode and simultaneously read by the Standby NameNode. In addition, the shared storage must itself be highly available; if it becomes inaccessible, the Active NameNode can no longer accept namespace edits.
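The standalone HA introduced in CDH 4.1 addresses this by storing the shared edit log on a quorum of JournalNode daemons instead of an external filer. As a minimal sketch (the nameservice and host names here are placeholders, not values from the post), the relevant hdfs-site.xml entries look like:

```xml
<!-- Logical name for the HA NameNode pair (placeholder value) -->
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<!-- Shared edit log stored on a quorum of JournalNodes -->
<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://jn1:8485;jn2:8485;jn3:8485/mycluster</value>
</property>
```

With three JournalNodes, an edit is durable once a majority acknowledge it, so the edit log remains writable by the Active and readable by the Standby even if one JournalNode fails.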
Update time! As a reminder, Cloudera releases major versions of CDH, our 100% open source distribution of Apache Hadoop and related projects, annually, and then updates CDH every three months. Updates primarily comprise bug fixes, though we also add enhancements. We only include fixes and enhancements that maintain compatibility, improve system stability, and still allow customers and users to skip updates as they see fit.
We’re pleased to announce the availability of CDH4.1. We’ve seen excellent adoption of CDH4.0 since it went GA at the end of June, and a number of exciting use cases have moved to production. CDH4.1 is an update that includes many fixes as well as several useful enhancements. Among them:
For those of you new to it, the Duke’s Choice Awards program was initiated by Sun Microsystems in 2002 in an effort to “celebrate extreme innovation in the world of Java technology” – in essence, it’s the “MVP” of the Java ecosystem. Since it acquired Sun in 2009, Oracle has continued the tradition of bestowing the award, and in fact has made the process more community-oriented by accepting nominations from the public and involving Java User Groups in the judging effort.
For the 2012 awards, I’m happy to report that Apache Hadoop is among the awardees - which also include the United Nations High Commissioner for Refugees, Liquid Robotics, and Java cloud company Jelastic Inc., among others.
As Doug Cutting, the Hadoop project’s founder, current ASF chairman, and Cloudera’s chief architect, explains in the Java Magazine writeup about the award, “Java is the primary language of the Hadoop ecosystem…and Hadoop is the de facto standard operating system for big data. So, as the big data trend spreads, Java spreads too.”
The post below was originally published via blogs.apache.org and is republished here for your reading pleasure.
This blog post is about Apache Flume’s File Channel. Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic application.
FileChannel is a persistent Flume channel that supports encryption and writing to multiple disks in parallel.
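A minimal FileChannel configuration sketch (the agent name and paths below are illustrative, not taken from the post):

```properties
# Hypothetical agent 'agent1' with a single file channel
agent1.channels = ch1
agent1.channels.ch1.type = file

# Checkpoint directory plus a comma-separated list of data
# directories; placing data dirs on separate physical disks
# is what lets the channel write to them in parallel
agent1.channels.ch1.checkpointDir = /mnt/flume/checkpoint
agent1.channels.ch1.dataDirs = /mnt/disk1/flume/data,/mnt/disk2/flume/data
```

Pointing `dataDirs` at directories on different physical disks spreads the channel’s sequential log writes across spindles, which is the parallelism the post refers to.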