Cloudera Developer Blog · Hadoop Posts
I’ve always held a strong bias that education is most effective when the student learns by doing. As a developer of technical curricula, my goal is to have training participants engage with real and relevant problems as much as possible through hands-on exercises. The high rate at which Apache Hadoop is changing, both as a technology and as an ecosystem, makes developing Cloudera training courses not only demanding but also seriously fun and rewarding.
I recently undertook the challenge of upgrading the Cloudera Administrator Training for Apache Hadoop. I more than quadrupled the amount of hands-on exercises from the previous version, adding a full day to the course. At four days, it’s now the most thorough training for Hadoop administrators and truly the best way to start building expertise.
While developing the course, I collaborated with some of the most knowledgeable Hadoop administrators I could find, including Eric Sammer, Amandeep Khurana, Kathleen Ting, Romain Rigaux, and many other smart folks at Cloudera. The upgrades to the curriculum and exercises are based on best practices used to resolve our customers’ biggest problems. These insights resonate throughout the course, including the determination that administrators should learn installation, configuration, maintenance, monitoring, and troubleshooting using the standard Hadoop tools. Although we certainly hope that Hadoop users take advantage of Cloudera Manager to simplify and streamline many of these tasks, we believe that every good administrator needs to first take a look under the hood and tinker with Hadoop’s guts. There’s no replacement for get-your-hands-dirty experience to achieve expertise.
Cluster in the Cloud
In December 2012, we described how an internal application built on CDH called Cloudera Support Interface (CSI), which drastically improves Cloudera’s ability to optimally support our customers, is a unique and instructive use case for Apache Hadoop. In this post, we’ll follow up by describing two new differentiating CSI capabilities that have made Cloudera Support yet more responsive for customers:
Why would any company be interested in searching through its vast trove of email? A better question is: Why wouldn’t everybody be interested?
Email has become the most widespread method of communication we have, so there is much value to be extracted by making all emails searchable and readily available for further analysis. Some common use cases that involve email analysis are fraud detection, customer sentiment and churn, lawsuit prevention, and that’s just the tip of the iceberg. Each and every company can extract tremendous value based on its own business needs.
A little over a year ago we described how to archive and index emails using HDFS and Apache Solr. However, at that time, searching and analyzing emails were still relatively cumbersome and technically challenging tasks. We have come a long way in document indexing automation since then — especially with the recent introduction of Cloudera Search, it is now easier than ever to extract value from the corpus of available information.
In its first leg of its tour of the United States earlier this year (see photos here), The Cloudera Sessions proved to be an invaluable single-day event for business and technical leaders exploring practical applications of Apache Hadoop. So valuable, in fact, that we’ve extended the tour with dates/cities this September and October.
Based on feedback from previous attendees, we’ve customized the agenda to be even more targeted for real-world use cases. Furthermore, we’ve added a new Hadoop application development lab, where developers can get hands-on direction for using the Cloudera Development Kit (CDK) to more easily build apps upon, or integrate existing infrastructure with, the Hadoop platform.
Welcome to our second edition of “This Month in the Ecosystem.” (See the inaugural edition here.) Although August was not as busy as July, there are some very notable highlights to report.
One of the first questions Cloudera customers raise when getting started with Apache Hadoop is how to select appropriate hardware for their new Hadoop clusters.
Although Hadoop is designed to run on industry-standard hardware, recommending an ideal cluster configuration is not as easy as delivering a list of hardware specifications. Selecting hardware that provides the best balance of performance and economy for a given workload requires testing and validation. (For example, users with IO-intensive workloads will invest in more spindles per core.)
In this blog post, you’ll learn some of the principles of workload evaluation and the critical role it plays in hardware selection. You’ll also learn the various factors that Hadoop administrators should take into account during this process.
Marrying Storage with Compute
As announced last Sunday (Aug. 25) on the project mailing list, Apache Hadoop 2.1.0 is the first beta release for Hadoop 2. (See the Release Notes for full list of new features and fixes.) Our congratulations to the Hadoop community for reaching this important milestone in the ongoing adoption of the core Hadoop platform!
With the release of this new beta, and the follow-on GA release on the horizon, we expect to see more customers exploring Hadoop 2 for production use cases. In fact, the upcoming CDH5 beta will be based on the Hadoop 2 GA release, delivering features that we’ve thoroughly tested against enterprise requirements, including (but not limited to):
The guest post below is provided by Justin Langseth, Founder & CEO of Zoomdata, Inc. Thanks, Justin!
What if you could affordably manage billions of rows of raw Big Data and let typical business people analyze it at the speed of thought in beautiful, interactive visuals? What if you could do all the above without worrying about structuring that data in a data warehouse schema, moving it, and pre-defining reports and dashboards? With the approach I’ll describe below, you can.
The traditional Apache Hadoop approach — in which you store all your data in HDFS and do batch processing through MapReduce — works well for data geeks and data scientists, who can write MapReduce jobs and wait hours for them to run before asking the next question. But many businesses have never even heard of Hadoop, don’t employ a data scientist, and want their data questions answered in a second or two — not in hours.
One of the key principles behind Apache Hadoop is the idea that moving computation is cheaper than moving data — we prefer to move the computation to the data whenever possible, rather than the other way around. Because of this, the Hadoop Distributed File System (HDFS) typically handles many “local reads” reads where the reader is on the same node as the data:
Initially, local reads in HDFS were handled the same way as remote reads: the client connected to the DataNode via a TCP socket and transferred the data via DataTransferProtocol. This approach was simple, but it had some downsides. For example, the DataNode had to keep a thread around and a TCP socket for each client that was reading a block. There was the overhead of the TCP protocol in the kernel, as well as the overhead of DataTransferProtocol itself. There was room to optimize.
This week, I’d like to shine a spotlight on innovative work the National Institutes of Health (NIH) is working on, leveraging Big Data, in the area of genomic research. Understanding DNA structure and functions is a very data-intensive, complex, and expensive undertaking. Apache Hadoop is making it more affordable and feasible to process, store, and analyze this data, and the NIH is embracing the technology for this reason. In fact, it has initiated a Big Data center of excellence — which it calls Big Data to Knowledge (BD2K) — to accelerate innovations in bioinformatics using Big Data, which will ultimately help us better understand and control various diseases and disorders.
Bob Gourley — a friend of Cloudera’s who wears many hats including publisher of CTOvision.com, CTO of Crucial Point LLC, and GigaOm analyst — recently interviewed Dr. Mark Guyer, the deputy director of the NIH’s National Human Genome Research Institute (NHGRI), about the BD2K effort.
Some key highlights from the interview: