Cloudera Engineering Blog · Hadoop Posts
I’ve always held a strong bias that education is most effective when the student learns by doing. As a developer of technical curricula, my goal is to have training participants engage with real and relevant problems as much as possible through hands-on exercises. The high rate at which Apache Hadoop is changing, both as a technology and as an ecosystem, makes developing Cloudera training courses not only demanding but also seriously fun and rewarding.
I recently undertook the challenge of upgrading the Cloudera Administrator Training for Apache Hadoop. I more than quadrupled the number of hands-on exercises compared with the previous version, adding a full day to the course. At four days, it’s now the most thorough training for Hadoop administrators and truly the best way to start building expertise.
In December 2012, we described how an internal application built on CDH called Cloudera Support Interface (CSI), which drastically improves Cloudera’s ability to optimally support our customers, is a unique and instructive use case for Apache Hadoop. In this post, we’ll follow up by describing two new differentiating CSI capabilities that have made Cloudera Support yet more responsive for customers:
Why would any company be interested in searching through its vast trove of email? A better question is: Why wouldn’t everybody be interested?
Email has become the most widespread method of communication we have, so there is much value to be extracted by making all emails searchable and readily available for further analysis. Common use cases for email analysis include fraud detection, customer sentiment and churn analysis, and lawsuit prevention, and those are just the tip of the iceberg. Every company can extract tremendous value based on its own business needs.
In the first leg of its tour of the United States earlier this year (see photos here), The Cloudera Sessions proved to be an invaluable single-day event for business and technical leaders exploring practical applications of Apache Hadoop. So valuable, in fact, that we’ve extended the tour with dates/cities this September and October.
Welcome to our second edition of “This Month in the Ecosystem.” (See the inaugural edition here.) Although August was not as busy as July, there are some very notable highlights to report.
One of the first questions Cloudera customers raise when getting started with Apache Hadoop is how to select appropriate hardware for their new Hadoop clusters.
Although Hadoop is designed to run on industry-standard hardware, recommending an ideal cluster configuration is not as easy as delivering a list of hardware specifications. Selecting hardware that provides the best balance of performance and economy for a given workload requires testing and validation. (For example, users with IO-intensive workloads will invest in more spindles per core.)
As announced last Sunday (Aug. 25) on the project mailing list, Apache Hadoop 2.1.0 is the first beta release for Hadoop 2. (See the Release Notes for the full list of new features and fixes.) Our congratulations to the Hadoop community for reaching this important milestone in the ongoing adoption of the core Hadoop platform!
With the release of this new beta, and the follow-on GA release on the horizon, we expect to see more customers exploring Hadoop 2 for production use cases. In fact, the upcoming CDH5 beta will be based on the Hadoop 2 GA release, delivering features that we’ve thoroughly tested against enterprise requirements, including (but not limited to):
The guest post below is provided by Justin Langseth, Founder & CEO of Zoomdata, Inc. Thanks, Justin!
What if you could affordably manage billions of rows of raw Big Data and let typical business people analyze it at the speed of thought in beautiful, interactive visuals? What if you could do all the above without worrying about structuring that data in a data warehouse schema, moving it, and pre-defining reports and dashboards? With the approach I’ll describe below, you can.
One of the key principles behind Apache Hadoop is the idea that moving computation is cheaper than moving data: we prefer to move the computation to the data whenever possible, rather than the other way around. Because of this, the Hadoop Distributed File System (HDFS) typically handles many “local reads,” that is, reads where the reader is on the same node as the data.
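As a rough illustration of the locality principle, here is a minimal Python sketch of how a scheduler might prefer a node that already holds a replica of the data, so that the read stays local. It is not the HDFS or Hadoop scheduler API: the function names (`pick_node_for_block`, `is_local_read`), the block IDs, and the node names are all hypothetical and exist only for this example.

```python
from typing import Dict, List, Optional


def pick_node_for_block(
    block_id: str,
    replica_locations: Dict[str, List[str]],
    free_nodes: List[str],
) -> Optional[str]:
    """Choose a node to run a task that reads `block_id`.

    Hypothetical helper: prefers a free node that already stores a
    replica of the block (a local read), and falls back to the first
    free node otherwise (a remote read).
    """
    if not free_nodes:
        return None  # nothing available to run the task on
    holders = set(replica_locations.get(block_id, []))
    for node in free_nodes:
        if node in holders:
            return node  # computation moves to the data
    return free_nodes[0]  # data must move to the computation


def is_local_read(
    block_id: str,
    node: str,
    replica_locations: Dict[str, List[str]],
) -> bool:
    """Return True when `node` stores a replica of `block_id`."""
    return node in replica_locations.get(block_id, [])


# Illustrative data only: which nodes hold a replica of each block.
replicas = {
    "blk_1": ["node-a", "node-b"],
    "blk_2": ["node-c"],
}

chosen = pick_node_for_block("blk_1", replicas, ["node-c", "node-b"])
print(chosen, is_local_read("blk_1", chosen, replicas))
```

In this sketch, the task reading `blk_1` lands on `node-b` even though `node-c` appears first in the free list, because `node-b` already holds a replica and the read can therefore be served locally.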
This week, I’d like to shine a spotlight on innovative work the National Institutes of Health (NIH) is doing with Big Data in the area of genomic research. Understanding DNA structure and functions is a very data-intensive, complex, and expensive undertaking. Apache Hadoop is making it more affordable and feasible to process, store, and analyze this data, and the NIH is embracing the technology for this reason. In fact, it has initiated a Big Data center of excellence, which it calls Big Data to Knowledge (BD2K), to accelerate innovations in bioinformatics using Big Data, which will ultimately help us better understand and control various diseases and disorders.
Bob Gourley — a friend of Cloudera’s who wears many hats including publisher of CTOvision.com, CTO of Crucial Point LLC, and GigaOm analyst — recently interviewed Dr. Mark Guyer, the deputy director of the NIH’s National Human Genome Research Institute (NHGRI), about the BD2K effort.