Cloudera Engineering Blog · Hadoop Posts
In its first leg of its tour of the United States earlier this year (see photos here), The Cloudera Sessions proved to be an invaluable single-day event for business and technical leaders exploring practical applications of Apache Hadoop. So valuable, in fact, that we’ve extended the tour with dates/cities this September and October.
Welcome to our second edition of “This Month in the Ecosystem.” (See the inaugural edition here.) Although August was not as busy as July, there are some very notable highlights to report.
One of the first questions Cloudera customers raise when getting started with Apache Hadoop is how to select appropriate hardware for their new Hadoop clusters.
Although Hadoop is designed to run on industry-standard hardware, recommending an ideal cluster configuration is not as easy as delivering a list of hardware specifications. Selecting hardware that provides the best balance of performance and economy for a given workload requires testing and validation. (For example, users with IO-intensive workloads will invest in more spindles per core.)
As announced last Sunday (Aug. 25) on the project mailing list, Apache Hadoop 2.1.0 is the first beta release for Hadoop 2. (See the Release Notes for full list of new features and fixes.) Our congratulations to the Hadoop community for reaching this important milestone in the ongoing adoption of the core Hadoop platform!
With the release of this new beta, and the follow-on GA release on the horizon, we expect to see more customers exploring Hadoop 2 for production use cases. In fact, the upcoming CDH5 beta will be based on the Hadoop 2 GA release, delivering features that we’ve thoroughly tested against enterprise requirements, including (but not limited to):
The guest post below is provided by Justin Langseth, Founder & CEO of Zoomdata, Inc. Thanks, Justin!
What if you could affordably manage billions of rows of raw Big Data and let typical business people analyze it at the speed of thought in beautiful, interactive visuals? What if you could do all the above without worrying about structuring that data in a data warehouse schema, moving it, and pre-defining reports and dashboards? With the approach I’ll describe below, you can.
One of the key principles behind Apache Hadoop is the idea that moving computation is cheaper than moving data — we prefer to move the computation to the data whenever possible, rather than the other way around. Because of this, the Hadoop Distributed File System (HDFS) typically handles many “local reads” reads where the reader is on the same node as the data:
This week, I’d like to shine a spotlight on innovative work the National Institutes of Health (NIH) is working on, leveraging Big Data, in the area of genomic research. Understanding DNA structure and functions is a very data-intensive, complex, and expensive undertaking. Apache Hadoop is making it more affordable and feasible to process, store, and analyze this data, and the NIH is embracing the technology for this reason. In fact, it has initiated a Big Data center of excellence — which it calls Big Data to Knowledge (BD2K) — to accelerate innovations in bioinformatics using Big Data, which will ultimately help us better understand and control various diseases and disorders.
Bob Gourley — a friend of Cloudera’s who wears many hats including publisher of CTOvision.com, CTO of Crucial Point LLC, and GigaOm analyst — recently interviewed Dr. Mark Guyer, the deputy director of the NIH’s National Human Genome Research Institute (NHGRI), about the BD2K effort.
Earlier this week, our partners NetApp and Cisco announced the Flexpod Select Family with support for Cloudera’s Distribution including Apache Hadoop (CDH).
We’re looking forward to the expansion of Flexpod Select to include Hadoop, as it provides additional options for customers to consume the benefits of the Cloudera Enterprise platform.
Strata Conference + Hadoop World 2013 is looming on the horizon and pacing to be the largest gathering of Big Data professionals on the globe. As co-hosts with O’Reilly, we have seen the conference thrive, grow, and are excited about the upcoming Oct. 28 – 30 event!
The ecosystem is evolving at a rapid pace – so rapidly, that important developments are often passing through the public attention zone too quickly. Thus, we think it might be helpful to bring you a digest (by no means complete!) of our favorite highlights on a regular basis. (This effort, by the way, has different goals than the fine Hadoop Weekly newsletter, which has a more expansive view – and which you should subscribe to immediately, as far as we’re concerned.)
Find the first installment below. Although the time period reflected here is obviously more than a month long, we have some catching up to do before we can move to a truly monthly cadence.