Cloudera Engineering Blog · Hadoop Posts
This post is the first in a series of blog posts about Cloudera Morphlines, a new command-based framework that simplifies data preparation for Apache Hadoop workloads. To check it out or help contribute, you can find the code here.
Cloudera Morphlines is a new open source framework that reduces the time and effort necessary to integrate, build, and change Hadoop processing applications that extract, transform, and load data into Apache Solr, Apache HBase, HDFS, enterprise data warehouses, or analytic online dashboards. If you want to integrate, build, or facilitate transformation pipelines without programming and without substantial MapReduce skills, and get the job done with a minimum amount of fuss and support costs, this post gets you started.
Below please find our regularly scheduled quarterly update about where to find tech talks by Cloudera employees this year – this time, for July through September 2013. Note that this list will be continually curated during the period; complete logistical information may not be available yet.
As always, we’re standing by to assist your meetup by providing speakers, sponsorships, and schwag!
| Date | Location | Event | Speaker and topic |
|---|---|---|---|
| July 11 | Boston | Boston HUG | Solr Committer Mark Miller on Solr+Hadoop |
| July 11 | Santa Clara, Calif. | Big Data Gurus | Patrick Hunt on Solr+Hadoop |
| July 11 | Palo Alto, Calif. | Cloudera Manager Meetup | Phil Zeyliger on Cloudera Manager internals |
| July 11 | Kansas City, Mo. | KC Big Data | Matt Harris on Impala |
| July 17 | Mountain View, Calif. | Bay Area Hadoop Meetups | Patrick Hunt on Solr+Hadoop |
| July 22 | Chicago | Chicago Big Data | Hadoop and Lucene founder Doug Cutting on Solr+Hadoop |
| July 22 | Portland, Ore. | OSCON 2013 | Tom Wheeler on “Introduction to Apache Hadoop” |
| July 24 | Portland, Ore. | OSCON 2013 | Sqoop Committer Kate Ting on “Building an Impenetrable ZooKeeper” |
| July 24 | Portland, Ore. | OSCON 2013 | Jesse Anderson on “Doing Data Science On NFL Play by Play” |
| July 24 | Portland, Ore. | OSCON 2013 | Bigtop Committer Mark Grover on “Getting Hadoop, Hive and HBase up and running in less than 15 minutes” |
| July 24 | Portland, Ore. | OSCON 2013 | Hadoop Committer Colin McCabe on Locksmith |
| July 25 | San Francisco | SF Data Engineering | Wolfgang Hoschek on Morphlines |
| July 25 | Washington DC | Hadoop-DC | Joey Echeverria on Accumulo |
| Aug. 14 | San Francisco | SF Hadoop Users | TBD, but we’re hosting! |
| Aug. 14 | LA | LA HBase Users Meetup | HBase Committer/PMC Chair Michael Stack on HBase |
| Aug. 29 | London | London Java Community | Hadoop Committer Tom White on CDK |
| Sept. 11 | San Francisco | Cloudera Sessions (SOLD OUT) | Eric Sammer-led CDK lab |
| Sept. 12 | New York | NYC Search, Discovery & Analytics Meetup | Solr Committer Mark Miller on Solr+Hadoop |
| Sept. 12 | Cambridge, UK | Enterprise Search Cambridge UK | Tom White on Solr+Hadoop |
| Sept. 12 | Los Angeles | LA Hadoop Users Group | Greg Chanan on Solr+Hadoop |
| Sept. 16 | Sunnyvale, Calif. | Big Data Gurus | Eric Sammer on CDK |
| Sept. 17 | Sunnyvale, Calif. | SF Large-Scale Production Engineering | Darren Lo on Hadoop Ops |
| Sept. 18 | Mountain View, Calif. | Silicon Valley JUG | Wolfgang Hoschek on Morphlines |
| Sept. 19 | El Dorado Hills, Calif. | NorCal Big Data | Apache Bigtop Committer Sean Mackrory on Bigtop & QuickStart VM |
| Sept. 24 | Washington DC | Hadoop-DC | Doug Cutting on Apache Lucene |
Doug Cutting’s recent post about Cloudera Search included a hat-tip to Aaron McCurry, founder of the Blur project, for inspiring some of its design principles. We thought you would be interested in hearing more about Blur (which is mentored by Doug and Cloudera’s Patrick Hunt) from Aaron himself – thanks, Aaron, for the guest post below!
Blur is an Apache Incubator project that provides distributed search functionality on top of Apache Hadoop, Apache Lucene, Apache ZooKeeper, and Apache Thrift. When I started building Blur three years ago, there wasn’t a search solution that had a solid integration with the Hadoop ecosystem. Our initial needs were to be able to index our data using MapReduce, store indexes in HDFS, and serve those indexes from clusters of commodity servers while remaining fault tolerant. Blur was built specifically for Hadoop — taking scalability, redundancy, and performance into consideration from the very start — while leveraging all the great features that already exist in the Hadoop stack.
For those who are unfamiliar with it, Hue is a very popular, end-user-focused, fully open source Web UI designed for interaction with Apache Hadoop and its ecosystem components. Founded by Cloudera employees, Hue has been around for quite some time, but only in the last 12 months has it evolved into the great ramp-up and interaction tool it is today. It’s fair to say that Hue is the most popular open source GUI for the Hadoop ecosystem among beginners, as well as a valuable tool for seasoned Hadoop users (and users generally in an enterprise environment), and it is the only end-user tool that ships with Hadoop distributions today. In fact, Hue is even redistributed and marketed as part of other user-experience and ramp-up-on-Hadoop VMs in the market.
Just in time for Hadoop Summit 2013, the Apache Bigtop team is very pleased to announce the release of Bigtop 0.6.0: the very first release of a fully integrated Big Data management distribution built on Hadoop 2.0.5-alpha, currently the most advanced Hadoop 2.x release.
Bigtop, as many of you might already know, is a project aimed at creating a 100% open source and community-driven Big Data management distribution based on Apache Hadoop. (You can learn more about it by reading one of our previous blog posts on Apache Blogs.) Bigtop also plays an important role in CDH, which uses Bigtop’s packaging code; Cloudera takes pride in developing open source packaging code and contributing it back to the community.
In this Customer Spotlight, I’d like to highlight some undeniably positive use cases for Big Data by looking at some of the ways the healthcare and life sciences industries are innovating to benefit humankind. Here are just a few examples:
Mount Sinai School of Medicine has partnered with Cloudera’s own Jeff Hammerbacher to apply Big Data to better predict and understand disease processes and treatments. The Mount Sinai School of Medicine is a top medical school in the US, noted for innovation in biomedical research, clinical care delivery, and community services. With Cloudera’s Big Data technology and Jeff’s data science expertise, Mount Sinai is better equipped to develop solutions designed for high-performance, scalable data analysis and multi-scale measurements. For example, medical research and discovery areas in genotype, gene expression and organ health will benefit from these Big Data applications.
CDH, Cloudera’s 100% open source distribution of Apache Hadoop and related projects, has successfully enabled Big Data processing for many years. The typical approach is to ingest a large volume and wide variety of data into HDFS or Apache HBase for cost-efficient storage and flexible, scalable processing. Over time, various tools that allow easier access have emerged, so you can now interact with Hadoop through various programming methods and the very familiar structured query capabilities of SQL.
However, many users with less interest in programmatic interaction have been shut out of the value that Hadoop creates from Big Data. And teams trying to achieve more innovative processing struggle to find a time-efficient way to interact with, and explore, the data in Hadoop or HBase.
For years, Cloudera has provided virtual machines that give you a working Apache Hadoop environment out-of-the-box. It’s the quickest way to learn and experiment with Hadoop right from your desktop.
We’re constantly updating and improving the QuickStart VM, and the latest release includes two new Cloudera products that give you easier and faster access to your data: Cloudera Search and Cloudera Impala. We’ve also added corresponding applications to Hue, an open source web-based interface for Hadoop and the easiest way to interact with your data.
Hadoop Summit convenes next week, and even if you’re not attending, there are a host of meetup opportunities available to you during the week.
Here are just a few, and you can find a full list here.
Starting in CDH 4.2, YARN/MapReduce 2 (MR2) includes an even more powerful Fair Scheduler. In addition to doing nearly all that it could do in MapReduce 1 (MR1), the YARN Fair Scheduler can schedule non-MapReduce jobs, schedule based on fine-grained memory instead of slots, and support hierarchical queues. In this post, you’ll learn what the Fair Scheduler’s role is and how it fulfills it, and what it means to be a YARN “scheduler”; you’ll then dive into its new features and how to get them running on your cluster.
YARN/MR2 vs. MR1
YARN uses updated terminology to reflect that it no longer manages resources only for MapReduce. From YARN’s perspective, a MapReduce job is an application, and YARN schedules containers for map and reduce tasks to live in. What the MR1 Fair Scheduler called pools are now called queues, for consistency with the Capacity Scheduler. An excellent and deeper explanation is available here.
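To make the idea of memory-based scheduling over hierarchical queues a little more concrete, here is a minimal Python sketch. It is not the Fair Scheduler's actual code or configuration: the `Queue` class, the queue names, the weights, and the 16 GB cluster size are all hypothetical, and the sketch ignores demand, minimum shares, and preemption. It only shows how a pool of memory could be divided down a queue tree in proportion to per-queue weights.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Queue:
    """A hypothetical queue node; names and weights are illustrative only."""
    name: str
    weight: float = 1.0
    children: List["Queue"] = field(default_factory=list)


def fair_shares(queue: Queue, memory_mb: float, prefix: str = "") -> Dict[str, float]:
    """Split memory_mb among the leaf queues under `queue`.

    Each parent divides its share among its children in proportion to
    their weights; a leaf keeps whatever it is handed.
    """
    path = f"{prefix}.{queue.name}" if prefix else queue.name
    if not queue.children:
        return {path: memory_mb}
    total_weight = sum(child.weight for child in queue.children)
    shares: Dict[str, float] = {}
    for child in queue.children:
        child_memory = memory_mb * child.weight / total_weight
        shares.update(fair_shares(child, child_memory, path))
    return shares


# Assumed example hierarchy: two top-level queues, one with two sub-queues.
root = Queue("root", children=[
    Queue("research", weight=1.0, children=[Queue("adhoc"), Queue("batch")]),
    Queue("production", weight=3.0),
])

shares = fair_shares(root, memory_mb=16384)
for name, mb in sorted(shares.items()):
    print(f"{name}: {mb:.0f} MB")
```

With these assumed weights, `root.production` receives three quarters of the memory and the two `root.research` sub-queues split the remaining quarter evenly. A real scheduler would also account for what each queue is actually requesting at any given moment.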