Cloudera Engineering Blog · Hadoop Posts

Introducing Morphlines: The Easy Way to Build and Integrate ETL Apps for Hadoop

This post is the first in a series of blog posts about Cloudera Morphlines, a new command-based framework that simplifies data preparation for Apache Hadoop workloads. To check it out or help contribute, you can find the code here.

Cloudera Morphlines is a new open source framework that reduces the time and effort necessary to integrate, build, and change Hadoop processing applications that extract, transform, and load data into Apache Solr, Apache HBase, HDFS, enterprise data warehouses, or analytic online dashboards. If you want to integrate, build, or facilitate transformation pipelines without programming and without substantial MapReduce skills, and get the job done with a minimum amount of fuss and support costs, this post gets you started.

Where to Find Cloudera Tech Talks Through September 2013

Below please find our regularly scheduled quarterly update about where to find tech talks by Cloudera employees this year – this time, for July through September 2013. Note that this list will be continually curated during the period; complete logistical information may not be available yet.

As always, we’re standing by to assist your meetup by providing speakers, sponsorships, and schwag!

Date City Venue Speaker(s)
July 11 Boston Boston HUG Solr Committer Mark Miller on Solr+Hadoop
July 11 Santa Clara, Calif. Big Data Gurus Patrick Hunt on Solr+Hadoop
July 11 Palo Alto, Calif. Cloudera Manager Meetup Phil Zeyliger on Cloudera Manager internals
July 11 Kansas City, Mo. KC Big Data Matt Harris on Impala
July 17 Mountain View, Calif. Bay Area Hadoop Meetups Patrick Hunt on Solr+Hadoop
July 22 Chicago Chicago Big Data Hadoop and Lucene founder Doug Cutting on Solr+Hadoop
July 22 Portland, Ore. OSCON 2013 Tom Wheeler on “Introduction to Apache Hadoop”
July 24 Portland, Ore. OSCON 2013 Sqoop Committer Kate Ting on “Building an Impenetrable ZooKeeper”
July 24 Portland, Ore. OSCON 2013 Jesse Anderson on “Doing Data Science On NFL Play by Play”
July 24 Portland, Ore. OSCON 2013 Bigtop Committer Mark Grover on “Getting Hadoop, Hive and HBase up and running in less than 15 minutes”
July 24 Portland, Ore. OSCON 2013 Hadoop Committer Colin McCabe on Locksmith
July 25 San Francisco SF Data Engineering Wolfgang Hoschek on Morphlines
July 25 Washington DC Hadoop-DC Joey Echeverria on Accumulo
Aug. 14 San Francisco SF Hadoop Users TBD, but we’re hosting!
Aug. 14 LA LA HBase Users Meetup HBase Committer/PMC Chair Michael Stack on HBase
Aug. 29 London London Java Community Hadoop Committer Tom White on CDK
Sept. 11 San Francisco Cloudera Sessions (SOLD OUT) Eric Sammer-led CDK lab
Sept. 12 New York NYC Search, Discovery & Analytics Meetup Solr Committer Mark Miller on Solr+Hadoop
Sept. 12 Cambridge, UK Enterprise Search Cambridge UK Tom White on Solr+Hadoop
Sept. 12 Los Angeles LA Hadoop Users Group Greg Chanan on Solr+Hadoop
Sept. 16 Sunnyvale, Calif. Big Data Gurus Eric Sammer on CDK
Sept. 17 Sunnyvale, Calif. SF Large-Scale Production Engineering Darren Lo on Hadoop Ops
Sept. 18 Mountain View, Calif. Silicon Valley JUG Wolfgang Hoschek on Morphlines
Sept. 19 El Dorado Hills, Calif. NorCal Big Data Apache Bigtop Committer Sean Mackrory on Bigtop & QuickStart VM
Sept. 24 Washington DC Hadoop-DC Doug Cutting on Apache Lucene

The Blur Project: Marrying Hadoop with Lucene

Doug Cutting’s recent post about Cloudera Search included a hat-tip to Aaron McCurry, founder of the Blur project, for inspiring some of its design principles. We thought you would be interested in hearing more about Blur (which is mentored by Doug and Cloudera’s Patrick Hunt) from Aaron himself – thanks, Aaron, for the guest post below!

Blur is an Apache Incubator project that provides distributed search functionality on top of Apache Hadoop, Apache Lucene, Apache ZooKeeper, and Apache Thrift. When I started building Blur three years ago, there wasn’t a search solution that had a solid integration with the Hadoop ecosystem. Our initial needs were to be able to index our data using MapReduce, store indexes in HDFS, and serve those indexes from clusters of commodity servers while remaining fault tolerant. Blur was built specifically for Hadoop — taking scalability, redundancy, and performance into consideration from the very start — while leveraging all the great features that already exist in the Hadoop stack.

What a Great Year for Hue Users!

HueWith the recent release of CDH 4.3, which contains Hue 2.3, I’d like to report on the fantastic progress of Hue in the past year.

For those who are unfamiliar with it, Hue is a very popular, end-user focused, fully open source Web UI designed for interaction with Apache Hadoop and its ecosystem components. Founded by Cloudera employees, Hue has been around for quite some time, but only in the last 12 months has it evolved into the great ramp-up and interaction tool it is today. It’s fair to say that Hue is the most popular open source GUI for the Hadoop ecosystem among beginners — as well as a valuable tool for seasoned Hadoop users (and users generally in an enterprise environment) – and it is the only end-user tool that ships with Hadoop distributions today. In fact, Hue is even redistributed and marketed as part of other user-experience and ramp-up-on-Hadoop VMs in the market.

Apache Bigtop: The "Fedora of Hadoop" is Now Built on Hadoop 2.x

BigtopJust in time for Hadoop Summit 2013, the Apache Bigtop team is very pleased to announce the release of Bigtop 0.6.0: The very first release of a fully integrated Big Data management distribution built on the currently most advanced Hadoop 2.x, Hadoop 2.0.5-alpha.

Bigtop, as many of you might already know, is a project aimed at creating a 100% open source and community-driven Big Data management distribution based on Apache Hadoop. (You can learn more about it by reading one of our previous blog posts on Apache Blogs.) Bigtop also plays an important role in CDH, which utilizes its packaging code from Bigtop — Cloudera takes pride in developing open source packaging code and contributing the same back to the community.

Customer Spotlight: Big Data Making a Big Impact in Healthcare and Life Sciences

In this Customer Spotlight, I’d like to emphasize some undeniably positive use cases for Big Data, by looking at some of the ways the healthcare and life sciences industries are innovating to benefit humankind. Here are just a few examples:

Mount Sinai School of Medicine has partnered with Cloudera’s own Jeff Hammerbacher to apply Big Data to better predict and understand disease processes and treatments. The Mount Sinai School of Medicine is a top medical school in the US, noted for innovation in biomedical research, clinical care delivery, and community services. With Cloudera’s Big Data technology and Jeff’s data science expertise, Mount Sinai is better equipped to develop solutions designed for high-performance, scalable data analysis and multi-scale measurements. For example, medical research and discovery areas in genotype, gene expression and organ health will benefit from these Big Data applications.

Hadoop for Everyone: Inside Cloudera Search

CDH, Cloudera’s 100% open source distribution of Apache Hadoop and related projects, has successfully enabled Big Data processing for many years. The typical approach is to ingest a large set of a wide variety of data into HDFS or Apache HBase for cost-efficient storage and flexible, scalable processing. Over time, various tools to allow for easier access have emerged — so you can now interact with Hadoop through various programming methods and the very familiar structured query capabilities of SQL.

However, many users with less interest in programmatic interaction have been shut out of the value that Hadoop creates from Big Data. And teams trying to achieve more innovative processing struggle with a time-efficient way to interact with, and explore, the data in Hadoop or HBase.

QuickStart VM: Now with Real-Time Big Data

For years, Cloudera has provided virtual machines that give you a working Apache Hadoop environment out-of-the-box. It’s the quickest way to learn and experiment with Hadoop right from your desktop.

We’re constantly updating and improving the QuickStart VM, and in the latest release there are two of Cloudera’s new products that give you easier and faster access to your data: Cloudera Search and Cloudera Impala. We’ve also added corresponding applications to Hue – an open source web-based interface for Hadoop, and the easiest way to interact with your data.

Meetups at Hadoop Summit

Hadoop Summit convenes next week, and even if you’re not attending, there are a host of meetup opportunities available to you during the week.

Here are just a few, and you can find a full list here.

Improvements in the Hadoop YARN Fair Scheduler

Starting in CDH 4.2, YARN/MapReduce 2 (MR2) includes an even more powerful Fair Scheduler. In addition to doing nearly all that it could do in MapReduce 1 (MR1), the YARN Fair Scheduler can schedule non-MapReduce jobs, schedule based on fine-grained memory instead of slots, and support hierarchical queues. In this post, you’ll learn what the Fair Scheduler’s role is and how it fulfills it, what it means to be a YARN “scheduler,” and dive into its new features and how to get them running on your cluster.

YARN/MR2 vs. MR1

YARN uses an updated terminology to reflect that it no longer just manages resources for MapReduce. From YARN’s perspective, a MapReduce job is an application. YARN schedules containers for map and reduce tasks to live in. What was referred to as pools in the MR1 Fair Scheduler has been updated to queue for consistency with the capacity scheduler. An excellent and deeper explanation is available here.

How Does it Work?

Newer Posts Older Posts