Cloudera Engineering Blog · General Posts

Video Premiere: Training a New Generation of Data Scientists

Data scientists use data as a platform to answer previously unimaginable questions. These multi-talented professionals are in demand like never before because they identify or create some of the most exciting and potentially profitable business opportunities across industries. However, the scarcity of external talent means that companies of all sizes will need to find, develop, and train people with backgrounds in software engineering, statistics, or traditional business intelligence as the next generation of data scientists.

Join us for the premiere of Training a New Generation of Data Scientists on Tuesday, March 26, at 2pm ET/11am PT. In this video, Cloudera’s Senior Director of Data Science, Josh Wills, will discuss what data scientists do, how they think about problems, the relationship between data science and Hadoop, and how Cloudera training can help you join this increasingly important profession. Following the video, Josh will answer your questions about data science, Hadoop, and Cloudera’s Introduction to Data Science: Building Recommender Systems course.

How To: Use Oozie Shell and Java Actions

Apache Oozie, the workflow coordinator for Apache Hadoop, has actions for running MapReduce, Apache Hive, Apache Pig, Apache Sqoop, and Distcp jobs; it also has a Shell action and a Java action. These last two actions let us execute arbitrary shell commands or Java code, respectively.

In this blog post, we’ll look at an example use case and see how to use both the Shell and Java actions in more detail. Please follow along below; you can get a copy of the full project at Cloudera’s GitHub as well. This how-to assumes some basic familiarity with Oozie.
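Before we get to the use case, here is a minimal sketch of the kind of driver class an Oozie Java action can launch; the class name and argument below are hypothetical placeholders rather than part of the example project. Oozie simply runs the main() method of whatever class the workflow’s Java action names, and a thrown exception or a non-zero System.exit marks the action as failed.

    // Hypothetical driver class for an Oozie Java action; names and arguments
    // are illustrative only. Oozie calls this class's main() method directly.
    public class CleanupDriver {
        public static void main(String[] args) throws Exception {
            if (args.length < 1) {
                System.err.println("Usage: CleanupDriver <input-path>");
                System.exit(1); // non-zero exit: Oozie treats the action as failed
            }
            String inputPath = args[0];
            // Real preparation or cleanup logic would go here; the sketch just logs.
            System.out.println("Processing " + inputPath);
        }
    }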

Example Use Case

Welcome, KijiMR

The following guest post is provided by Aaron Kimball, CTO of WibiData.

The Kiji ecosystem has grown with the addition of a new module, KijiMR. The Kiji framework is a collection of components that gives developers a foundation for building Big Data applications. Following the first release, KijiSchema, we are now proud to announce the availability of a second component: KijiMR. KijiMR lets KijiSchema users apply MapReduce techniques, including machine-learning algorithms and complex analytics, to the data they store in KijiSchema and build many kinds of applications on top of it. Read on to learn more about the major features included in KijiMR and how you can use them.

What’s New in Hue 2.2?

This post is about the new release of Hue, the open source web-based interface that makes Apache Hadoop easier to use, which is included in CDH4.2.

Hue lets you interact with Hadoop services from within your browser without having to go to a command-line interface. It features a file browser for HDFS, an Apache Oozie Application for creating workflows of data processing jobs, a job designer/browser for MapReduce, Apache Hive and Cloudera Impala query editors, a Shell, and a collection of Hadoop APIs.

New Products and Releases: Cloudera Navigator, Cloudera Enterprise BDR, and More

Today is an exciting day for Cloudera customers and users. With an update to our 100% open source platform and a number of new add-on products, every software component we ship is getting either a minor or major update. There’s a lot to cover and this blog post is only a summary. In the coming weeks we’ll do follow-on blog posts that go deeper into each of these releases.

New Products

Apache Hadoop 2.0.3-alpha Released

Last week the Apache Hadoop PMC voted to release Apache Hadoop 2.0.3-alpha, the latest in the Hadoop 2 release series. This release fixes over 500 issues (covering the Common, HDFS, MapReduce, and YARN sub-projects) since the 2.0.2-alpha release in October of last year. In addition to bug fixes and general improvements, the more noteworthy changes include:

Cloudera is the Second "Sexiest Enterprise Startup" for 2012

Last Thursday, I had the pleasure of attending the Crunchies with Alan Saldich (our VP of Marketing) and Sarah Mustarde (our Senior Director of Corporate Marketing). The Crunchies is an awards event a la “the Oscars” but for startups. 

Ph.D. Interns at Cloudera: Bringing Big Data Back to School

The following is a series of stories from people who recently worked as Engineering Interns at Cloudera. These experiences concretely illustrate how collaboration between commercial companies like Cloudera and academia, for example through internships like these, helps promote big data research at universities. (These stories were previously published in the ACM student journal, XRDS.)

Yanpei Chen (Intern 2011)

A Ruby Client for Impala

Thanks to Stripe’s Colin Marc (@colinmarc) for the guest post below, and for his work on the world’s first Ruby client for Cloudera Impala!

As at most other companies, it has become increasingly hard at Stripe to answer the big and interesting questions as our datasets get bigger. This is pretty insidious: the set of potentially interesting questions also grows as you acquire more data. Answering questions like “Which regions have the most developers per capita?” or “How do different countries compare in how they spend online?” might involve hours of scripting, waiting, and generally lots of lost developer time.

Cloudera at Strata Santa Clara 2013

Join us February 26 – 28 at Strata Santa Clara and learn how Cloudera has developed the de facto platform for Big Data. Visit Cloudera in booth 701 to hear from our team in a series of presentations and partner-integrated demonstrations – agenda coming soon. We will also be hosting several celebrated authors from the Big Data community, who will be available to sign copies of their published works and talk with you about the Big Data landscape and specific projects within it. Not only will our booth be packed with a full speaker and demonstration line-up and several authors, but you will also have the opportunity to meet Doug Cutting. Doug helped found the Hadoop project and named it “Hadoop” after his son’s stuffed elephant.

Meet the Instructor: Jesse Anderson

The Hadoop Community is an invariably fascinating world. After all, as Clouderan ATM put it in a past blog post, the user group meetups are adorably called “HUGs.” Just as the Cloudera blog has introduced you to some of the engineers, projects, and applications that serve as the head, heart, and hands of the Hadoop Community, we’re proud to add the circulatory system (to extend the metaphor), made up of Cloudera’s expert trainers and curriculum developers who bring Hadoop to new practitioners around the world every week.

Welcome to the first installment of our “Meet the Instructor” series, in which we briefly introduce you to some of the individuals endeavoring to teach Hadoop far and wide. Today, we speak to Jesse Anderson (@jessetanderson)! 

How-to: Do Apache Flume Performance Tuning (Part 1)

The post below was originally published at blogs.apache.org and is republished here for your reading pleasure.

This is Part 1 in a series of articles about tuning the performance of Apache Flume, a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of event data.

Get a Free Hadoop Operations Ebook with Administrator Training

Start the year off with bigger questions by taking advantage of Cloudera University’s special offer for aspiring Hadoop administrators. All participants who complete a Cloudera Administrator Training for Apache Hadoop public course by the end of March 2013 will receive a free digital copy of Hadoop Operations by Eric Sammer. If you’ve been asked to maintain large and complex Hadoop clusters, this book is a must. In addition to providing practical guidance from an expert, Hadoop Operations is also a terrific companion reference to the full Cloudera Administrator course.

Cloudera’s three-day course provides administrators a comprehensive understanding of all the steps necessary to operate and manage Hadoop clusters. From installation and configuration through load balancing and tuning your cluster, Cloudera’s administration course has you covered. This course is appropriate for system administrators and others who will be setting up or maintaining a Hadoop cluster. Basic Linux experience is a prerequisite, but prior knowledge of Hadoop is not required.

New: Cloudera Manager Free Edition Demo VM

With the availability of this new demo VM containing Cloudera Manager Free Edition and CDH4.1.2 on CentOS 6.2, quick hands-on experience with a freeze-dried single-node Apache Hadoop cluster is just a download and a few minutes away.

This new addition to our growing Demo VM menagerie is available, as usual, in VMware, VirtualBox, and KVM flavors. A 64-bit host OS is required.

Introducing Hannibal: A Tool for Apache HBase Region Monitoring

The following is a guest post from Nils Kübler, the creator of the Hannibal project. He is a software engineer at Sentric, a Swiss big data specialist providing consultancy, development, and training.

Hannibal aims to help Apache HBase administrators monitor the cluster in terms of region distribution and is basically a decision-making aid for manual splitting. It widens the monitoring capabilities of HBase by providing different views with interactive graphs of the cluster. Hannibal is also a Web-based tool that fits smoothly into your existing Hadoop/HBase ecosystem.

The "Ask Bigger Questions" Contest!

Have you helped your company ask bigger questions? Our mission at Cloudera University is to equip Hadoop professionals with the skills to manage, process, analyze, and monetize more data than they ever thought possible.

Over the past three years, we’ve heard many great stories from our training participants about faster cluster deployments, complex data workflows made simple, and superhero troubleshooting moments. And we’ve heard from executives in all types of businesses that staffing Cloudera Certified professionals gives them confidence that their Hadoop teams have the skills to turn data into breakthrough insights.

How to Get Rich on Big Data

The 2012 Strata + Hadoop World conference took place the week before last in New York City. Cloudera co-presented the conference with O’Reilly Media this year, and we were really pleased with how the event turned out. Of course we launched Cloudera Impala, but there was a ton of news from companies across the Apache Hadoop ecosystem. Andrew Brust over at ZDNet wins the prize for comprehensive coverage of all the announcements. I also liked Tony Baer’s excellent roll-up of all the SQL news on the OnStrategies blog.

One piece of coverage crossed my inbox this past week that is not generally available. Peter Goldmacher is a Managing Director and Senior Research Analyst for The Cowen Group, a financial services company headquartered in Manhattan. Cowen helps its clients invest wisely, and Peter’s job is to research and report on industry trends that could shape that investment. Peter and his colleague Joe del Callar wrote up an excellent analysis of the Big Data market after attending Strata + Hadoop World. Because their report is published primarily for Cowen’s clients, it’s not easy to link to. Peter has, however, graciously given me permission to excerpt it here. Thank you, Peter!

Training a New Generation of Data Scientists

Last week at Strata + Hadoop World 2012, we announced a new data science training and certification program. I am very excited to have been part of the team that put the program together, and I would like to answer some of the most frequently asked questions about the course and the certification that we will be offering.

Why is Cloudera offering data science training?

The primary bottleneck on the success of Hadoop is the number of people who are capable of using it effectively to solve business problems. Addressing that bottleneck with training has always been a very large part of our mission here at Cloudera, and we are very fortunate to have one of the best training teams anywhere. So far, we have trained over 15,000 Hadoop developers and administrators, and our courses and certification exams are available all over the world.

Quorum-based Journaling in CDH4.1

A few weeks back, Cloudera announced CDH 4.1, the latest update release to Cloudera’s Distribution including Apache Hadoop. This is the first release to introduce truly standalone High Availability for the HDFS NameNode, with no dependence on special hardware or external software. This post explains the inner workings of this new feature from a developer’s standpoint. If, instead, you are seeking information on configuring and operating this feature, please refer to the CDH4 High Availability Guide.

Background

Since the beginning of the project, HDFS has been designed around a very simple architecture: a master daemon, called the NameNode, stores filesystem metadata, while slave daemons, called DataNodes, store the filesystem data. The NameNode is highly reliable and efficient, and the simple architecture is what has allowed HDFS to reliably store petabytes of production-critical data in thousands of clusters for many years; however, for quite some time, the NameNode was also a single point of failure (SPOF) for an HDFS cluster. Since the first beta release of CDH4 in February, this issue has been addressed by the introduction of a Standby NameNode, which provides automatic hot failover capability to a backup. For a detailed discussion of the design of the HA NameNode, please refer to the earlier post by my colleague Aaron Myers.

Limitations of NameNode HA in Previous Versions

CDH4.1 Now Released!

Update time! As a reminder, Cloudera releases major versions of CDH, our 100% open source distribution of Apache Hadoop and related projects, annually, and then updates CDH every three months. Updates primarily comprise bug fixes, but we also add enhancements. We only include fixes or enhancements that maintain compatibility, improve system stability, and still allow customers and users to skip updates as they see fit.

We’re pleased to announce the availability of CDH4.1. We’ve seen excellent adoption of CDH4.0 since it went GA at the end of June, and a number of exciting use cases have moved to production. CDH4.1 is an update that brings many fixes as well as several useful enhancements. Among them:

Apache Hadoop Wins Duke’s Choice Award, is a Java Ecosystem “MVP”

For those of you new to it, the Duke’s Choice Awards program was initiated by Sun Microsystems in 2002 in an effort to “celebrate extreme innovation in the world of Java technology” – in essence, it’s the “MVP” of the Java ecosystem. Since its acquisition of Sun closed in 2010, Oracle has continued the tradition of bestowing the award, and in fact has made the process more community-oriented by accepting nominations from the public and involving Java User Groups in the judging effort.

For the 2012 awards, I’m happy to report that Apache Hadoop is among the awardees, which also include the United Nations High Commissioner for Refugees, Liquid Robotics, and Java cloud company Jelastic Inc., among others.

About Apache Flume FileChannel

The post below was originally published at blogs.apache.org and is republished here for your reading pleasure.

This blog post is about Apache Flume’s File Channel. Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic application.

Schedule This! Strata + Hadoop World Speakers from Cloudera

Strata Conference + Hadoop World 2012 is getting really close (just over a month away), so it’s time to plan your schedule. You may want to consider adding the tutorials, sessions, and keynotes below to your calendar! (Start times are always subject to change, of course.)

The ones listed below are led or co-led by Clouderans, but there is certainly a wide range of attractive choices beyond what you see here. We just want to ensure that you put these particular ones high on your consideration list.

Meet the Engineer: Jon Natkins

In this installment of “Meet the Engineer”, meet Jonathan Natkins, also known as “Natty” to his friends and colleagues.

What do you do at Cloudera, and in which Apache project are you involved?

How-to: Analyze Twitter Data with Apache Hadoop

Social media has gained immense popularity with marketing teams, and Twitter is an effective tool for a company to get people excited about its products. Twitter makes it easy to engage users and communicate directly with them, and in turn, users can provide word-of-mouth marketing for companies by discussing the products. Given limited resources, and knowing that they may not be able to talk directly to everyone they want to target, marketing departments can be more efficient by being selective about whom they reach out to.

In this post, we’ll learn how we can use Apache Flume, Apache HDFS, Apache Oozie, and Apache Hive to design an end-to-end data pipeline that will enable us to analyze Twitter data. This will be the first post in a series. The posts to follow will describe, in more depth, how each component is involved and how the custom code operates. All the code and instructions necessary to reproduce this pipeline are available on the Cloudera GitHub.

Who is Influential?

Community Meetups at Strata + Hadoop World 2012

Strata Conference + Hadoop World (Oct. 23-25 in New York City) is a bonanza for Hadoop and big data enthusiasts – but not only because of the technical sessions and tutorials. It’s also an important gathering place for the developer community, most of whom are eager to share info from their experiences in the “trenches”.

Just to make that process easier, Cloudera is teaming up with local meetups during that week to organize a series of meetings on a variety of topics. (If for no other reason, stop into one of these meetups for a chance to grab a coveted Cloudera t-shirt.)

Exploring Compression for Hadoop: One DBA’s Story

This guest post comes to us courtesy of Gwen Shapira (@gwenshap), a database consultant for The Pythian Group (and an Oracle ACE Director).

Most western countries use street names and numbers to navigate inside cities. But in Japan, where I live now, very few streets have them.

Cloudera Enterprise in Less Than Two Minutes

What’s to love about Cloudera Enterprise? A lot! But rather than bury you in documentation today, we’d rather bring you a less-than-two-minute-long video:

Cloudera Manager 4.0: Customer Feedback and Adoption

It’s been roughly three months since we announced GA of Cloudera Manager 4.0 (CM4) and I wanted to provide an update on its adoption and feedback from customers.

For those new to it, Cloudera Manager is the first and market-leading management platform for CDH (Cloudera’s Distribution Including Apache Hadoop). Enterprise customers are coming to expect an end-to-end tool that manages the entire lifecycle of their Hadoop operations. In fact, in a recent Cloudera customer survey, an overwhelming 95%  emphasized the need for this approach. 

Apache HBase Replication: Operational Overview

This is the second blog post about Apache HBase replication. The previous post, HBase Replication Overview, discussed use cases, architecture, and the different modes supported in HBase replication. This post takes an operational perspective and touches upon HBase replication configuration and key concepts for using it, such as bootstrapping, schema change, and fault tolerance.

Configuration

As mentioned in HBase Replication Overview, the master cluster ships WALEdits to one or more slave clusters. This section describes the steps needed to configure replication in master-slave mode.

  1. All tables/column families that are to be replicated must exist on both clusters.
  2. Add the following property to $HBASE_HOME/conf/hbase-site.xml on all nodes in both clusters and set it to true (a programmatic sketch of the same switch follows below).
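As a quick illustration, and assuming the CDH4/HBase 0.92 line where the switch in question is the hbase.replication flag, here is a minimal sketch of flipping that flag through the client Configuration API. In practice the value belongs in hbase-site.xml itself, on every node of both clusters.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class ReplicationFlagSketch {
        public static void main(String[] args) {
            // HBaseConfiguration.create() loads hbase-site.xml from the classpath.
            Configuration conf = HBaseConfiguration.create();
            // Equivalent to setting the property to "true" in hbase-site.xml
            // on every node of the master and slave clusters.
            conf.setBoolean("hbase.replication", true);
            System.out.println("hbase.replication = "
                    + conf.getBoolean("hbase.replication", false));
        }
    }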

CDH3 update 5 is now available

We are happy to announce the general availability of CDH3 update 5. This update is a maintenance release of the CDH3 platform and provides a considerable number of bug fixes and stability enhancements. Alongside these fixes, we have also included a few new features, the most notable of which are the following:

HttpFS for CDH3 – The Apache Hadoop FileSystem over HTTP

HttpFS is an HTTP gateway/proxy for Apache Hadoop FileSystem implementations. HttpFS comes with CDH4 and replaces HdfsProxy (which only provided read access). Its REST API is compatible with WebHDFS (which is included in CDH4 and the upcoming CDH3u5).

HttpFS is a proxy, so, unlike WebHDFS, it does not require that clients be able to access every machine in the cluster. This allows clients to access a cluster that is behind a firewall via the WebHDFS REST API. HttpFS also allows clients to access CDH3u4 clusters via the WebHDFS REST API.
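To make the REST API concrete, here is a minimal sketch of listing a directory through an HttpFS gateway using the WebHDFS-compatible LISTSTATUS operation. The host, path, and user name are placeholders, and the port assumes HttpFS’s default of 14000; a secured cluster would need real authentication rather than the simple user.name parameter.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class HttpFsListStatus {
        public static void main(String[] args) throws Exception {
            // Placeholder host, path, and user; 14000 is the HttpFS default port.
            URL url = new URL("http://httpfs-host.example.com:14000/webhdfs/v1/tmp"
                    + "?op=LISTSTATUS&user.name=someuser");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("GET");
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream()));
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // JSON listing of FileStatus objects
            }
            in.close();
            conn.disconnect();
        }
    }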

Apache ZooKeeper 3.3.6 has been released

Apache ZooKeeper release 3.3.6 is now available. This is a bug fix release covering 16 issues, two of which were considered blockers. Some of the more serious issues include:

Processing Rat Brain Neuronal Signals Using an Apache Hadoop Computing Cluster – Part III

Up to this point, we’ve described our reasons for using Hadoop and Hive on our neural recordings (Part I), the reasons why the analyses of these recordings are interesting from a scientific perspective, and detailed descriptions of our implementation of these analyses using Apache Hadoop and Apache Hive (Part II). The last part of this story cuts straight to the results and then discusses important lessons we learned along the way and future goals for improving the analysis framework we’ve built so far.

Results

Here are two plots of the output data from our benchmark run.  Both plots show the same data, one in three dimensions and the other in a two-dimensional density format.

Processing Rat Brain Neuronal Signals Using an Apache Hadoop Computing Cluster – Part I

Introduction

In this three-part series of posts, we will share our experiences tackling a scientific computing challenge that may serve as a useful practical example for those readers considering Apache Hadoop and Apache Hive as an option to meet their growing technical and scientific computing needs. This first part describes some of the background behind our application and the advantages of Hadoop that make it an attractive framework in which to implement our solution. Part II dives into the technical details of the data we aimed to analyze and of our solution. Finally, we wrap up this series in Part III with a description of some of our main results, and most importantly perhaps, a list of things we learned along the way, as well as future possibilities for improvements.

Background

About a year ago, after hearing increasing buzz about big data in general, and Hadoop in particular, I (Brad Rubin) saw an opportunity to learn more at our Twin Cities (Minnesota) Java User Group.  Brock Noland, the local Cloudera representative, gave an introductory talk.  I was really intrigued by the thought of leveraging commodity computing to tackle large-scale data processing.  I teach several courses at the University of St. Thomas Graduate Programs in Software, including one in information retrieval.  While I had taught the abstract principles behind the scale and performance solutions for indexing web-sized document collections, I saw an opportunity to integrate a real-world solution into the course.

Apache HBase Replication Overview

Apache HBase Replication is a way of copying data from one HBase cluster to a different and possibly distant HBase cluster. It works on the principle that the transactions from the originating cluster are pushed to another cluster. In HBase jargon, the cluster doing the push is called the master, and the one receiving the transactions is called the slave. This push of transactions is done asynchronously, and the transactions are batched up to a configurable size (the default is 64MB). Asynchronous mode incurs minimal overhead on the master, and shipping edits in batches increases overall throughput.

This blog post discusses the possible use cases, underlying architecture, and modes of HBase replication as supported in CDH4 (which is based on HBase 0.92). We will discuss replication configuration, bootstrapping, and fault tolerance in a follow-up post.

Use cases

Why we build our platform on HDFS

It’s not often that I have a chance to concur with my colleague E14 over at Hortonworks, but his recent blog post gave me the perfect opportunity. I wanted to build on a few of E14’s points and add some of my own.

A recent GigaOm article presented 8 alternatives to HDFS.  They actually missed at least 4 others.  For over a year, Parascale marketed itself as an HDFS alternative (until it became an asset sale to Hitachi).  Appistry continues to market its HDFS alternative.  I’m not sure if it’s released yet but it is very evident that Symantec’s Veritas unit is proposing its Clustered Filesystem (CFS) as an alternative to HDFS as well.  HP Ibrix has also supported the HDFS API for some years now.

Experimenting with MapReduce 2.0

In Building and Deploying MR2, we presented a brief introduction to MapReduce in Apache Hadoop 0.23 and focused on the steps to set up a single-node cluster. In MapReduce 2.0 in Hadoop 0.23, we discussed the new architectural aspects of the MapReduce 2.0 design. This blog post highlights the main issues to consider when migrating from MapReduce 1.0 to MapReduce 2.0. Note that both MapReduce 1.0 and MapReduce 2.0 are included in CDH4.

It is important to note that, at the time of this writing, MapReduce 2.0 is still in alpha and is not recommended for production use.
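One concrete detail worth knowing when experimenting (a sketch under the assumption of a CDH4-style setup) is the client-side property that selects which framework jobs are submitted to: mapreduce.framework.name, set to "classic" for MapReduce 1.0 or "yarn" for MapReduce 2.0. Normally this lives in mapred-site.xml; the snippet below just shows the property in use.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class Mr2FrameworkSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // "classic" submits to the MRv1 JobTracker; "yarn" submits to the
            // MRv2 ResourceManager. In practice this is set in mapred-site.xml.
            conf.set("mapreduce.framework.name", "yarn");
            Job job = Job.getInstance(conf, "mr2-framework-sketch");
            System.out.println("Submitting against: "
                    + job.getConfiguration().get("mapreduce.framework.name"));
        }
    }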

Apache HBase Log Splitting

In the recent blog post about the Apache HBase Write Path, we talked about the write-ahead log (WAL), which plays an important role in preventing data loss should an HBase region server fail. This blog post describes how HBase prevents data loss after a region server crashes, using an especially critical process for recovering lost updates called log splitting.

Log splitting

As we mentioned in the write path blog post, HBase data updates are stored in an in-memory structure called the memstore for fast writes. In the event of a region server failure, the contents of the memstore are lost because they have not yet been saved to disk. To prevent data loss in such a scenario, the updates are persisted to a WAL file before they are stored in the memstore. If a region server then fails, the lost contents of the memstore can be regenerated by replaying the updates (also called edits) from the WAL file.
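To see why the WAL matters from a client’s point of view, here is a small sketch against the HBase 0.92/0.94-era client API; the table name, column family, and values are placeholders. Puts go through the WAL by default, and the commented-out setWriteToWAL(false) call shows the switch that trades that protection away for speed.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class WalPutSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "example_table"); // placeholder table
            try {
                Put put = new Put(Bytes.toBytes("row1"));
                put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("value"));
                // put.setWriteToWAL(false); // skips the WAL: faster, but edits still
                //                           // in the memstore vanish if the server dies
                table.put(put);
            } finally {
                table.close();
            }
        }
    }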

Watching the Clock: Cloudera’s Response to Leap Second Troubles

At 5 pm PDT on June 30, a leap second was added to Coordinated Universal Time (UTC). Within an hour, Cloudera Support started receiving reports of systems running at 100% CPU utilization. The Support Team worked quickly to understand and diagnose the problem and soon published a solution. Leap second bugs, coupled with the Amazon Web Services outage, would make this Cloudera’s busiest support weekend to date.

Since Hadoop is written in Java and closely interoperates with the underlying OS, Cloudera Support troubleshoots not only all 17 components in the Hadoop ecosystem, but also any underlying Linux and Java bugs. Last weekend many of our customers were affected by the now infamous “leap second” bugs. Initially, many assumed that Java and Linux would process the leap second gracefully. However, we soon discovered that this wasn’t the case and depending on the version of Linux being used, several distinct issues were observed.

Background

The Apache Hadoop Ecosystem, Visualized in Datameer

This is a guest re-post from Datameer’s Director of Marketing, Rich Taylor. The original post can be found on the Datameer blog.

Datameer uses D3.js to power our Business Infographic™ designer. I thought I would show how we visualized the Apache Hadoop ecosystem connections. First using only D3.js, and second using Datameer 2.0.

Apache Flume Development Status Update

Apache Flume is a scalable, reliable, fault-tolerant, distributed system designed to collect, transfer, and store massive amounts of event data into HDFS. Apache Flume recently graduated from the Apache Incubator as a Top Level Project at Apache. Flume is designed to send data over multiple hops from the initial source(s) to the final destination(s). Click here for details of the basic architecture of Flume. In this article, we will discuss in detail some new components in Flume 1.x (also known as Flume NG), which is currently on the trunk branch: techniques and components that can be used to route the data, configuration validation, and, finally, support for serializing events.

In the past several months, contributors have been busy adding several new sources, sinks, and channels to Flume. Flume now supports Syslog as a source, with sources added for Syslog over both TCP and UDP.

Update on Apache Bigtop (incubating)

Introduction

Ever since Cloudera decided to contribute the code and resources for what would later become Apache Bigtop (incubating), we’ve been answering a very basic question: what exactly is Bigtop and why should you or anyone in the Apache (or Hadoop) community care? The earliest and the most succinct answer (the one used for the Apache Incubator proposal) simply stated that “Bigtop is a project for the development of packaging and tests of the Hadoop ecosystem”. That was a nice explanation of how Bigtop relates to the rest of the Apache Software Foundation’s (ASF) Hadoop ecosystem projects, yet it doesn’t really help you understand the aspirations of Bigtop.

The history

Cloudera was the first company to create an open source distribution that included Apache Hadoop, releasing the first version (CDH1) back in March, 2009.  The initial goal of CDH was to make Apache Hadoop easier to adopt, providing packaging to enable users to install Hadoop on popular Linux operating systems and not have to compile from source.

Apache Oozie (incubating) 3.2.0 release

This blog was originally posted on the Apache Blog for Oozie.

In June 2012, we released Apache Oozie (incubating) 3.2.0. Oozie is currently undergoing incubation at The Apache Software Foundation (see http://incubator.apache.org/oozie).

Apache Hadoop Beyond MapReduce, Part 1: Introducing Kitten

This week, a team of researchers at Google will be presenting a paper describing a system they developed that can learn to identify objects, including the faces of humans and cats, from an extremely large corpus of unlabeled training data. It is a remarkable accomplishment, both in terms of the system’s performance (a 70% improvement over the prior state-of-the-art) and its scale: the system runs on over 16,000 CPU cores and was trained on 10 million 200×200 pixel images extracted from YouTube videos.

Doug Cutting has described Apache Hadoop as “the kernel of a distributed operating system.” Until recently, Hadoop has been an operating system that was optimized for running a certain class of applications: the ones that could be structured as a short sequence of MapReduce jobs. Although MapReduce is the workhorse programming framework for distributed data processing, there are many difficult and interesting problems (including combinatorial optimization problems, large-scale graph computations, and machine learning models that identify pictures of cats) that can benefit from a more flexible execution environment.

Apache HBase Write Path

Apache HBase is the Hadoop database, and is based on the Hadoop Distributed File System (HDFS). HBase makes it possible to randomly access and update data stored in HDFS, but files in HDFS can only be appended to and are immutable after they are created.  So you may ask, how does HBase provide low-latency reads and writes? In this blog post, we explain this by describing the write path of HBase — how data is updated in HBase.

The write path is how HBase completes put or delete operations. The path begins at a client, moves to a region server, and ends when the data is eventually written to an HBase data file called an HFile. Included in the design of the write path are features that HBase uses to prevent data loss in the event of a region server failure. Therefore, understanding the write path can provide insight into HBase’s native data loss prevention mechanism.
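For readers who want to picture what actually enters the write path, here is a minimal sketch of a put and a delete using the HBase 0.92/0.94-era client API; the table name, column family, and values are placeholders. Each of these client calls travels the path described in the post: region server, then WAL, then memstore, and eventually an HFile on flush.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Delete;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class WritePathSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "example_table"); // placeholder table
            try {
                // A put: the update is logged to the WAL, buffered in the memstore,
                // and later flushed to an HFile by the region server.
                Put put = new Put(Bytes.toBytes("row1"));
                put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("value"));
                table.put(put);

                // A delete follows the same path, recorded as a tombstone marker.
                Delete delete = new Delete(Bytes.toBytes("row1"));
                table.delete(delete);
            } finally {
                table.close();
            }
        }
    }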

The Elephant in the Enterprise

On Tuesday, June 12th, The Churchill Club of Silicon Valley hosted a panel discussion on Hadoop’s evolution from an open-source project into a standard component of today’s enterprise computing fabric. The lively and dynamic discussion was moderated by Cade Metz, Editor, Wired Enterprise.

Panelists included:

Michael Driscoll, CEO, Metamarkets
Andrew Mendelsohn, SVP, Oracle Server Technologies
Mike Olson, CEO, Cloudera
Jay Parikh, VP Infrastructure Engineering, Facebook
John Schroeder, CEO, MapR

CDH4 and Cloudera Enterprise 4.0 Now Available

I’m very pleased to announce the immediate General Availability of CDH4 and Cloudera Manager 4 (part of the Cloudera Enterprise 4.0 subscription).  These releases are an exciting milestone for Cloudera customers, Cloudera users and the open source community as a whole.

Functionality
Both CDH4 and Cloudera Manager 4 are chock-full of new features. Many of them will appeal to enterprises looking to move more important workloads onto the Apache Hadoop platform. CDH4 includes high availability for the filesystem, the ability to support multiple namespaces, Apache HBase table- and column-level security, improved performance, HBase replication, and greatly improved usability and browser support for the Hue web interface. Cloudera Manager 4 includes multi-cluster and multi-version support, automation for high availability and MapReduce2, multi-namespace support, cluster-wide heatmaps, host monitoring, and automated client configurations.

Hue 2.0 Packs New Features in a Refreshing UI

Hue 2.0.1 has just been released. Version 2.0.1 represents a major improvement over the Hue 1.x series. To list a few key new features:

Online Apache HBase Backups with CopyTable

CopyTable is a simple Apache HBase utility that, unsurprisingly, can be used for copying individual tables within an HBase cluster or from one HBase cluster to another. In this blog post, we’ll talk about what this tool is, why you would want to use it, how to use it, and some common configuration caveats.

Use cases:

CopyTable is at its core an Apache Hadoop MapReduce job that uses the standard HBase Scan read-path interface to read records from an individual table and the standard HBase Put write-path interface to write them to another table (possibly on a separate cluster). It can be used for many purposes:
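That scan-and-put core can be pictured with a deliberately naive, single-threaded sketch. This is an illustration only, not the tool’s actual MapReduce implementation; the table names are placeholders and the HBase 0.92/0.94-era client API is assumed.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;

    public class NaiveTableCopy {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable source = new HTable(conf, "source_table");    // placeholder names
            HTable destination = new HTable(conf, "dest_table");
            try {
                // Scan every row of the source table (the read path)...
                ResultScanner scanner = source.getScanner(new Scan());
                for (Result result : scanner) {
                    Put put = new Put(result.getRow());
                    for (KeyValue kv : result.raw()) {
                        put.add(kv); // copy every cell as-is
                    }
                    // ...and write it to the destination table (the write path).
                    destination.put(put);
                }
                scanner.close();
            } finally {
                source.close();
                destination.close();
            }
        }
    }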
