Cloudera Engineering Blog · Distribution Posts

If 80% of data is unstructured, is it the exception or a new rule?

Ed Albanese leads business development for Cloudera. He is responsible for identifying new markets, revenue opportunities and strategic alliances for the company.

This week’s announcement about the availability of the Cloudera Connector for IBM Netezza marks a major milestone, but not necessarily the one you might expect. It’s not just the delivery of a useful software component; it’s also the introduction of a new generation of data management architectures. For literally decades, data management architecture consisted of an RDBMS, a BI tool and an ETL engine. Those three components assembled together gave you a bona fide data management environment. That architecture has been relevant for long enough to withstand the onslaught of data driven by the introduction of ERP, the rise and fall of client/server and several versions of web architecture. But the machines are unrelenting. They keep generating data. And there’s not just more of it; there is more you can, and often need to, do with it.

The times they are a-changin’, and unstructured data is taking over

Reflections from Enzee Universe 2011

Bala Venkatrao is the director of product management at Cloudera.

I had the pleasure of attending the Enzee Universe 2011 User Conference this week (June 20-22) in Boston. The conference was very well organized and drew well over 1,000 attendees, many of whom lead the data warehouse and data management functions for their companies. This was Netezza’s largest conference so far in seven years. Netezza is known for enterprise data warehousing, and in fact, they pioneered the concept of the data warehouse appliance. Netezza is a success story: since its founding in 2000, Netezza has seen steady growth in customers and revenues, and last year (2010) IBM acquired Netezza for a whopping $1.7B.

CDH 3 Demo VM installation on Mac OS X using VirtualBox

The first task is to ensure that your system is up-to-date.

This procedure has been tested on the following configuration:

CDH3 goes GA

I am very pleased to announce the general availability of Cloudera’s Distribution including Apache Hadoop, version 3. We’ve been working on this release for more than a year — our initial beta release was on March 24 of 2010, and we’ve made a number of enhancements to the software in the intervening months. This release is the culmination of that long process. It includes the hard work of the broad Apache Hadoop community and the entire team here at Cloudera.

We’ve done three things in this release that I’m particularly proud of.

Supported Operating Systems in CDH3

While Cloudera’s Distribution including Apache Hadoop (CDH) operating system support is covered in the documentation, we thought a quick overview of the changes in CDH3 would be helpful to highlight before CDH3 goes stable. CDH3 supports both 32-bit and 64-bit packages for Red Hat Enterprise Linux 5 and CentOS 5. A significant addition in CDH3 Beta 4 was 64-bit support for SUSE Linux Enterprise Server 11 (SLES 11). CDH3 also supports both 32-bit and 64-bit packages for the two most recent Ubuntu releases: Lucid (10.04 LTS) and Maverick (10.10). As of Beta 4, CDH3 no longer contains packages for Debian Lenny, Ubuntu Hardy, Jaunty, or Karmic. Check out these upgrade instructions if you are using an Ubuntu release past its end of life. If you are using a release for which Cloudera’s Debian or RPM packages are not available, you can always use the tarballs from the CDH download page. If you have any questions, you can reach us on the CDH user list.

Cloudera and Pentaho team up to simplify data management and business intelligence

Webinar: December 9, 2010, 11am PT, 2pm ET

Guest post by Thomas J. Wilson, president of Unisphere Research, which produces custom research projects in conjunction with many of the leading data management and IT user groups, as well as with other industry communities, including the subscribers of Database Trends and Applications magazine.

Lessons learned putting Hadoop into production

Migrating to CDH

With the recent release of CDH3b2, many users are more interested than ever in trying out Cloudera’s Distribution for Hadoop (CDH). One of the questions we often hear is, “What does it take to migrate?”

Why Migrate?

If you’re not familiar with CDH3b2, here’s what you need to know.

What’s New in CDH3b2: Sqoop

Cloudera customers usually have two major sources of data: log files, which can be imported to Hadoop via Flume, and relational databases. Throughout the previous releases of CDH2 and CDH3, Cloudera has included a package we’ve developed called Sqoop. Sqoop can perform batch imports and exports between relational databases and Hadoop, storing data in HDFS and creating Hive tables to hold results. We described its motivation and some use cases in a previous blog post. In CDH3b2, we’ve included a greatly expanded version of Sqoop, which has had a major overhaul since previous releases. This version is important enough that we’re deeming it the “1.0” release of Sqoop. In this blog post we’ll cover the highlights of the new features available in Sqoop.

New Interface

The biggest change you’ll notice is that the Sqoop command-line interface has completely changed. Users who have been embedding Sqoop in scripts may be frustrated by this incompatible change, but we think that given the amount of functionality available in Sqoop now, some refactoring is necessary, and this is the correct opportunity to do it. Sqoop is now arranged as a set of tools. If you type sqoop help, you’ll see the list of tools available. Most of the original functionality is contained in a tool called import; running sqoop help import will list the options available to this tool.
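For scripts that embed Sqoop, the tool-based layout described above can be sketched as a small argument builder. This is only an illustration, not Sqoop's own API: the function names, the JDBC URI and the table name are hypothetical, and the --connect, --table and --hive-import options are assumed to be available in your Sqoop version (running sqoop help import should confirm them).

```python
import shlex


def sqoop_help_command(tool=None):
    """Build the argv for `sqoop help`, or `sqoop help <tool>` when a tool is named."""
    argv = ["sqoop", "help"]
    if tool:
        argv.append(tool)
    return argv


def sqoop_import_command(connect_uri, table, hive_import=False):
    """Build the argv for the `import` tool (option names are assumptions; verify with `sqoop help import`)."""
    argv = ["sqoop", "import", "--connect", connect_uri, "--table", table]
    if hive_import:
        # Ask Sqoop to also create a Hive table for the imported data.
        argv.append("--hive-import")
    return argv


def to_shell(argv):
    """Render an argv list as a safely quoted shell command line."""
    return " ".join(shlex.quote(arg) for arg in argv)


if __name__ == "__main__":
    # Hypothetical database URI and table name, for illustration only.
    print(to_shell(sqoop_help_command()))
    print(to_shell(sqoop_help_command("import")))
    print(to_shell(sqoop_import_command("jdbc:mysql://db.example.com/corp", "employees", hive_import=True)))
```

Building the command as a list keeps the tool name (help, import) separate from its options, which mirrors the new interface and makes it easy to hand the list to subprocess.run once Sqoop is installed.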

Improved Export Performance

CDH2 Update 1 Now Available

Cloudera is happy to announce the availability of the first update to version 2 of our distribution for Hadoop. While major new features are planned for our release of version 3 we will regularly update version 2 with improvements and bug fixes. Check out the change log and release notes for details. You can find the packages and tarballs on our website, or simply update if you are already using our yum and apt repositories.

A notable addition in update 1 is a FUSE package for HDFS. This package allows you to easily mount HDFS as a standard file system for use with traditional Unix utilities. Check out the Mountable HDFS section in the CDH docs and the hadoop-fuse-dfs manpage for details.
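Once HDFS is mounted, ordinary file APIs should work against it just as the Unix utilities do. The sketch below assumes a hypothetical mount point such as /mnt/hdfs that has already been mounted by following the CDH docs; the function name is ours, and the code uses only the Python standard library, so it behaves the same on any directory.

```python
import os
import tempfile


def summarize_directory(mount_point, subdir=""):
    """Return (name, size_in_bytes) pairs for regular files under mount_point/subdir."""
    path = os.path.join(mount_point, subdir)
    summary = []
    for name in sorted(os.listdir(path)):
        full_path = os.path.join(path, name)
        if os.path.isfile(full_path):
            summary.append((name, os.path.getsize(full_path)))
    return summary


if __name__ == "__main__":
    # Stand-in for a mounted HDFS directory (e.g. a hypothetical /mnt/hdfs),
    # so the sketch runs without a cluster.
    with tempfile.TemporaryDirectory() as fake_mount:
        with open(os.path.join(fake_mount, "part-00000"), "w") as handle:
            handle.write("hello\n")
        for name, size in summarize_directory(fake_mount):
            print(name, size)
```

Pointing summarize_directory at the real mount point instead of the temporary directory is the only change needed to run it against HDFS.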
