Cloudera Blog · Distribution Posts

If 80% of data is unstructured, is it the exception or a new rule?

Ed Albanese leads business development for Cloudera. He is responsible for identifying new markets, revenue opportunities and strategic alliances for the company.

This week’s announcement of the availability of the Cloudera Connector for IBM Netezza marks a major milestone, but not necessarily the one you might expect. It’s not just the delivery of a useful software component; it’s also the introduction of a new generation of data management architectures. For literally decades, data management architecture consisted of an RDBMS, a BI tool and an ETL engine. Those three components assembled together gave you a bona fide data management environment. That architecture has remained relevant long enough to withstand the onslaught of data driven by the introduction of ERP, the rise and fall of client/server and several versions of web architecture. But the machines are unrelenting. They keep generating data. And there’s not just more of it; there is more you can, and often need to, do with it.

The times they are a-changin’, and unstructured data is taking over

Companies of all sizes and in nearly every vertical are increasingly tasked with decoding the information being generated by the machines they rely on most. New data sources are creating new data types, including web data, clickstreams, location data, point of sale, social data, building sensors, vehicle and aircraft data, satellite images, medical images, log files, network data and weather data… just to name a few. These data sources were but a glimmer in the eyes of the forefathers of the RDBMS and were most certainly not accounted for in its design. And yet, the percentage of data that fits into this newer bucket is growing at astounding rates. While at Netezza’s Enzee event this week, I listened to Steve Mills, IBM Senior Vice President and Group Executive for the Software Group, state that more than 80% of the world’s data is unstructured.

So what to do with all of this data?

Reflections from Enzee Universe 2011

Bala Venkatrao is the director of product management at Cloudera.

I had the pleasure of attending the Enzee Universe 2011 User Conference this week (June 20-22) in Boston. The conference was very well organized and drew well over 1,000 attendees, many of whom lead the Data Warehouse/Data Management functions for their companies. This was Netezza’s largest conference so far in seven years. Netezza is known for enterprise data warehousing, and in fact, they pioneered the concept of the data warehouse appliance. Netezza is a success story: since its founding in 2000, Netezza has seen steady growth in customers and revenues, and last year (2010), IBM acquired Netezza for a whopping $1.7B.

Cloudera announced a partnership with Netezza last year, and since then the two companies have been working closely to build a high-speed bi-directional connector between Netezza DW appliances and Apache Hadoop. We launched the general availability of this connector this week at Enzee Universe 2011. You can download the connector here: https://ccp.cloudera.com/display/SUPPORT/Downloads. Thank you to the teams at Netezza and Cloudera for making this happen. It’s been a great collaboration!

CDH 3 Demo VM installation on Mac OS X using VirtualBox

The first task is to ensure that your system is up-to-date.

This procedure has been tested on the following configuration:

CDH3 goes GA

I am very pleased to announce the general availability of Cloudera’s Distribution including Apache Hadoop, version 3. We’ve been working on this release for more than a year — our initial beta release was on March 24 of 2010, and we’ve made a number of enhancements to the software in the intervening months. This release is the culmination of that long process. It includes the hard work of the broad Apache Hadoop community and the entire team here at Cloudera.

We’ve done three things in this release that I’m particularly proud of.

First, we’ve produced what we believe the community and the industry need: A complete Hadoop-based stack for data storage and analysis.

Supported Operating Systems in CDH3

While Cloudera’s Distribution including Apache Hadoop (CDH) operating system support is covered in the documentation, we thought a quick overview of the changes in CDH3 would be helpful before CDH3 goes stable. CDH3 supports both 32-bit and 64-bit packages for Red Hat Enterprise Linux 5 and CentOS 5. A significant addition in CDH3 Beta 4 was 64-bit support for SUSE Linux Enterprise Server 11 (SLES 11). CDH3 also supports both 32-bit and 64-bit packages for the two most recent Ubuntu releases: Lucid (10.04 LTS) and Maverick (10.10). As of Beta 4, CDH3 no longer contains packages for Debian Lenny, Ubuntu Hardy, Jaunty, or Karmic. Check out these upgrade instructions if you are using an Ubuntu release past its end of life. If you are using a release for which Cloudera’s Debian or RPM packages are not available, you can always use the tarballs from the CDH download page. If you have any questions, you can reach us on the CDH user list.

Cloudera and Pentaho team up to simplify data management and business intelligence

Webinar: December 9, 2010, 11am PT, 2pm ET

Guest post by Thomas J. Wilson, president of Unisphere Research, which produces custom research projects in conjunction with many of the leading data management and IT user groups, as well as with other industry communities, including the subscribers of Database Trends and Applications magazine.

We conduct a lot of research among data architects, database professionals, business intelligence specialists and development professionals for Database Trends and Applications. One thing is becoming clearer by the day: data proliferation is taking an enormous toll on IT budgets as well as IT staff time. These burdens are a large area of concern and stress for IT departments.

Lessons learned putting Hadoop into production

Webinar: December 8th, 10-11:00am PT, 1-2:00pm ET

Presenter: Eric Sammer, Cloudera Solution Architect

Many Apache Hadoop deployments begin as small test clusters, either as an electronic sandbox for analyzing data in new ways or as a way to solve a small, specific business problem. Typically, as more use cases are discovered, more data is loaded into the cluster. Consequently, the clusters grow to provide expanded capacity to the organization. Often one or more of the use cases provides insight that is critical to the efficient operation of the business and eventually creates a need for a full-scale production Hadoop system. As the clusters grow and the business becomes more dependent on the results, challenges begin to arise in many aspects of deployment, from configuring and installing to monitoring and managing the daily operations of the cluster.

Migrating to CDH

With the recent release of CDH3b2, many users are more interested than ever in trying out Cloudera’s Distribution for Hadoop (CDH). One of the questions we often hear is, “What does it take to migrate?”

Why Migrate?

If you’re not familiar with CDH3b2, here’s what you need to know.

All versions of CDH provide:

What’s New in CDH3b2: Sqoop

Cloudera customers usually have two major sources of data: log files, which can be imported to Hadoop via Flume, and relational databases. Throughout the previous releases of CDH2 and CDH3, Cloudera has included a package we’ve developed called Sqoop. Sqoop can perform batch imports and exports between relational databases and Hadoop, storing data in HDFS and creating Hive tables to hold results. We described its motivation and some use cases in a previous blog post. In CDH3b2, we’ve included a greatly expanded version of Sqoop which has had a major overhaul since previous releases. This version is important enough that we’re deeming it the “1.0” release of Sqoop. In this blog post we’ll cover the highlights of the new features available in Sqoop.

New Interface

The biggest change you’ll notice is that the Sqoop command-line interface has completely changed. Users who have been embedding Sqoop in scripts may be frustrated by this incompatible change, but we think that given the amount of functionality available in Sqoop now, some refactoring is necessary, and this is the correct opportunity to do it. Sqoop is now arranged as a set of tools. If you type sqoop help, you’ll see the list of tools available. Most of the original functionality is contained in a tool called import; running sqoop help import will list the options available to this tool.
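For scripts that embed Sqoop, the tool-based layout described above can be sketched as a small helper that always puts the tool name first. This is only an illustration: the helper name sqoop_argv, the JDBC URL and the table name are assumptions made for this example, and the commands are built as argument lists rather than executed.

```python
import shlex


def sqoop_argv(tool, *options):
    """Build an argument list of the form: sqoop <tool> [options...]."""
    if not tool or tool.startswith("-"):
        raise ValueError("the tool name must come first, before any options")
    return ["sqoop", tool, *options]


# `sqoop help` lists the available tools;
# `sqoop help import` lists the options of the import tool.
list_tools = sqoop_argv("help")
import_help = sqoop_argv("help", "import")

# Hypothetical import of a single table. The connect string and the
# table name are placeholders, not values from the original post.
import_cmd = sqoop_argv(
    "import",
    "--connect", "jdbc:mysql://db.example.com/sales",
    "--table", "orders",
)

# Print a copy-pastable shell command for inspection.
print(shlex.join(import_cmd))
```

Keeping the command as a list makes it easy to hand to subprocess.run later without shell quoting problems.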

Improved Export Performance

In CDH3b1 we provided basic support for exports: the ability to take results from HDFS and insert them back into a database. CDH3b2 features a completely rewritten export pipeline which demonstrates considerably greater throughput and scalability. You can now export gigabytes of data with high performance. For MySQL users, we’ve added a separate “direct mode” channel that uses mysqlimport to perform this job even faster.
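As a rough sketch of what an export invocation might look like from a script, the helper below assembles an argument list for the export tool and optionally requests direct mode. The function name, connect string, table and HDFS path are hypothetical; restricting direct mode to MySQL connect strings is a guard added for this sketch, based on the description above, and is not a statement about how Sqoop itself validates its arguments.

```python
def sqoop_export_argv(connect, table, export_dir, direct=False):
    """Build an argument list for a Sqoop export from HDFS into a database table."""
    argv = [
        "sqoop", "export",
        "--connect", connect,        # JDBC connect string of the target database
        "--table", table,            # table that receives the rows
        "--export-dir", export_dir,  # HDFS directory holding the results
    ]
    if direct:
        # Assumption of this sketch: only allow direct mode for MySQL,
        # since the text describes the mysqlimport-based channel for MySQL users.
        if not connect.startswith("jdbc:mysql:"):
            raise ValueError("direct mode is only sketched here for MySQL")
        argv.append("--direct")
    return argv


# Placeholder values for illustration only.
export_cmd = sqoop_export_argv(
    "jdbc:mysql://db.example.com/reports",
    "daily_totals",
    "/user/etl/daily_totals",
    direct=True,
)
```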

Large Object Support

CDH2 Update 1 Now Available

Cloudera is happy to announce the availability of the first update to version 2 of our distribution for Hadoop. While major new features are planned for our release of version 3, we will regularly update version 2 with improvements and bug fixes. Check out the change log and release notes for details. You can find the packages and tarballs on our website, or simply update if you are already using our yum and apt repositories.

A notable addition in update 1 is a FUSE package for HDFS. This package allows you to easily mount HDFS as a standard file system for use with traditional Unix utilities. Check out the Mountable HDFS section in the CDH docs and the hadoop-fuse-dfs manpage for details.
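Once HDFS is mounted as a standard file system, ordinary file APIs should work against it just as Unix utilities do. The sketch below lists regular files under a mount point using only the Python standard library. The function name is an assumption, and a temporary directory stands in for a real mount point (something like /mnt/hdfs, a hypothetical path) so the example runs without a cluster.

```python
import os
import tempfile


def list_files(mount_point):
    """Return sorted (name, size_in_bytes) pairs for regular files directly under mount_point."""
    entries = []
    with os.scandir(mount_point) as it:
        for entry in it:
            if entry.is_file():
                entries.append((entry.name, entry.stat().st_size))
    return sorted(entries)


# Demo: a temporary directory plays the role of the mounted file system.
with tempfile.TemporaryDirectory() as demo_mount:
    with open(os.path.join(demo_mount, "part-00000"), "wb") as f:
        f.write(b"a\tb\n")  # one tab-separated record, 4 bytes
    demo_listing = list_files(demo_mount)
```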

We appreciate feedback! Get in touch with us on Get Satisfaction, Twitter and IRC (#cloudera on freenode.net) and let us know how the update is working for you.
