If 80% of data is unstructured, is it the exception or a new rule?

Ed Albanese leads business development for Cloudera. He is responsible for identifying new markets, revenue opportunities and strategic alliances for the company.

This week’s announcement about the availability of the Cloudera Connector for IBM Netezza marks a major milestone, but not necessarily the one you might expect. It’s not just the delivery of a useful software component; it’s also the introduction of a new generation of data management architectures. For literally decades, data management architecture consisted of an RDBMS, a BI tool and an ETL engine. Those three components assembled together gave you a bona fide data management environment. That architecture has stayed relevant long enough to withstand the onslaught of data driven by the introduction of ERP, the rise and fall of client/server and several versions of web architecture. But the machines are unrelenting. They keep generating data. And there’s not just more of it; there is more you can, and often need to, do with it.

The times they are a-changin’, and unstructured data is taking over

Companies of all sizes and in nearly every vertical are increasingly tasked with decoding the information being generated by the machines they rely on most. New data sources are creating new data types, including web data, clickstreams, location data, point of sale, social data, building sensors, vehicle and aircraft data, satellite images, medical images, log files, network data and weather data… just to name a few. These data sources were but a glimmer in the eyes of the forefathers of the RDBMS and were most certainly not accounted for in its design. And yet, the percentage of data that fits into this newer bucket is growing at astounding rates. While at Netezza’s Enzee event this week, I listened to Steve Mills, IBM Senior Vice President and Group Executive for the Software Group, state that more than 80% of the world’s data is unstructured.

So what to do with all of this data?

For a large and growing number of companies, the answer lies in adding Apache Hadoop to their data management architecture. That’s right – add, not replace. And that, in and of itself, is significant. “How does Hadoop work with my data warehouse?” is one of the most common questions I get asked at tradeshows. My answer, in short form, is “it allows you to add more data to your data warehouse.” In longer form, I answer with a workflow diagram like the one below. There are four steps in this workflow.

  • Step 1. Data staging and loading. With Hadoop, you don’t need a schema defined before you load data. Just like a file system, you can load data as fast as you can copy it.
  • Step 2. Exploratory analytics. Without moving the data, you can perform analytics using a wide variety of languages (from SQL to Python to Java and C#). You can add a schema – or not. Most customers are analyzing this data to determine its value: is it worth sharing with a wider user population?
  • Step 3. Transform and Data Pipelines. If the data has value to users of existing OLAP or real-time databases, it can be structured – in place – by bringing the processing to the data instead of moving the data to a processing engine. Hadoop is a magnificent processing engine.
  • Step 4. Data doesn’t stay “hot” forever, but that doesn’t mean it should be put in the trash or trucked off to mountains made of iron. Hadoop can store data inexpensively for long periods of time, whether it be raw atomic data or aggregates that have worn out their welcome in traditional RDBMS systems.
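
To make the first three steps concrete, here is a minimal sketch in plain Python. It is not Hadoop code and does not use any Hadoop or connector API; it only illustrates the idea of loading raw data without a schema, applying a schema at read time, and then building an aggregate that a warehouse table could hold. The sample clickstream lines, the field names in SCHEMA and both function names are hypothetical, invented for this illustration.

```python
from collections import Counter

# Step 1: stage raw records untouched. No schema is declared at load time.
# These tab-separated clickstream lines are made-up sample data.
RAW_LINES = [
    "2011-06-20\tuser1\t/home",
    "2011-06-20\tuser2\t/products",
    "2011-06-21\tuser1\t/products",
    "malformed line without tabs",
]

# Step 2: apply a schema only when reading (hypothetical field names).
SCHEMA = ("day", "user", "page")


def read_with_schema(lines, schema):
    """Yield one dict per line that fits the schema; skip lines that do not."""
    for line in lines:
        fields = line.split("\t")
        if len(fields) == len(schema):
            yield dict(zip(schema, fields))


# Step 3: structure the data into an aggregate a warehouse table could hold.
def page_views_per_day(records):
    """Count views per (day, page) pair and return sorted (day, page, count) rows."""
    counts = Counter((r["day"], r["page"]) for r in records)
    return sorted((day, page, n) for (day, page), n in counts.items())


if __name__ == "__main__":
    for row in page_views_per_day(read_with_schema(RAW_LINES, SCHEMA)):
        print(row)
```

Note that the malformed line is still stored in the raw data; it is only skipped when this particular schema is applied. A different schema, added later, could read the same raw data differently, which is the point of not fixing a schema at load time.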

Hadoop lets customers use “the other 80%” of data within their EDW

EDWs like IBM Netezza are tremendous tools. They are whip fast, capable of reliably storing important data and able to serve it to a wide variety of clients using clean, clear interfaces. If the EDW is to maintain its position as the system of record for its customers, it needs to address the 80% of the data being created today that fits into a new category, one the RDBMS was not designed to handle natively. It is this “other 80%” of data that is increasingly the source of competitive advantage or breakthrough insight for the most recognized brands in the world.

And it is the undeniable need for both tools in a modern data management architecture that makes the availability of the Cloudera Connector for IBM Netezza noteworthy. Matt Rollender, a good friend of mine at IBM Netezza and VP of Strategic Integration and Alliances, agrees. At the Enzee event this week, I caught him in the hallways and he told me, “We are seeing multiple source systems, higher volumes and, perhaps most notably, ‘unqualified data’ everywhere. Our customers are interested in finding easier ways to explore, structure and then use that data in IBM Netezza. The Cloudera Connector makes that possible and that’s why it is so exciting.” If you are a user of IBM Netezza, you can begin adding more data to your warehouse immediately. You can do that by adding Apache Hadoop to your data management architecture to collect data as quickly as it is generated, sort out whether it is valuable enough to share more widely, structure it into useful aggregates and then deliver it to IBM Netezza, where end users can explore, discover and report with high usability and reliability. While we are seeing interest across many verticals, Brad Terrell of IBM Netezza has been working with customers in the Media and Entertainment space, where the energy has been especially high, and took the time to reflect on some of that enthusiasm here.

You can download the Cloudera Connector for IBM Netezza here. It is a freely available product that can be used with the equally free Cloudera Distribution including Apache Hadoop (CDH). Both the Cloudera Connector for IBM Netezza and CDH work with Cloudera Enterprise.