Considerations for Apache Hadoop and BI (part 1 of 2)
We recently met with a customer at Cloudera’s new offices and asked if he had any specific use cases in mind for the Apache Hadoop cluster that we are helping him to roll out. He replied, quite honestly, that he didn’t know. His baseline understanding is that there is value in the data that his organization is collecting today, but he’s not sure where it is. He said, “I would like to have all of this data stored forever” and then proceeded to explain that as his business expands and matures, he wants the ability to go back and analyze this data in ways he cannot foresee today.
This is an ideal use case for Hadoop and a prime example of why Hadoop is such a disruptive technology. Historically, before analyzing any data set, organizations needed to model and transform the data. This requires a lot of effort to make sure the data is properly loaded, correctly structured, well-defined and typed, complete, and conformant to organizational standards. Moreover, the organization must design, model and expose corresponding metadata using business intelligence and analysis tools. The key to making this all work is that an organization has not only a good understanding of its current business needs, but ideally a pretty good view of what it is going to want down the road. As data volumes grow, business needs change, and data comes from unexpected places, it becomes more difficult to keep up the transformation, structuring and modeling for all this new data.
Hadoop is a platform for capturing all this data at a very low cost per byte in raw form, before it is transformed and structured. Hadoop is used to capture and consolidate data before modeling it to fit any given process, as well as to keep data that has already been processed. With Hadoop, data is structured as it is accessed. This means that when new data is introduced or as the relationships change, new queries can extract value from old and new data alike without requiring a complete redesign. A MapReduce job can efficiently churn through very large volumes of complex data and extract the needed intelligence. This freedom from constraint at load time can revolutionize an organization’s relationship with and understanding of its data.
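To make "structured as it is accessed" concrete, here is a minimal sketch in plain Python. It is not Hadoop code: the log format, the sample lines and the function names are all assumptions made up for illustration, and run_job is only a tiny in-memory stand-in for the map, group and reduce phases of a MapReduce job. The point is that the raw lines are stored once, untouched, and each question brings its own structure at read time.

```python
from collections import defaultdict

# Hypothetical raw records, kept exactly as they arrived (no schema at load time).
RAW_LINES = [
    "2010-03-01 alice GET /products/42 200",
    "2010-03-01 bob GET /cart 500",
    "2010-03-02 alice POST /cart 200",
    "2010-03-02 carol GET /products/42 200",
]


def map_hits_per_path(line):
    """Map step for one question: parse the line on read and emit (path, 1)."""
    _date, _user, _method, path, _status = line.split()
    yield path, 1


def map_errors_per_day(line):
    """Map step for a different, later question over the same raw data."""
    date, _user, _method, _path, status = line.split()
    if status.startswith("5"):
        yield date, 1


def run_job(lines, mapper):
    """In-memory stand-in for a MapReduce job: map, group by key, sum per key."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            grouped[key].append(value)
    return {key: sum(values) for key, values in grouped.items()}


hits = run_job(RAW_LINES, map_hits_per_path)
errors = run_job(RAW_LINES, map_errors_per_day)
print(hits)
print(errors)
```

Neither question required reloading or remodeling the data. The second mapper could have been written months after the lines were captured.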
We find that as customers start understanding this revolution and begin exploring new relationships within their data, they’re struck by a common set of observations and subsequent questions. Although loading data is easier in Hadoop, without the need to model complex data from heterogeneous sources, our customers start to become concerned that writing MapReduce jobs, Pig scripts and Hive queries is harder than using existing tools on pre-modeled data. Because the BI ecosystem is more mature, today’s analysts have a rich interface to their data using a variety of means. Whether using an OLAP tool that can slice and dice data, a reporting framework for building and consuming dashboards, or Microsoft Excel for constructing spreadsheets out of human-sized chunks of the data, these tools make structured data much more accessible. Customers also start asking questions about how Hadoop output should be consumed. Should it be consumed using existing BI infrastructure, or is their relationship with their data now so fundamentally different that they need to start anew with a fresh set of applications for analyzing their data? We’re often asked if it’s a goal of Cloudera’s to replace relational databases and business intelligence tools.
Hadoop is not a wholesale replacement for BI and we’re not replacing relational databases. Cloudera is helping our customers marry the power of Hadoop to existing tools. While the storage and analysis of data using Hadoop differs from the highly structured formats preferred by RDBMSs and BI tools, Hadoop does not require throwing out existing business intelligence suites in order to perform meaningful analysis on data. Sometimes, our customers use Hadoop to scale over large or loosely structured data sets and then export the results as needed into an existing RDBMS. Business users are not eager to learn new tools for analysis, and by integrating Hadoop, Cloudera helps shield users from this complexity. Business users continue to access processed data using well-defined and established business intelligence frameworks.
But export to an RDBMS is not always the appropriate result of a MapReduce job. The existing BI infrastructure is not the only means of accessing data in Hadoop, and new tools are not always complex or foreign. It becomes a matter of situation and preference. Hive is a SQL engine that compiles queries, written in a familiar language with familiar constructs, into MapReduce programs. Pig Latin was developed at Yahoo! as a new way to address complex data transformations procedurally without having to write Java MapReduce code. While both are useful abstractions of MapReduce, they reflect different ways of thinking about data that correspond to different business problems.
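The difference between the two ways of thinking can be sketched in plain Python. This is an analogy only, not Hive or Pig: the standard library's sqlite3 module stands in for a SQL engine, a chain of named steps stands in for a procedural data flow, and the records, table name and function names are hypothetical. Both functions answer the same question (successful hits per path).

```python
import sqlite3
from itertools import groupby

# Hypothetical, already-parsed records: (user, path, status).
RECORDS = [
    ("alice", "/products/42", 200),
    ("bob", "/cart", 500),
    ("alice", "/cart", 200),
    ("carol", "/products/42", 200),
]


def declarative_counts(records):
    """SQL style: describe the result and let the engine plan the work."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE hits (user TEXT, path TEXT, status INTEGER)")
    conn.executemany("INSERT INTO hits VALUES (?, ?, ?)", records)
    rows = conn.execute(
        "SELECT path, COUNT(*) FROM hits WHERE status = 200 GROUP BY path"
    ).fetchall()
    conn.close()
    return dict(rows)


def procedural_counts(records):
    """Data-flow style: spell out each transformation step in order."""
    ok = [r for r in records if r[2] == 200]        # filter
    ordered = sorted(ok, key=lambda r: r[1])        # sort so groupby sees each path once
    grouped = groupby(ordered, key=lambda r: r[1])  # group by path
    return {path: sum(1 for _ in rows) for path, rows in grouped}  # count per group


print(declarative_counts(RECORDS))
print(procedural_counts(RECORDS))
```

The SQL version is shorter when the question fits a query. The step-by-step version makes intermediate results visible and is easier to extend when a transformation does not map cleanly onto SQL.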
Our customers now face an interesting set of challenges: understanding when and how to integrate Hadoop with the existing business intelligence environment and when to look to new tools to solve a new class of problem. In our next post we look at the top five questions to ask when deciding how and when to use Hadoop.