Just today we heard another question about integrating Apache Hadoop with Business Intelligence tools. This is one of the most common questions we receive from enterprises adopting or evaluating Hadoop. In the early stages of their projects, customers are generally not sure how to connect their BI tools to Hadoop, and when it makes sense to do so. As I wrote in BI Considerations and Hadoop Part 1, Cloudera encourages you to use your existing infrastructure wherever possible, and this includes your investments in Business Intelligence.
BI tools were traditionally designed for small volumes of structured data, whereas Hadoop generally stores data in complex formats at scale and processes it on read using MapReduce. We give our customers recommendations for when and how to integrate Hadoop with their existing Business Intelligence environment, as well as when organizations should look to new tools to solve a new class of problem. Here are some questions to consider when determining which tools to use:
- Are you dealing with a technically difficult or intractable problem?
In the traditional Business Intelligence world, transactional data is stored in a database and then periodically loaded into a warehouse for query and analysis. The warehouse is designed and implemented ahead of time to facilitate a specific set of reports and ad hoc queries. This model breaks if your data sets are growing faster than the ETL jobs that collect them. It also breaks if you don’t yet understand how you will use the data at the time you collect it.
Do you need to process data sets that are growing faster than your ability to transform them? Is it too expensive to model these data sets relationally? Where is the data growing? If it’s growing around complex data types, then a relational database is probably not the best place to ask questions that span complex and relational data.
- Does most of your data conform to a known schema?
If you have a known schema, or you have the time to model a new schema as well as develop the required ETL to populate that schema, Hadoop might not be a requirement for you. However, we find that increasingly our customers are coming to us because they have exploding volumes of data in complex formats and don’t know how to model an appropriate schema. They know there is value in this data, but they do not know where it is. If you are confronted with that situation, perhaps it’s best to stand up a Hadoop cluster alongside your data warehouse for complex data analysis. After you have both running together, then your primary challenge will be integrating the two in a way that makes sense. Cloudera is committed to providing the tools and infrastructure required to help address this challenge.
- Do you require real-time analysis or will batch-analysis suffice?
Although there is plenty of work being done to make Hadoop more real-time and low-latency in projects such as HBase, Hadoop was designed for batch processing of complex data types at scale. If your business requires ad-hoc exploration of structured data, then a traditional OLAP approach or an Analytic DBMS might be best. Organizations still need to collect, use, and present relational data for real-time online analysis. If this is the case, that data should absolutely be properly governed and moved into a data warehouse. Hadoop is where organizations can start to collect atomic raw data and ask new questions without increasing the expense of their data warehouse. If you don’t necessarily have defined dimensions and clear facts in your data, and you need to identify trends, then you will want to look at adopting a new interface to your Hadoop cluster. If you do adopt Hadoop and still have a requirement for real-time ad-hoc query of data, then you need to talk to us about how to populate a data mart, OLAP cube, or ADBMS from the output of a MapReduce job.
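To make that handoff concrete, here is a minimal Python sketch of the pattern: a batch aggregation (standing in for the output of a MapReduce job) whose rolled-up results are bulk-loaded into a relational table for ad-hoc SQL query. The event data, table name, and columns are all hypothetical, and SQLite stands in for your data mart or ADBMS; in practice the aggregation would run as a MapReduce job and the load would use your warehouse’s bulk-loading tools.

```python
import sqlite3
from collections import defaultdict

# Raw "atomic" event records, as a batch job might read them from HDFS
# (hypothetical data for illustration).
raw_events = [
    ("2010-03-01", "page_view", 1),
    ("2010-03-01", "purchase", 1),
    ("2010-03-02", "page_view", 1),
    ("2010-03-02", "page_view", 1),
]

# Batch aggregation step (the "reduce"): roll events up by day and type.
totals = defaultdict(int)
for day, event_type, count in raw_events:
    totals[(day, event_type)] += count

# Load the post-aggregation output into a relational store for ad-hoc query.
# SQLite stands in for a data mart or ADBMS here (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_counts (day TEXT, event_type TEXT, total INTEGER)")
conn.executemany(
    "INSERT INTO daily_counts VALUES (?, ?, ?)",
    [(day, et, n) for (day, et), n in totals.items()],
)

# BI tools can now run familiar SQL against the summarized data.
rows = conn.execute(
    "SELECT day, SUM(total) FROM daily_counts GROUP BY day ORDER BY day"
).fetchall()
print(rows)  # → [('2010-03-01', 2), ('2010-03-02', 2)]
```

The key design point is that the expensive scan over raw data happens once, in batch, and only the small summarized result lands in the relational store where interactive query is cheap.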
- Do you require a flexible data processing methodology, or is it more important to you that your data fit into a rigid, well-understood format at the time that you model it?
Hadoop provides the ability to postpone formalizing data until you query it. Because Hadoop keeps data local to processing, you can scan, extract, and transform the data at query time. Some problems that are modeled as star schemas in a data warehouse are easier in Hadoop. For example, rather than modeling dimensions and facts that support slowly changing dimensions and developing ETL to manage them, Hadoop can store historical data in its original form and provide access on demand. This makes the data warehouse simpler and reduces the complexity of your data management environment!
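As a small illustration of this schema-on-read idea, the Python sketch below keeps raw records in their original form and applies a structure to them only at query time. The log format and field names are hypothetical; the point is that no schema or ETL had to be designed before the data was collected.

```python
# Schema-on-read sketch: raw records stay in their original form and are
# parsed only when a question is asked (hypothetical log format).
raw_log = [
    "2010-03-01 alice login",
    "2010-03-01 bob login",
    "2010-03-02 alice purchase",
]

def query(lines, predicate):
    """Scan, extract, and transform at query time instead of at load time."""
    for line in lines:
        day, user, action = line.split()  # schema applied on read
        record = {"day": day, "user": user, "action": action}
        if predicate(record):
            yield record

# Ask a question the original collection process never anticipated.
logins = [r["user"] for r in query(raw_log, lambda r: r["action"] == "login")]
print(logins)  # → ['alice', 'bob']
```

A new question just means a new predicate or a new parse, not a schema change and a reworked ETL pipeline; the trade-off is that every query pays the cost of scanning and parsing, which is why this suits batch analysis at scale.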
- How nimble is your organization? Do you have requirements around processes and governance that you must adhere to?
If process, governance, and compliance are requirements, then it will most likely be worth the increased investment in your data warehouse and BI infrastructure to model and transform data directly into a relational store. Rather than trying to integrate with data stored in Hadoop, think about processing data within Hadoop and then moving the post-aggregation data into a data warehouse that meets your governance needs.
When we’re asked whether Hadoop is intended to replace or supplement a BI system, the answer isn’t either/or. Cloudera wants you to leverage as much of your existing infrastructure as makes sense. We are also making improvements to Hadoop to make it more familiar to someone accustomed to BI systems. Likewise, we are rolling out new tools to help existing BI systems interface with Hadoop. While Hadoop is a disruptive technology in that it allows you to store and process data without rigidly modeling it first, it is not so disruptive as to require a new front-end to your data.
There are a lot of situations that are going to be unique to your organization. While there are probably some cases where a new presentation layer is required, we’d encourage you to exhaust your existing options before concluding that you need to turn away from your preexisting investments or adopt new applications that you may not really need.