Considerations for Apache Hadoop and BI (part 2 of 2)

Categories: General Hadoop

Just today we heard another question about integrating Apache Hadoop with Business Intelligence tools. This is one of the most common questions we receive from enterprises adopting or evaluating Hadoop. In the early stages of their projects, customers are generally not sure how to connect their BI tools to Hadoop, and when it makes sense to do so. As I wrote in BI Considerations and Hadoop Part 1, Cloudera encourages you to use your existing infrastructure wherever possible, and this includes your investments in Business Intelligence.

BI tools were traditionally designed for small volumes of structured data, whereas Hadoop generally stores data in complex formats at scale and processes it on read using MapReduce. We give our customers recommendations for when and how to integrate Hadoop with their existing Business Intelligence environment, as well as when organizations should look to new tools to solve a new class of problem. Here are some questions to consider when determining which tools to use:

  1. Are you dealing with a technically difficult or intractable problem?
    In the traditional Business Intelligence world, transactional data is stored in a database and then periodically loaded into a warehouse for query and analysis. The warehouse is designed and implemented ahead of time to facilitate a specific set of reports or ad hoc queries. This model breaks if your data sets are growing faster than the ETL jobs that collect them can keep up. It also breaks if you don’t yet understand how you will use the data at the time you collect it.
    Do you need to process data sets that are growing faster than your ability to transform them? Is it too expensive to model these data sets relationally? Where is the data growing? If it’s growing around complex data types, then a relational database is probably not the best place to ask questions that span complex and relational data.
  2. Does most of your data conform to a known schema?
    If you have a known schema, or you have the time to model a new schema as well as develop the required ETL to populate that schema, Hadoop might not be a requirement for you. However, we find that increasingly our customers are coming to us because they have exploding volumes of data in complex formats and don’t know how to model an appropriate schema. They know there is value in this data, but they do not know where it is. If you are confronted with that situation, perhaps it’s best to stand up a Hadoop cluster alongside your data warehouse for complex data analysis. After you have both running together, your primary challenge will be integrating the two in a way that makes sense. Cloudera is committed to providing the tools and infrastructure required to help address this challenge.
  3. Do you require real-time analysis or will batch analysis suffice?
    Although there is plenty of work being done around making Hadoop more real-time and low-latency in projects such as HBase, Hadoop was designed for batch processing of complex data types at scale. If your business requires ad hoc exploration of structured data, then a traditional OLAP approach or an Analytic DBMS might be best. Organizations still need to collect, use, and present relational data for real-time online analysis. If this is the case, that data should absolutely be properly governed and moved into a new data warehouse. Hadoop is where organizations can start to collect atomic raw data and ask new questions without increasing the expenses of their data warehouse. If you don’t necessarily have defined dimensions and clear facts in your data, and you need to identify trends, then you want to look at adopting a new interface to your Hadoop cluster. If you do adopt Hadoop and still have a requirement for real-time ad hoc query of data, then you need to talk to us about how to populate a data mart, OLAP cube, or an ADBMS from the output of a MapReduce job.
  4. Do you require a flexible data processing methodology, or is it more important to you that your data fit into a rigid, well-understood format at the time that you model it?
    Hadoop provides the ability to postpone formalizing data until you query it. Because Hadoop keeps data local to processing, you can scan, extract, and transform the data at query time. Some problems that are modeled as star schemas in a data warehouse are easier in Hadoop. For example, rather than model dimensions and facts that support slowly changing dimensions and develop ETL to manage them, you can have Hadoop store historical data in its original form and provide access on demand. This makes the data warehouse simpler and reduces the complexity of your data management environment!
  5. How nimble is your organization? Do you have requirements around processes and governance that you must adhere to?
    If process, governance, and compliance are requirements, then it will most likely be worth the increased investment in your data warehouse and BI infrastructure to model and transform data directly into a relational store. Rather than trying to integrate with data stored in Hadoop, think about processing data within Hadoop and then moving data post-aggregation into a data warehouse that meets your governance needs.
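
Questions 3 through 5 describe a common pattern: keep raw data in its original form, apply structure only when reading it, and move the aggregated output into a relational store. A minimal sketch of that pattern in plain Python follows. It is not Hadoop code: the log format, the `parse` and `aggregate_hits` functions, and the `daily_hits` table are all hypothetical, and an in-memory SQLite database stands in for the data mart or warehouse.

```python
import sqlite3
from collections import Counter

# Hypothetical raw log lines, stored exactly as captured (no schema applied yet).
RAW_LINES = [
    "2010-03-01T10:15:00 GET /products/42 200",
    "2010-03-01T10:16:30 GET /products/42 200",
    "2010-03-01T10:17:02 POST /cart 500",
    "malformed line without enough fields",
    "2010-03-02T09:00:00 GET /products/7 200",
]


def parse(line):
    """Apply a schema at read time; return None for lines that do not fit it."""
    parts = line.split()
    if len(parts) != 4 or not parts[3].isdigit():
        return None
    timestamp, method, path, status = parts
    return {"day": timestamp[:10], "method": method, "path": path, "status": int(status)}


def aggregate_hits(lines):
    """Batch step: count successful (HTTP 200) requests per (day, path)."""
    counts = Counter()
    for line in lines:
        record = parse(line)
        if record is not None and record["status"] == 200:
            counts[(record["day"], record["path"])] += 1
    return counts


def load_into_mart(conn, counts):
    """Load only the aggregated result into a relational table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS daily_hits ("
        "day TEXT, path TEXT, hits INTEGER, PRIMARY KEY (day, path))"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO daily_hits VALUES (?, ?, ?)",
        [(day, path, hits) for (day, path), hits in sorted(counts.items())],
    )
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # stand-in for the warehouse
    load_into_mart(conn, aggregate_hits(RAW_LINES))
    for row in conn.execute("SELECT day, path, hits FROM daily_hits ORDER BY day, path"):
        print(row)
```

In a real deployment the aggregation step would run as a MapReduce job over the raw files, and only its much smaller output would be loaded into the governed relational store. The raw lines stay untouched, so a different question later only needs a different `parse` or aggregation function, not a new ETL pipeline.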

When we’re asked whether Hadoop is intended to replace or supplement a BI system, the answer isn’t either/or. Cloudera wants you to leverage as much of your existing infrastructure as makes sense. We are also making improvements to Hadoop to make it more familiar to someone accustomed to BI systems. Likewise, we are rolling out new tools to help existing BI systems interface with Hadoop. While Hadoop is a disruptive technology in that it allows you to store and to process data without rigidly modeling it first, it is not so disruptive as to require a new front-end to your data.

There are a lot of situations that are going to be unique to your organization. While there are probably some cases where a new presentation layer is required, we’d encourage you to exhaust your existing options before simply concluding that you need to turn away from your preexisting investments, or adopt new applications that you may not really need.


6 responses on “Considerations for Apache Hadoop and BI (part 2 of 2)”

  1. Tadaoki Uesugi

    As you suggested, basically I think that Hadoop is useful when users just try to find some structure or value of their data under the situation that they do not know how to model it or how to make their schema.

    However, in this case, I do not think that most users have to use all of their data. For example, if they have data accumulated over one year, they can do their trial-and-error work using just a small amount of it, such as one week of data.

    In short, what I would like to say is that in their trial-and-error phase most people do not need many servers, even commodity ones, nor do they need a parallel processing system like Hadoop to deal with big data. One or two servers might be sufficient. After they find some key feature or structure to understand their data, they should move their data to their existing BI infrastructure or to Hadoop with Hive/Pig and analyze it.

    What do you think? I hope you could answer me.

  2. Jeff Bean

    Your comment touches on two issues: data sampling and cluster size. Both have to do with how you develop your jobs, which is a somewhat different issue.

    It’s true that a developer doesn’t need the entire data set in order to develop effective queries in Hadoop and that it’s better to work with a smaller data set in initial development and exploration whether you are working with Hadoop or a traditional BI tool.

    However, we do find our customers are using Hadoop to archive data long term in the format in which it is captured, instead of or in addition to transforming it and archiving it into a data warehouse. This is probably due to the low cost of reliable data storage offered with Hadoop.

    Once you have data of that volume sitting around, you can ask questions you weren’t able to ask before. You can take smaller, distributed slices of data from across the set (for example, one week per year over some period of years) and work with it in a smaller development cluster.

    The considerations around latency, governance, data volume growth, and structure remain.

    Hope this makes sense.

  3. Senthil

    This article is very interesting. I have two questions:

    1. As mentioned, BI-Hadoop integration with SQL/JDBC/ODBC seems to be in top demand. Hive’s JDBC interface does not provide enough metadata information. What other options are available?

    2. If export to RDBMS/ADBMS is an option, do we need to keep data in both HDFS and the RDB/ADB (and pay for both)? How do we route low-latency queries to the RDB/ADB and aggregation/batch queries to Hive?

    These two are the main problems I see currently. Again, excellent post from the Cloudera team!

  4. Jeff Bean

    1. JDBC/ODBC/SQL is in top demand because people don’t want to have to switch tools to get their work done. We try to accommodate this as much as possible and things will get better. But there are also some fundamental limitations: Hive metadata isn’t as strict as an RDBMS so it doesn’t make sense to fully encode that in a JDBC driver. Also, the demand just hasn’t come from the open source community yet. The enterprise produces this demand and Cloudera is going to try to meet it.

    2. Great question and it varies from site to site. I’ve seen one customer pipe the same data into both an ADBMS and HDFS, and then expose both to their end users. There’s no real good way to automate this right now: it’s up to the user to know which system is best for their query.

  5. Rich Holoch

    I was the 130th employee at Oracle back in 1984, and am so excited at what Hadoop brings to the table.

    For me, a Kimball Star Schema “disciple” who has also ventured into Teradata with Inmon Data factories, using Hadoop and HDFS plus things like Sqoop and Pig makes me think that Hadoop technologies offer the best possible data staging area I have seen in my career.

    And by using something like MySQL as a metadata store, maybe (for the first time) I can finally switch from ETL to ELT and load from Hadoop HDFS to MySQL or Oracle by GENERATING the load scripts instead of coding them.

    This will be the first time I could do something like add or change a column in a Star Schema and not have to simulate an act of God to make that change.
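
The idea raised in the last comment, generating load statements from a metadata store rather than hand-coding them, can be sketched roughly as follows. This is an illustration under stated assumptions, not a description of any existing tool: the `column_meta` table, its columns, and the `dim_customer` example are hypothetical, and SQLite stands in for the MySQL metadata store mentioned in the comment.

```python
import sqlite3


def create_metadata_store(conn):
    """Create a hypothetical metadata table that describes warehouse columns."""
    conn.execute(
        "CREATE TABLE column_meta ("
        "table_name TEXT, position INTEGER, column_name TEXT, sql_type TEXT)"
    )
    conn.executemany(
        "INSERT INTO column_meta VALUES (?, ?, ?, ?)",
        [
            ("dim_customer", 1, "customer_id", "INTEGER"),
            ("dim_customer", 2, "name", "VARCHAR(100)"),
        ],
    )


def column_names(conn, table_name):
    """Read the ordered column list for one target table from the metadata."""
    rows = conn.execute(
        "SELECT column_name FROM column_meta WHERE table_name = ? ORDER BY position",
        (table_name,),
    )
    return [row[0] for row in rows]


def generate_load_statement(conn, table_name):
    """Build a parameterized INSERT from metadata instead of hand-coding it."""
    names = column_names(conn, table_name)
    column_list = ", ".join(names)
    placeholders = ", ".join("?" for _ in names)
    return f"INSERT INTO {table_name} ({column_list}) VALUES ({placeholders})"


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # stand-in for the metadata database
    create_metadata_store(conn)
    print(generate_load_statement(conn, "dim_customer"))

    # Adding a column is a one-row metadata change; the load statement follows.
    conn.execute(
        "INSERT INTO column_meta VALUES ('dim_customer', 3, 'region', 'VARCHAR(50)')"
    )
    print(generate_load_statement(conn, "dim_customer"))
```

With this approach, a schema change becomes an edit to the metadata rather than to every load script that touches the table.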