Extending the Data Warehouse with Hadoop


“Are data warehouses becoming victims of their own success?” Tony Baer asks in a recent blog post:

“While SQL platforms have steadily increased scale and performance (it’s easy to forget that 30 years ago, conventional wisdom was that they would never scale to support enterprise OLTP systems), the legwork of operating data warehouses is becoming a source of bottlenecks. Data warehouses and transactional systems have traditionally been kept apart because their workloads significantly differed; they were typically kept at arm’s length with separate staging servers in the middle tier, where ETL operations were performed.

Yet, surging data volumes are breaking this pattern. With growing data volumes has come an emerging pattern where data and processing are brought together on the same platform. The “ELT” pattern was thus born based on the notion that collocating transformation operations inside the data warehouse would be more efficient as it would reduce data movements. The downside of ELT, however, is that data transformation compute cycles compete for finite resource with analytics.”

And this competition for resources is only getting worse as data volumes grow and more users demand access to business information. Data warehouses become saturated, critical workloads back up, SLAs are missed, and BI queries take longer. Consumed by batch processing, high-end analytic databases are effectively unable to take on new high-value analytic workloads. The result? A constrained user experience, little room for new projects, and an expensive upgrade path.

Until recently, there hasn’t been a cost-effective solution to these problems. But today, a wide range of customers are using open source Apache Hadoop to rationalize and complement their existing data warehouses, reducing costs, improving performance, and enabling new insights.

Yet as Tony points out, Hadoop’s value has extended beyond just affordable, scalable storage and processing:

“[Hadoop] presents a lower cost target for shifting transform compute cycles. More importantly, it adds new options for analytic processing. With SQL and Hadoop converging, there are new paths for SQL developers to access data in Hadoop without having to learn MapReduce. These capabilities will not eliminate SQL querying to your existing data warehouse, as such platforms are well-suited for routine queries (with many of them carrying their own embedded specialized functions). But they supplement them by providing the opportunity to conduct exploratory querying that rounds out the picture and provides the opportunity to test drive new analytics before populating them to the primary data warehouse.”

Cloudera believes that the future of Hadoop is as a Platform for Big Data that will complement, not replace, existing data management systems, enabling new ways of interacting with large and diverse data sets. Last week, for example, Cloudera announced the general availability of Cloudera Impala, the industry’s first and only open source interactive SQL framework for the Hadoop platform. Through innovations like Impala, Hadoop presents exciting new opportunities for the enterprise.

Want to hear more? Join us for a webinar on May 9 with Tony Baer, Principal Analyst at Ovum, and get insights from the recently published whitepaper, Hadoop: Extending Your Data Warehouse. Upon registration you will get access to the whitepaper.

(5/12/2013 update: This webinar has already taken place, but a replay is available.)