The Platform for Big Data is Here

It has been an exciting couple of days for new product announcements at Cloudera, and especially exciting for me, as the edges of the new platform for big data we have been talking about since Strata + Hadoop World 2012 come into focus.

Yesterday, Cloudera announced a strategic alliance with SAS. SAS is the industry leader in business analytics software, especially predictive analytics. Ninety percent of the Fortune 100 run SAS today. We have been working with SAS to make a number of its products work well with Cloudera, including SAS Access, SAS Visual Analytics, and SAS High Performance Analytics (HPA). SAS HPA is an excellent example of the future direction of Apache Hadoop as a data management platform:

  • Hadoop is a big opportunity for the data science user: no downsampling, unlimited model features, and freedom from the inflexibility of third normal form.
  • MapReduce-based data science has been useful to a point but is limited. Most data science users are familiar with SAS, not MapReduce, and many popular machine learning algorithms simply cannot be implemented in MapReduce.
  • SAS HPA runs natively on CDH, Cloudera’s Distribution Including Apache Hadoop. It leverages the same data on the same cluster that the MapReduce and SQL users use. It adheres to the same security model.

This is a win for the SAS users who might have previously felt alienated from this new Hadoop world, and it’s a win for Hadoop users who can get more value out of the repository of data growing in their clusters.

Today Cloudera is pleased to announce the general availability of Cloudera Impala 1.0: the industry’s first and only open source interactive SQL framework for the Hadoop platform. Since we announced the public beta in October 2012, Impala has made impressive strides. The product has advanced in functionality, quality, and performance. The third-party developer community has made significant new contributions. The user community has grown at a rapid pace. We have also received gratifying feedback from the analyst community. GigaOm Pro recently determined that Impala was the industry’s leading SQL-on-Hadoop offering.

Impala is another excellent proof point of the future of the platform for big data:

  • Hadoop is a big opportunity for the SQL user: explore structured data at full fidelity and granularity, and take advantage of Hadoop’s flexible schema to experiment easily with new data sets.
  • MapReduce-based SQL has been useful to a point but is limited. Most SQL users have been weaned on business intelligence tools that expect interactive SQL, not batch SQL, from the underlying engine. Also, operations that are simple in a database, such as “cancel query,” are not available in the MapReduce paradigm.
  • Impala runs natively on CDH. It leverages the same data on the same cluster that the MapReduce and SAS users use. It adheres to the same security model. It works within the same management framework. It uses the same schema and metadata catalog so objects don’t need to be ETL’ed into Impala for use.

This is a win for SQL users. Many of the most popular business intelligence tools have been tested to run on Impala, and we have been gratified by the feedback from our BI partners on the quality of the experience. It’s also a win for Hadoop users, who can get more value out of the same repository of data.

Over the course of the next week we’ll be adding blog posts that flesh out the technical details of these two frameworks and how they can be used. Some of our partners will be doing the same. I want to emphasize the significance of these developments for customers, users, and partners everywhere. Today we have a scalable, flexible, 100% open source data management platform that lets users bring batch processing, interactive SQL, and math applications to a common repository of data running on industry-standard hardware. These frameworks are truly integrated parts of a larger data management platform, one that requires no costly specialized hardware and no elaborate integration or data replication frameworks.

Hadoop’s versatility has become more important than its scalability and low cost as the principal reason for its growing popularity. The range of workloads and applications now available on Hadoop is broader than that of the legacy data management technologies that most organizations run today. Organizations will continue to use various other data management technologies besides Hadoop to take advantage of their unique strengths. Cloudera will continue to maintain excellent integration with all of them. Still, for leading-edge organizations, we see that Hadoop is increasingly becoming their central strategic platform. This is what we mean by The Platform for Big Data.

To learn more, read the Cloudera white paper, “The Platform for Big Data”.

Charles Zedlewski is Cloudera’s VP, Products.