How the SAS and Cloudera Platforms Work Together


On Monday, April 29, Cloudera announced a strategic alliance with SAS. As the industry leader in business analytics software, SAS brings a formidable toolset to bear on the problem of extracting business value from large volumes of data.

Over the past few months, Cloudera has been hard at work with the SAS team to integrate a number of SAS products with Apache Hadoop, so that our customers can use these tools with data on the Cloudera platform. In this post, we describe the major mechanisms available for connecting SAS to CDH, Cloudera’s 100% open-source distribution including Hadoop.

SAS/ACCESS to Hadoop

SAS/ACCESS lets SAS work natively with data sets stored in Hadoop. With SAS/ACCESS to Hadoop:

  • LIBNAME statements make Hive tables look like SAS data sets, so that SAS procedures and SAS DATA steps can operate on them.
  • PROC SQL commands provide the ability to execute Hive SQL commands directly on Hadoop.
  • PROC HADOOP provides the ability to directly submit MapReduce, Apache Pig, and HDFS commands from the SAS execution environment to your CDH cluster.

The SAS/ACCESS interface is available from the SAS 9.3M2 release and supports CDH 3U2 as well as CDH 4.01 and higher.

SAS/ACCESS enables users familiar with the SAS interface to operate seamlessly on data stored in Hadoop while bringing the power of SAS to these data sets.

SAS High Performance Analytics (HPA)

SAS HPA brings the ability to create analytical models using entire data sets, without down-sampling, while quickly iterating over multiple models to find the right solution for the problem. Its models have been implemented from the ground up to be parallelizable, which is what keeps response times short during that iteration.

Built on a distributed, in-memory architecture that scales with cluster size, SAS HPA is a perfect fit for the Cloudera system architecture:

  • SAS agents are deployed on each node of the CDH cluster.
  • As users request analytics, SAS agents load the required data sets into memory from HDFS.
  • Once data is loaded, the agents communicate with each other to execute analytical queries on the data and return results straight from memory.
  • As new nodes come online, SAS agents can take advantage of additional resources available in the cluster.
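
The pattern in the list above (per-node agents, partitions held in memory, a combined answer over the full data set) can be illustrated with a toy Python sketch. This is not SAS HPA code or its API: the `Agent` class, the `partition` helper, and the choice of a one-variable least-squares fit are all assumptions made for illustration, and a plain Python list stands in for data loaded from HDFS.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Row = Tuple[float, float]  # one (x, y) observation


@dataclass
class Stats:
    """Sufficient statistics for a one-variable least-squares fit."""
    n: int = 0
    sx: float = 0.0
    sy: float = 0.0
    sxx: float = 0.0
    sxy: float = 0.0

    def merge(self, other: "Stats") -> "Stats":
        # Merging is plain addition, so partitions can be combined in any order.
        return Stats(self.n + other.n, self.sx + other.sx, self.sy + other.sy,
                     self.sxx + other.sxx, self.sxy + other.sxy)


class Agent:
    """Hypothetical stand-in for a per-node agent holding one partition in memory."""

    def __init__(self, rows: Sequence[Row]) -> None:
        self.rows = list(rows)  # simulates loading a partition from HDFS

    def partial_stats(self) -> Stats:
        s = Stats()
        for x, y in self.rows:
            s = s.merge(Stats(1, x, y, x * x, x * y))
        return s


def partition(rows: Sequence[Row], n_agents: int) -> List[Agent]:
    """Spread rows round-robin over n_agents simulated nodes."""
    return [Agent(rows[i::n_agents]) for i in range(n_agents)]


def fit_line(agents: Sequence[Agent]) -> Tuple[float, float]:
    """Combine every agent's partial statistics and fit y = slope * x + intercept."""
    total = Stats()
    for agent in agents:
        total = total.merge(agent.partial_stats())
    denom = total.n * total.sxx - total.sx ** 2
    slope = (total.n * total.sxy - total.sx * total.sy) / denom
    intercept = (total.sy - slope * total.sx) / total.n
    return slope, intercept


# Every row contributes to the fit; nothing is sampled away.
rows = [(float(x), 2.0 * x + 1.0) for x in range(1000)]
slope3, intercept3 = fit_line(partition(rows, 3))
# "Adding a node" here just means spreading the same rows over more agents.
slope4, intercept4 = fit_line(partition(rows, 4))
```

Because each agent only reports a small summary of its partition, the fit uses every row and gives the same answer whether the data is spread over three simulated nodes or four.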

SAS HPA can operate in parallel with MapReduce and Cloudera Impala to provide another powerful computational framework that can operate on data stored in CDH.

SAS HPA includes support for high-performance statistical methods, data-mining operations, econometric models, text mining, optimization techniques, and many more types of models.

A full list of supported features is available here. SAS HPA is supported on CDH 4.01 and higher.

SAS Visual Analytics

SAS Visual Analytics provides rich data visualization capabilities and sophisticated analytic techniques directly to end users, whether business users with limited technical skills or data scientists. Flexible and sophisticated reporting, forecasting, and charting on all your data can be generated in seconds using the in-memory computation capabilities of SAS Visual Analytics.

SAS Visual Analytics is built on the SAS LASR Analytics Server, a high-performance in-memory engine that can leverage the capabilities of a CDH cluster:

  • SAS LASR daemons run on each node of your Hadoop cluster.
  • Based on administrative policies, data is loaded into LASR daemons from HDFS.
  • As users log in to perform new analyses, the analytics engine distributes computation across the nodes, which generate results and return them to the visualization layer for presentation.
  • As new nodes come online, SAS LASR daemons can take advantage of the resources available on these nodes as well.
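
The load-once, query-many behavior described above can be sketched in a few lines of Python. Again, this is a conceptual toy and not the SAS LASR interface: the `Daemon` class, the `query` function, and the sample sales rows are hypothetical, and the "HDFS" source is an ordinary list.

```python
from collections import Counter
from typing import Dict, List, Sequence

Record = Dict[str, object]


class Daemon:
    """Hypothetical per-node daemon: loads its partition once, then answers from memory."""

    def __init__(self, source_rows: Sequence[Record]) -> None:
        self._source = source_rows  # stands in for this node's data on HDFS
        self._table: List[Record] = []
        self.loads = 0

    def load(self) -> None:
        self._table = list(self._source)
        self.loads += 1

    def sum_by(self, key: str, value: str) -> Counter:
        """Partial group-by sum over this daemon's in-memory partition."""
        part: Counter = Counter()
        for row in self._table:
            part[row[key]] += row[value]
        return part


def query(daemons: Sequence[Daemon], key: str, value: str) -> Dict[str, int]:
    """Fan the query out to every daemon and merge the partial results."""
    total: Counter = Counter()
    for daemon in daemons:
        total.update(daemon.sum_by(key, value))  # Counter.update adds counts
    return dict(total)


sales = [
    {"region": "east", "product": "a", "units": 3},
    {"region": "west", "product": "a", "units": 5},
    {"region": "east", "product": "b", "units": 2},
    {"region": "west", "product": "b", "units": 4},
    {"region": "east", "product": "a", "units": 1},
]

daemons = [Daemon(sales[i::2]) for i in range(2)]
for d in daemons:
    d.load()  # done once, e.g. when an administrator's policy triggers it

# Two different "charts" are answered from the same in-memory copy.
by_region = query(daemons, "region", "units")
by_product = query(daemons, "product", "units")
```

The point of the sketch is that the expensive step (`load`) happens once per daemon, while each new analysis only touches memory and merges small partial results for the presentation layer.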

SAS Visual Analytics is supported on CDH 4.01 and higher.

SAS Data Management Advanced

  • SAS Data Management Advanced includes support for CDH as a data source or data target for Data Integration Studio.
  • Besides standard transforms for data on Hadoop, the Studio also provides the ability to integrate Apache Hive, Pig, MapReduce, and HDFS commands as part of a data flow.
  • SAS DataFlux tools can be used on data in HDFS since CDH is treated as another data source by SAS.
  • SAS Metadata Server can be used to record metadata based on data in CDH.
  • Lineage tools are also available within SAS Data Management Advanced so that all SAS processing that is done on Cloudera platforms can also be tracked.

Together, the capabilities of SAS Data Management Advanced make CDH easy to integrate as a data source or sink in a SAS management environment with minimal disruption, and make the capabilities of the SAS Data Management suite available to data on Hadoop as well.

Figure: Overview of SAS Analytics on Cloudera

SAS and Cloudera Looking Forward

In summary, SAS provides a rich and familiar set of data analytics tools that are compelling to users who wish to take advantage of the storage and computational capabilities unlocked by a Hadoop cluster. New products such as SAS HPA and SAS Visual Analytics provide unparalleled performance built on the highly scalable architecture of CDH, providing quicker analysis and insight into your data for solving crucial business problems.

While we’ve made great progress in enabling support for SAS on our platform, Cloudera and SAS continue to work closely together to develop richer integrations that combine the power of Hadoop with the capabilities of the SAS product suite. Stay tuned for more updates in the coming months!

Jairam Ranganathan is a product director at Cloudera.