Sustained Innovation in Apache Spark: DataFrames, Spark SQL, and MLlib

Categories: CDH Spark

Cloudera has announced support for Spark SQL/DataFrame API and MLlib. This post explains their benefits for app developers, data analysts, data engineers, and data scientists.

In July 2015, Cloudera re-affirmed its position since 2013: that Apache Spark is on course to replace MapReduce as the default general-purpose data processing engine for Apache Hadoop. Thanks to initiatives like the One Platform Initiative, Spark continues its march toward comprehensive enterprise readiness, and its adoption is rapidly spreading across industries and diverse use cases. More important, there is continuous and sustained innovation in the Spark ecosystem, and in this post, I’d like to highlight two components that exemplify this innovation: Spark SQL and MLlib.

Spark SQL & DataFrames: A Big Step Forward for Application Development

The popularity of Spark boils down to three key aspects: the ease of programming in Spark, its flexibility, and its significant performance advantage over preceding batch-processing frameworks. With Spark SQL, Spark makes a major leap forward along these three dimensions.

The fundamental innovation with Spark SQL and the DataFrame API is that there’s now a separation between the logical plan and the physical plan in Spark; on other words; there is an intelligent optimizer between the user facing API and the underlying execution layer. This separation provides an opportunity to simplify the API layer, while simultaneously improving performance.

Ease of Programming

A DataFrame is a distributed collection of data organized into named, typed columns. It is conceptually equivalent to a table in a relational database or a data frame in R or Python Pandas. Coding with records that have named, typed columns is always easier than coding with arbitrary Java, Scala, or Python objects.

Spark SQL is essentially the processing of these DataFrames via SQL statements. SQL is arguably the most widely used “programming language,” and it makes big data processing on Spark available to a wider audience beyond just big data software engineers.

Flexibility: Seamlessly Mix SQL with Regular Spark Operations on RDDs

The ability to mix SQL with Spark RDD operations is a significant value proposition of Spark SQL. SQL has a broad audience and provides an easy way to author computation of data that has structure. However, SQL does not provide the full flexibility that one would get while writing code in Scala, Java, or Python. In Spark SQL, it is very easy to convert DataFrames to RDDs and vice versa. Thus, you can leverage the ease of SQL where applicable, and fall back to regular operations on RDDs for more complex processing that can not be expressed in SQL.

For the operations authored in SQL, you automatically get the performance optimizations of the query engine, which is quite compelling even if you are an expert coder. Let us take a simple example, for illustrative purposes: If you write code to join two RDDs, and then filter the joined RDD, the code will be executed in the order in which it was written. However, if the same computation is expressed as SQL, the SQL engine will automatically determine the optimal ordering of operations. If applying the filter operation before the join is more efficient, the query engine will automatically do so. The SQL engine comes with many substantial optimizations under the hood (which we touch upon in the next paragraph), which you get by default when you embed SQL in your Spark code.

Performance Optimizations

The Spark SQL query optimization engine converts each SQL query to a logical plan, which then gets converted to multiple physical execution plans, and then the most optimal physical plan is selected for execution. Some of the optimizations applied by the query engine include:

  • Compact binary in-memory data representation, leading to lower memory usage and reduced reliance on the JVM garbage collector
  • Code generation to convert expressions in SQL to Java byte code
  • Predicate pushdown to reduce I/O
  • Optimal pipelining of operations
  • Optimal join strategy selection (sort-merge, hash, broadcast join, etc.)

Interoperability with Other Hadoop Engines

As an integrated part of Cloudera Enterprise and the Hadoop ecosystem, Spark SQL benefits from the shared data, metadata, and enterprise features necessary to power production workloads. Spark SQL complements the capabilities of other best-of-breed access frameworks, such as Impala, Apache Hive, and Apache Solr, so you always have the right tool at your disposal for a variety of workloads.

In Hadoop, data with schemas are often registered in the Hive Metastore and shared with tools like Hive and Impala for interoperability as data is processed (with Hive) and opened up for broader BI analytics (with Impala). Cloudera has worked to ensure that this interoperability holds true for Spark SQL as well. Using the HiveContext, Spark applications can seamlessly read from or write to Hive tables. Now you can use Spark SQL for fast batch processing of Hive tables, and Impala for low latency BI on the same tables.

Another noteworthy feature of Spark SQL is the fact that it can process SQL queries authored in the HiveQL dialect, as well as work with Hive SerDes and UDFs. This enables Hadoop practitioners that are well versed in HiveQL to continue to use it in Spark SQL. Spark SQL supports most HiveQL features, and the supported HiveQL features are documented in the Spark Programming Guide.

MLlib: Democratizing Distributed Machine Learning on Big Data

From its early days, Spark was a big hit in the data science and machine-learning community. Its ability to cache data and efficiently run iterative algorithms made it an ideal platform for machine-learning algorithms, which almost always require some iterative processing in the model-training phase. The community capitalized on this trend and decided to add a library of machine-learning algorithms, called MLlib, as part of the Spark project. MLlib has seen a steady addition of new algorithms, which have now been battle-tested and validated in real-world settings, on real-world big data. It contains a significant majority of mainstream machine-learning algorithms, a vibrant community, as well as books (authored by Cloudera’s own Data Science team) that describe how to use it with real-world examples.

Some popular algorithms in MLlib include:

  • Classifiers like logistic regression, boosted trees, random forests
  • Clustering algorithms like k-means and Latent Dirichlet Allocation (LDA)
  • Recommender system based on Alternating Least Squares (ALS)
  • Dimensionality reduction algorithms like Principal Component Analysis (PCA) and Singular Value Decomposition (SVD)
  • Frequent-pattern mining algorithms like Association Rule Mining

Feature-engineering Toolkit

In machine learning, the better part of an engineer’s time is spent in building features that feed into the model. A model is only as good as the features that are fed into the model. (After all, garbage in garbage out!) Some machine-learning practitioners would go so far as to say that the choice of algorithm (logistic regression, neural net, decision tree, etc) isn’t nearly as important as the feature engineering and feature-selection process.

MLlib comes with a pre-built collection of modules for feature engineering (and let’s not forget that, even without pre-built libraries, Spark by itself is ideal for feature engineering on big data). Popular feature engineering modules include:

  • StandardScalar and Normalizer for Feature Standardization
  • Word2Vec to convert textual data into a distributed vector representation that can be fed into a machine-learning model
  • TF-IDF to compute the importance of a term to a document in the corpus
  • ChiSqSelector to identify and select the most discriminating features

The productivity lift of staying within Spark for the end-to-end flow of data preparation, data transformation, feature engineering, model training, and model selection is incredibly compelling. MLlib extends Spark’s ease of use and performance to the machine learning sphere, effectively providing an open source platform that democratizes machine learning on big data.


Spark SQL and MLlib demonstrate continued momentum and innovation in the Spark ecosystem, and open the doors to big data processing for a wider audience and enable new use cases. Learn more about Spark and Hadoop at, give Spark a spin inside Cloudera’s latest release of CDH (CDH 5.5), and check out tutorials to help you get started.

Anand Iyer is a senior product manager at Cloudera. His primary areas of focus are platforms for real-time streaming, Spark, and tools for data ingestion into the Hadoop platform.