How-to: Train Models in R and Python using Apache Spark MLlib and H2O

Categories: Data Science How-to Spark

Creating and training machine-learning models is more complex on distributed systems, but there are lots of frameworks for abstracting that complexity.

There are more options now than ever from proven open source projects for doing distributed analytics, with Python and R become increasingly popular. In this post, you’ll learn the options for setting up a simple read-eval-print (REPL) environment with Python and R within the Cloudera QuickStart VM using APIs for two of the most popular cluster computing frameworks: Apache Spark (with MLlib) and H2O (from the company with the same name). To compare these approaches, you’ll train a linear regression against a data set with known coefficients.

Invoke Spark from Python using PySpark

Starting with Spark 0.7.0, Spark includes PySpark (supported by Cloudera), the Python API for Spark. The PySpark API allows you to interact with Spark data objects including RDDs and DataFrames. In Spark, a Resilient Distributed Dataset (RDD) is the abstract reference to the data for a user. Starting with Spark 1.3.0, Spark includes DataFrames, which contain RDD data but are enhanced with schema and additional DataFrame functions. When using an RDD or DataFrames for analytics with Spark MLlib, each row can intuitively be understood as an observation when fitting a model.

Just like Scala and Java’s interface with MLlib, a Python RDD must be in LabeledPoint form to use the MLlib algorithms. Once you have invoked your Spark instance in Python using PySpark, you can run a linear regression as below:

Note that there is no requirement to identify the response variable in the call to train the model, that information is inherent in the LabeledPoint form.

Invoke Spark from R using R API (SparkR)

Starting with Spark 1.4.0, Spark includes SparkR (not currently supported by Cloudera). This R API also has DataFrames and it incorporates some of the local R data.frame functionality, just as PySpark allows some familiar column referencing by attribute or indexing. However, SparkR goes further than just data.frame to maintain familiarity with its localized form; R also retains the syntax to train models and predict by overwriting base R functions. (For those who had worked with SparkR previous to its inclusion in Spark, this was the reason for ensuring the SparkR library loaded last.) This approach lets R users maintain familiar syntax and even reuse local R object conveniences such as formula notation, which clearly has huge appeal.

However, it’s important to note that SparkR will not simply become a "distributed version of R." Algorithms within R are not all parallel-izable and therefore may not be able to take advantage of Spark’s distributed parallel-processing framework. SparkR is also lagging behind PySpark in the number of parallel distributed algorithms available. (At the time of this writing, the SparkR glm function is the only distributed, parallel machine-learning function available.) It may take some time for SparkR to achieve parity with PySpark regarding available machine-learning methods.

Here is the SparkR snippet of a generalized linear model:

Note that the glm call is syntactically identical to the local R form.

Connect to the H2O Application using Python H2O Library

H2O is open source software for data analytics at scale. A design objective of H2O is speed and flexibility to allow users to fit hundreds or thousands of potential models as part of discovering patterns in data. Similar to Spark, H2O runs an application with driver and worker resources. Like Spark, this application can be accessed via R and Python APIs. In addition, the H2O application also has a built-in GUI that is accessed via browser.

These applications can run within multiple frameworks including a standalone environment included with H2O, on YARN, or on Spark. After an H2O application is running, using the Python H2O library; you can connect, load data, and train the following glm:

Note that the response and feature fields accept python dataframe manipulations. In this context, dat is an H2O frame object.

Connect to the H2O Application using R H2O Library

The same H2O application instance created for the Python example above, could be used for this R example. Which makes it interesting that you can actually share models created from Python in R and vise-versa. After an H2O application is running, using the R H2O library, you can connect, load data, and train the following glm:

Note that formula isn’t accepted in h2o.glm, you will need to separately pass the response field and feature fields as a character vector. However, in addition to the R module user guide for H2O, H2O has further accommodated R users providing a booklet which provides example detail for items like handling factor columns.


We must acknowledge that cluster computing will inherently increase the complexity of modeling over localized modeling. If the data isn’t large enough to warrant cluster resources, data scientists should be content to continue using CRAN R packages or python scikit.

Interestingly, the approaches above mitigate their own adoption complexity in different ways:

  • PySpark has a pattern adoption of MLlib that makes the PySpark use of MLlib intuitive for Scala or Java MLlib users. This may also benefit PySpark in being able to more quickly adapt MLlib for PySpark.
  • SparkR has only the glm function. However to speculate from that single function, adoption is simplified by the syntax being identical for both SparkR and local R dataframes.
  • H2O simplicity isn’t immediately apparent. However, not evaluated above is the graphical user interface provided by Flow. Providing a GUI and parameter dropdown functionality definitely lowers the barrier to entry for the code-adverse. H2O is hosting to a larger audience than just R and Python coders.

Run these demos yourself using Vagrant!

Brad Barker is a Solutions Architect at Cloudera.


3 responses on “How-to: Train Models in R and Python using Apache Spark MLlib and H2O

  1. awanish kumar

    Really interesting article. I am running machine learning in Python but its not using more that 20% resources on standalone. Will it use all the resources on in cluster PySkark?

  2. Brad Barker

    The article was to highlight the differences in syntax, so I hadn’t mentioned launch methods, however…

    For SparkR and PySpark, Cloudera recommends Spark on YARN ( to leverage the existing resource manager and not compete for resources, but standalone ( is still possible.

    For H20, you again have the ability to run within YARN and specify resource arguments at launch. (

    In either case when using YARN, be sure to check your memory configurations. For CDH, this is done in Cloudera Manager YARN -> Configuration.

  3. satish

    We have models written in R (cran.R). How do we invoke these existing models written in cran.R from “Spark”? Will existing models in R run in distributed/scalable manner as MLLIB?
    Again we have “SparkR” is this R version provided by spark – which is different from cran.R. Does sparkR provide all ML libraries as provided by cran.R?

    Some of the models like “MaxEntropy” are not available in MLLIB but available in cran.R – can we invoke such models from spark?