Since the launch of sparklyr, working with Apache Spark in Apache Hadoop has become much easier for R users. sparklyr contains a dplyr interface into Spark and allows users to leverage crucial machine learning algorithms from Spark MLlib and H2O Sparkling Water. This greatly reduces the barrier of entry for R users in adopting Spark as a tool for big data and should go a long way in enabling R workloads to migrate to Hadoop.
Data Scientists have been on the front lines in the lead toward cloud infrastructure. Their exploratory, transient, and bursty workloads tend to be much better suited for the cloud than an always-on bare metal deployment. At the same time, there is a desire to minimize the management burden of cloud environments as much as possible while reducing costs. In this blog post we’ll demonstrate how to automate a sparklyr environment on AWS using Cloudera Director.
Since sparklyr only needs libraries on the node on which it is running, we’ll spin up a cluster that contains an edge node with everything it needs. The main sequence of events will be:
- Install R + related dependencies
- Install RStudio Server
- Create R user and set environment variables
- Install sparklyr
When the cluster is ready, the user will be able to login to the RStudio Server and start working. By leveraging Cloudera Director bootstrap scripts, we can automate this entire process to produce sparklyr environments on demand for data science teams.
Creating a sparklyr Cluster using Cloudera Director Server UI
Follow the Cloudera documentation for installing the Cloudera Director server. Although we’ll use AWS as an example here, Cloudera Director can work with Google Cloud Platform and Microsoft Azure as well. Full information on deploying on AWS can be found in our documentation, but you’ll want to do the following:
- If an environment does not already exist, create one.
- If creating a new environment, you’ll be prompted to create a host for Cloudera Manager. This host will deploy the rest of the environment. Follow the instructions in our documentation. If you already have a Cloudera Manager in an environment that you would like to deploy a sparklyr cluster, you can simply add a new cluster to that instance of CM.
- On the next screen you’ll create the actual cluster for sparklyr. Name the cluster, and choose “Core Hadoop with Spark on YARN” or “All Services” as your services. For the Instance Groups, specify the number of master and worker nodes you’d like, and include at least 1 gateway. This is where you sparklyr will be installed.
- Each Instance Group needs a template. Refer to the documentation for assistance in setting that up, but the sparklyr gateway needs a separate template that includes a bootstrap script that sets up sparklyr. Use the script located here as the bootstrap. You can either upload the file or cut and paste it into the UI.
5. Your setup should look something like below. Click Continue and director will begin installing your cluster.
Note that you can easily strap on a sparklyr node to an existing cluster by navigating to the cluster in the Director UI and selectin to “modify” it. Here you’ll be able to add a new group, where you can specify the newly created sparklyr template and deploy the new node.
Once the cluster is finished, log onto your gateway node and run the following:
sudo -u hdfs hadoop fs -mkdir /user/rsuser
sudo -u hdfs hadoop fs -chown rsuser:rsuser /user/rsuser
This completes the setup needed in order for rsuser to be able to use sparklyr.
Now navigate to RStudio at <sparklyr-gateway-hostname>:8787 and login with rsuser/cloudera. All the packages you need to start working with sparklyr are already installed, so you can immediately start working.
Creating a sparklyr Cluster using Cloudera Director Client
Cloudera Director also contains a client that allows your to spin up clusters via the command line. It doesn’t require use of the Director UI (which some may prefer), and it also allows us to leverage post-installation scripts, which currently are not exposed in the Director UI. This means we can automate the creation of /user/rsuser directory in HDFS as well. The Director client requires a configuration file that essentially defines exactly the same type of information that we put into the web UI. A template for a sparklyr deployment is located here.
Once the director client is installed, simply run the following command to kick off the installation of a cluster:
cloudera-director bootstrap-remote director-sparklyr.conf –lp.remote.username=admin –lp.remote.password=admin –lp.remote.hostAndPort=<director-server-hostname>:7189
When Director is finished spinning up the cluster, you can log directly into <sparklyr-gateway-hostname>:8787 and start working with R.
By combining Cloudera Director and sparklyr, R users are able to easily and quickly create environments that leverage Spark and are able to harness the power of big data. For more information on working with sparklyr, visit the official page.
Learn more about Sparklyr in an upcoming webinar with our partner R Studio