Tools like Apache Spark bring scale to machine learning, and Cloudera Data Science Workbench brings Spark to data scientists. What happens when a data scientist wants to burst into the cloud to forge models at scale? Cloudera Altus, that’s what.
We’ve heard it a hundred times: big data is here, software is free and open, hardware is cheap and can be rented by the second, so start collecting data or be left behind. Data is great, but until it is refined and mined, it’s just a sunk storage cost. Thankfully, “data scientists” invented themselves to step up and answer some business questions from all of that data.
Some tools haven’t changed, just scaled up. Classic machine learning models have been adapted to huge data sets. New techniques, like deep learning, have also emerged that could only exist in an age of vast computational resources. Modeling on big data requires big compute, of course.
Interestingly, machine learning models can usually trade off time, resources, and quality. With distributed computing technologies like Spark, it’s easy to add more CPU and memory to make a job go faster. However, it’s often also possible to add resources in order to make the model output better, by evaluating more candidate models and fine-tuning them more extensively.
This makes the cloud, and its promise of easy bursts of scale, ideal for large-scale machine learning. Once a model-building process has been developed, and outputs a good-enough model, it can be run at huge scale in the cloud to find something even better.
The pieces necessary to do this already exist today in the Cloudera stack: Spark, Cloudera Data Science Workbench, and Cloudera Altus. With some care, they can be combined to deliver cloud scalability for modeling, even to data scientists working on an on-premise cluster.
Topic Modeling at Scale
Let’s begin with a problem that sounds like one a data scientist would solve: topic modeling. Given some text documents, what are the main themes that occur in them? One tool for this job is Latent Dirichlet Allocation (LDA). Like clustering algorithms, it puts documents into groups, and does so based on the words in the document. However, it can assign many groups to a document, and it also associates words with each group, so the more natural interpretation of the groups is “topics”.
(Image source: Fast Forward Labs)
We’ll apply this to a suitably big-data-sized target: all public domain texts from Project Gutenberg. The goal will be to produce a set of topics, and the words and texts associated with them, that best explain the structure of all this literature. How many topics there are, and what they are, is for LDA and a lot of computing to decide.
Model Development in the Workbench
It’s nothing that a little Spark code can’t solve, and this example will use its native Scala API. The Cloudera Data Science Workbench provides an interactive environment for developing data science pipelines, including Scala-based Spark jobs like this one, on a cluster. It’s not unusual to have access to compute and data on a long-running lab or production cluster that is sufficient for exploration and prototyping, but to later want to bring much more compute to bear temporarily, without provisioning that resource full-time in the main cluster.
Fortunately, this work is done already, at https://github.com/nishamuktewar/altus-lda-example. In fact, those who are only interested in Altus, or don’t have access to a Workbench and a Cloudera cluster, can skip the steps in this section.
For those with access to such a cluster, create a new project based on this GitHub repository, in order to follow along on your cluster. Click “New Project”, enter “altus-lda-example” for “Name”, and “Git” for “Initial Setup”. Enter the repository URL above and click “Create Project”.
The cluster resources defined in spark-defaults.conf should be tuned to better match your cluster. If in doubt, delete the file and let it use cluster defaults. Run a Scala session with at least 8GB of RAM.
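For reference, a spark-defaults.conf for a modest cluster might look something like the following. These values are purely illustrative; they are not the repository’s actual settings, and should be sized to your cluster’s capacity.

```
spark.driver.memory=4g
spark.executor.instances=4
spark.executor.cores=4
spark.executor.memory=8g
```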
A little bit of setup is required. See the README.md for instructions to obtain a preprocessed set of texts from the Project Gutenberg archive. This will mean using the Terminal feature of the Workbench to upload data straight from the internet to HDFS.
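As a sketch of that step (the real archive URL and HDFS path are in the README.md and will differ; the ones here are placeholders), a remote file can be streamed straight into HDFS from the Workbench Terminal without landing on local disk:

```shell
# Placeholder URL and path -- see README.md for the actual ones.
curl -sL https://example.com/gutenberg-preprocessed.tar.gz \
  | hadoop fs -put - /user/$USER/gutenberg-preprocessed.tar.gz
```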
Open src/main/scala/altus/AltusLDAExample.scala in the editor. It’s a Scala-based Spark program that’s ready to run in the shell — almost. It’s carefully written to compile into a JAR file using Apache Maven as part of a normal stand-alone Scala project, but also be usable with the Workbench’s interactive interpreter. This requires a little creativity, but it means the finished product can be shipped to the cloud with Altus. More on this later.
For now, note that there are two major blocks of code between START and END comments. Select and run both of them in turn.
Number crunching begins, as the job builds a couple different LDA models on the data set, evaluates them, and chooses the best one. For the interested, the source code contains commentary and detail on how an LDA pipeline is constructed and run in Spark MLlib, including preprocessing steps like TF-IDF.
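In outline, a Spark MLlib pipeline of this shape looks like the following. This is a simplified sketch, not the project’s exact code; the column names, vocabulary size, and topic count here are illustrative, and docs is assumed to be a DataFrame with one document per row in a “text” column.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.clustering.LDA
import org.apache.spark.ml.feature.{CountVectorizer, IDF, RegexTokenizer, StopWordsRemover}

// Tokenize raw text, drop stop words, compute term frequencies, weight by IDF,
// then fit LDA on the resulting feature vectors.
val tokenizer = new RegexTokenizer().setInputCol("text").setOutputCol("tokens")
val remover   = new StopWordsRemover().setInputCol("tokens").setOutputCol("filtered")
val counts    = new CountVectorizer().setInputCol("filtered").setOutputCol("tf").setVocabSize(65536)
val idf       = new IDF().setInputCol("tf").setOutputCol("features")
val lda       = new LDA().setK(10).setMaxIter(100)

val pipeline = new Pipeline().setStages(Array(tokenizer, remover, counts, idf, lda))
val model = pipeline.fit(docs)
```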
It then prints some detail about the topics and saves the model. Follow the link to the Spark UI at the top right to see detail about the job’s execution. It could take half an hour, so feel free to take a coffee break. Congratulations: you’re doing data science.
Notice topic output like:
It’s a not-incoherent group of 19th-century books, several about the American Civil War. That is a good sign, but it’s not obvious that a topic involving “sidenote” and “walla” has any narrow, interpretable theme. More topics might be needed.
The best model is (optionally) saved on HDFS and can be loaded and used by another team in production, to assign topics to 19th-century fiction as it arrives in the cluster in real-time. Some would say, “mission accomplished.” However, the model was trained on only a small sample of the input data, and only a few small topic counts were tried. The reported perplexity for a small number of topics was high.
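Reloading and applying a saved model could look like this. It’s a sketch that assumes the whole fitted pipeline was saved; the HDFS path is hypothetical, and newDocs stands in for a DataFrame of incoming documents with the same input column the pipeline was trained on.

```scala
import org.apache.spark.ml.PipelineModel

// Hypothetical path -- use whatever location the modeling job saved to
val model = PipelineModel.load("hdfs:///user/ds/lda-model")

// Adds a topicDistribution column giving each document's mix of topics
val scored = model.transform(newDocs)
```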
If there’s anything cooler than modeling on 10% of data, it’s modeling on 100% of data, or more. Also, it’ll be obvious to the data scientist that the job should try more hyperparameter values (here, the number of topics, the k hyperparameter). It’s trivial to change the arguments to do so, but this small test already ran for half an hour on a small cluster. A comprehensive hyperparameter search over all of the data would take hundreds of hours to run. We don’t have that kind of time; business value is on the line.
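The search over k can be pictured along these lines. Again a sketch, not the project’s code: the candidate values and split are illustrative, and featurized stands in for the preprocessed feature vectors.

```scala
import org.apache.spark.ml.clustering.LDA

// Hold out a slice of the data for evaluation
val Array(train, test) = featurized.randomSplit(Array(0.9, 0.1))

// Fit a model per candidate k and keep the one with the lowest perplexity
// on held-out data (lower is better)
val candidates = Seq(10, 25, 50, 100, 250).map { k =>
  val model = new LDA().setK(k).setMaxIter(100).fit(train)
  (model, model.logPerplexity(test))
}
val (bestModel, bestPerplexity) = candidates.minBy(_._2)
```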
Easy Scale with Altus
Cloudera Altus is a platform-as-a-service that makes it easy and cost-effective to process large-scale data sets in the cloud. Altus’s first offering is a Data Engineering service, which makes it easy to run something like a Spark job in the cloud. It manages setting up and tearing down a transient cluster to accomplish this. With Altus, the same workload that ran on an on-premise Cloudera cluster can run on a huge, short-lived Cloudera cluster on a cloud provider like Amazon Web Services. Here, the same code that ran successfully on a small cluster will run at scale on AWS.
Normally, this kind of operational data engineering job might be handled by an engineering or devops team, not a data scientist. The code that was executed above was, however, structured as part of a normal software build that can produce a runnable JAR file, suitable for execution with Altus. Another team could take this output and deploy the modeling job with no further modification. This is already a big step forward from the norm, where engineering teams often completely rewrite the output of the data science team and hope it’s roughly the same.
This example will explore actually running on Altus from within the data scientist’s environment, the Cloudera Data Science Workbench. It’s a little unconventional, but quite possible.
Running a job on Altus currently means compiling the Spark job code to a JAR file. This means Scala- or Java-based jobs only for now; good thing it was written in Scala! The README.md explains how to compile the code within the Workbench using Maven, if desired, although a precompiled JAR is already available at: s3a://altus-cdsw-lda-example/altus-lda-example-1.0.0-jar-with-dependencies.jar
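Compiling in the Workbench Terminal is a standard Maven build, assuming Maven is available in the session (the artifact name depends on the project’s pom):

```shell
mvn -q package
# The shaded JAR appears under target/, e.g.
# target/altus-lda-example-1.0.0-jar-with-dependencies.jar
```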
The source data, like this JAR, needs to be copied to S3 in order to be available to the Altus job running on AWS. This raises an important point for workloads that are moving from an on-premise cluster: Altus works well for jobs whose input and output lives in the cloud. Here, it fits well, because the input is not sensitive and not huge (gigabytes) and so can be copied to the cloud, and the job is primarily compute-intensive. In fact, the preprocessed source data has already been uploaded to S3 and made public: s3a://altus-cdsw-lda-example/gutenberg/
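If you were copying your own input from an on-premise HDFS to S3 instead, a hadoop distcp along these lines would do it (the bucket name and credentials here are placeholders):

```shell
hadoop distcp \
  -Dfs.s3a.access.key="[AWS Access Key]" \
  -Dfs.s3a.secret.key="[AWS Secret Key]" \
  hdfs:///user/$USER/gutenberg \
  s3a://your-bucket/gutenberg/
```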
Running on Altus
Get started by logging in to Altus. Register if you do not already have an account.
Altus requires one-time setup in order to give access to resources under your cloud provider’s account. A configured Environment is required, and this would usually have already been created by an infrastructure team.
If you do not already have an Environment available, then set up AWS and Altus as outlined in the README.md under “Set Up AWS” and “Set Up Altus”. First, refer to the AWS Account Requirements to ensure your AWS account allows the operations Altus needs. The rest of this example will assume there is an environment called “AltusLDAExample” in region us-east-2 (US East – Ohio) and that you have an SSH key and the AWS access and secret key.
Return to the main Altus pane and choose “Jobs” from the left, and choose “Submit Jobs”. Fill out the details as follows:
- Job Settings
- Submission: Single job
- Job Type: Spark
- Job Name: AltusLDAExample
- Main Class: altus.AltusLDARunner
- Jars: s3a://altus-cdsw-lda-example/altus-lda-example-1.0.0-jar-with-dependencies.jar
- Application Arguments:
- Spark Arguments:
--driver-cores 1 --driver-memory 4g
--executor-cores 14 --executor-memory 16g
--conf fs.s3a.access.key="[AWS Access Key]"
--conf fs.s3a.secret.key="[AWS Secret Key]"
- Cluster Settings
- Cluster: Create New
- Uncheck “Terminate cluster once jobs complete” if you wish to try submitting several jobs, but don’t forget to shut the cluster down afterward
- Cluster Name: AltusLDAExample-cluster
- Service Type: Spark 2.x
- CDH Version: CDH 5.13 (Spark 2.2)
- Environment: AltusLDAExample
- Node Configuration:
- Worker: 3 x c4.4xlarge
- Compute Worker: 5 x c4.4xlarge
- Purchasing Option: Spot, $0.80 / hour
- SSH Private Key: (.pem file from the SSH key above)
- Cloudera Manager: (pick any credentials)
This runs an 8-node cluster of compute-heavy nodes, of which 5 are purchased at a discount as spot instances. Note that this also requires entering S3 credentials on the command line, which is simple but not the most secure option. See the Spark S3 documentation for more secure options.
It may take 10 minutes to provision the cluster and start the job. Once it is running, the Job can be accessed from “Jobs” at the left.
It’s possible to find the application logs in S3 at a path like altus-lda-example-log-archive-bucket/AltusLDAExample-cluster-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx/hdfs/logs as well (path will vary with the environment name and customized logs bucket).
Note that it’s possible, with the Altus CLI, to enable access to the cluster’s Cloudera Manager instance, for a full view into its activity, including the Spark UI, if desired. The SOCKS Proxy CLI command provided by Altus takes somewhat more setup, including opening SSH from your local machine in the Security Group that Altus created, but can be very useful for debugging.
Don’t forget to shut down the cluster, if you chose to not terminate it after the job completes.
The job will still take hours to complete on this 8-node cluster. The job is heavily parallelized (models, each using distributed resources, are themselves built in parallel) and this particular job can probably benefit from tens of compute workers before returns diminish. Depending on your patience and budget, feel free to increase the number of Compute Workers and executors above.
Once you get a feel for the job, and have toured Altus, also feel free to cancel the job. If it runs to completion, more coherent topics appear in the driver log, like:
Spark has brought the possibility of large scale modeling to data scientists, and tools like the Cloudera Data Science Workbench have made it easier to access Spark on a cluster. Now, it’s even possible to access extra scale in the cloud for a modeling job developed on-premise, using Cloudera Altus. If you’re modeling successfully on an on-premise cluster, consider whether you can upgrade your model output with a burst of scale from the cloud.
Interested in learning more about Altus? Get started with the documentation.
Cloudera Data Science Workbench is available to Cloudera customers who license Cloudera Enterprise Data Hub and Cloudera Data Engineering editions.