We are excited to announce the general availability of the Cloudera Altus SDK for Java, which lets you programmatically leverage the Altus platform-as-a-service for ETL, batch machine learning, and cloud bursting. Altus empowers customers and partners alike to run data engineering workloads in the cloud, leveraging cloud infrastructure such as AWS. Cloudera Altus also provides the ability to create data engineering pipelines using both a web console and a CLI.
The Cloudera Altus SDK for Java was developed to provide easier programmatic access through the popular Java programming language so that users can automate their data engineering workloads. Users can now write Java code to create and destroy clusters, submit data engineering jobs, and monitor job status. Figure 1 shows the architecture and process flow of Cloudera Altus.
Let us explore how to run data engineering workloads with the Cloudera Altus SDK for Java using the sample project found at https://github.com/cloudera/altus-sdk-java-samples. We'll walk through how to create clusters and submit jobs. Be sure to check out the GitHub project for other activities such as monitoring job status.
Because Altus is a Platform-as-a-Service (PaaS) offering, make sure you have the authorization required to allow Altus to provision and manage Cloudera clusters in your AWS account. To verify your authorization, log into Altus and select Altus Data Engineering. If your organization does not yet have access, please contact Cloudera sales.
Next, create an Altus environment if you do not already have one. In Altus, an "Environment" is an encapsulation of the cloud provider resources, such as security groups, needed to deploy a Cloudera cluster. Typically, users will have a particular set of resources provided by their IT group. The quickest way to get up and running is to use the environment quickstart, which creates all the AWS-side resources for you.
A bit more setup is required to run the sample program. In the SampleResources.ini properties file, update the following:

- EnvironmentName: the name of the environment you just created above
- outputLocation: a new folder in S3 to store your results
- ssh_public_key_location: the location of your SSH public key on your local machine. Refer to the Public SSH key property for additional information.
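With those properties filled in, the relevant section of SampleResources.ini might look like the fragment below. The values shown are placeholders for illustration only; substitute your own environment name, S3 path, and key location.

```ini
# Placeholder values -- replace with your own resources
EnvironmentName = my-altus-environment
outputLocation = s3a://my-bucket/altus-sample-output/
ssh_public_key_location = /home/me/.ssh/id_rsa.pub
```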
See README.md for setup instructions. The sample project is based on the Altus tutorials. The SparkAllInOneIntegration.java file demonstrates how to create and configure an Apache Spark cluster, submit a job analyzing Medicare procedure codes in a publicly available dataset, and then terminate the cluster once done.
After the necessary properties are set on the request object (omitted for brevity), the lines of code in Figure 2 create the cluster, queue the job for submission, submit the job once the cluster is successfully created, and then terminate the cluster upon completion. In Figure 2, line 104 shows how to automate cluster termination.
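The overall create/submit/terminate flow can be sketched in plain Java. Note that the `ClusterService` interface and its in-memory stub below are hypothetical stand-ins so the sketch runs without cloud credentials; they are not the actual Altus SDK classes. See SparkAllInOneIntegration.java in the sample project for the real request objects and calls.

```java
import java.util.HashMap;
import java.util.Map;

public class ClusterFlowSketch {

    // Hypothetical service mirroring the flow in the sample project;
    // the real Altus SDK exposes request/response objects instead.
    interface ClusterService {
        String createCluster(String name);            // returns cluster status
        String submitJob(String cluster, String jar); // returns a job id
        String terminateCluster(String name);         // returns final status
    }

    // In-memory stub so the sketch is runnable without an AWS account.
    static class StubClusterService implements ClusterService {
        private final Map<String, String> clusters = new HashMap<>();

        @Override public String createCluster(String name) {
            clusters.put(name, "CREATED");
            return clusters.get(name);
        }

        @Override public String submitJob(String cluster, String jar) {
            if (!"CREATED".equals(clusters.get(cluster))) {
                throw new IllegalStateException("cluster not ready: " + cluster);
            }
            return cluster + "-job-1";
        }

        @Override public String terminateCluster(String name) {
            clusters.put(name, "TERMINATED");
            return clusters.get(name);
        }
    }

    public static void main(String[] args) {
        ClusterService service = new StubClusterService();
        // 1. Create the cluster.
        System.out.println("cluster status: " + service.createCluster("medicare-spark"));
        // 2. Submit the job once the cluster is ready.
        System.out.println("submitted job: " + service.submitJob("medicare-spark", "s3a://bucket/medicare.jar"));
        // 3. Terminate the cluster when the job completes.
        System.out.println("final status: " + service.terminateCluster("medicare-spark"));
    }
}
```

The key design point, as in the sample project, is that termination is part of the same automated flow as creation and submission, so short-lived clusters are cleaned up without manual intervention.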
This process can take around 10 minutes, a good time to get some coffee and catch up on the Altus documentation! Progress can be viewed in the Altus console, as shown in Figure 3, or with the Altus SDK for Java, as shown in the pollClusterStatus method in Figure 4.
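A pollClusterStatus-style loop can be sketched as follows. The `Supplier` stands in for a real status call to the SDK, and the terminal state names and polling interval are assumptions for illustration; consult the sample project for the exact method used in Figure 4.

```java
import java.util.Iterator;
import java.util.List;
import java.util.function.Supplier;

public class PollSketch {

    /**
     * Polls statusCheck until it reports a terminal state, sleeping
     * between attempts, and returns the final status observed.
     * "CREATED"/"FAILED" are assumed terminal states for this sketch.
     */
    static String pollClusterStatus(Supplier<String> statusCheck,
                                    long intervalMillis) throws InterruptedException {
        while (true) {
            String status = statusCheck.get();
            System.out.println("cluster status: " + status);
            if ("CREATED".equals(status) || "FAILED".equals(status)) {
                return status; // terminal state reached, stop polling
            }
            Thread.sleep(intervalMillis); // wait before the next check
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Fake status sequence standing in for real SDK responses.
        Iterator<String> fake =
            List.of("PROVISIONING", "PROVISIONING", "CREATED").iterator();
        System.out.println("final: " + pollClusterStatus(fake::next, 10L));
    }
}
```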
Upon successful completion, the results are stored in the S3 location specified in the outputLocation property above.
The sample project also contains examples of how to create and submit Apache Hive, Apache MapReduce, and Apache Spark jobs.
The Cloudera Altus SDK for Java has a roadmap that reflects new functionality being added to the Altus web console. The source code for the Cloudera Altus SDK for Java is available in the Altus SDK for Java GitHub repository, and the jar file is located here.
To get started with Altus, visit us at tiny.cloudera.com/altus.