The following article by Ciaran Dynes was reposted from the Talend blog with their permission.
As you may have read, Talend recently announced its support for Cloudera Altus, a newly released Platform-as-a-Service (PaaS) offering that simplifies running large-scale data processing applications in the public cloud. For us, supporting Altus at launch was the absolute easiest decision given that so many of our customers are looking to realize the cost, ease, and scalability benefits of the cloud. With our support for Altus, Talend users can develop and test their data pipelines within Talend Studio and directly deploy to their Altus cluster. It’s that simple.
Talend is the first integration vendor to support Cloudera Altus. Using Talend in conjunction with Cloudera Altus, companies can reduce their overall data processing costs, and simplify their big data deployments.
What problem does Cloudera Altus address?
Cloudera Altus is a managed service that makes it easy to run data engineering workloads in the cloud. There are 3 key things to highlight:
- Altus-deployed clusters run where the data is already stored, and jobs executed against these clusters can read and write directly to Amazon S3 rather than copying or loading the data from somewhere else.
- Altus supports a usage-based consumption model, which allows users to manage their data processing more cost-effectively.
- Lastly, given that Altus is based on the enterprise Cloudera platform, existing Cloudera customers can easily migrate workloads between their on-prem and cloud environments.
Initially, Altus can deploy Cloudera clusters on AWS, but Cloudera plans to support other public clouds, such as Microsoft Azure, in the future.
Why the Altus and Talend Collaboration Makes Sense
As we noted in the press release, Altus is important for our customers making the move to the cloud because it allows companies to deploy big data projects dramatically faster, with far less operational support. Talend is then helping to extend this proposition by making it incredibly easy to build and quickly deploy intelligent data pipelines onto the Altus platform. With Talend, developers can fully focus on the design of their data pipelines without the need to write code, and at the same time, Altus takes care of the cluster management and operations.
Working with Altus and Talend
Let’s take a closer look at the solution and how you can leverage all of the capabilities of Altus for managing and monitoring big data applications in the cloud.
As noted earlier, Altus runs on AWS, which means jobs run against Altus-deployed clusters and consume data from the S3 object store. Once configured, the service uses your AWS credentials. For example, Altus can provision a Spark cluster based on the configuration defined by the user. Note: Clusters are transient, so you need to manage where data is stored, and take into consideration that IP addresses change between executions.
Cloudera Altus comes with a management console and a command-line interface that allow setup and configuration of your cluster and management of user accounts. Talend users simply need to use Talend Studio to work with Altus. The seamless integration makes it easy to provision a new Cloudera Altus cluster by entering the desired configuration (number of worker nodes, type of Amazon instances, and location of the AWS S3 buckets…) from within Talend Studio.
Once the development is all set, the user submits a job to Cloudera Altus from within Talend Studio, which can then be monitored from the Altus Console with final results stored directly in S3.
Let’s take a simple example of analyzing data from SAP in Altus
First off, within the Altus Console, you can see Talend jobs that have been deployed already. An Altus Environment encapsulates AWS resources, such as which regions you are using. You can have as many environments as you wish, for example dev, test, prod.
To configure your cluster, you simply need to provide your AWS credentials. Altus takes advantage of AWS cross-account access roles to establish a trust relationship that lets it run actions within your account.
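For readers unfamiliar with cross-account access roles, here is a minimal sketch of the kind of IAM trust policy such a role typically carries. The account ID and external ID are placeholders chosen for illustration, and the exact policy that Altus requires may differ, so treat this as an assumption rather than the documented Altus setup.

```python
import json


def build_trust_policy(trusted_account_id: str, external_id: str) -> dict:
    """Build an IAM trust policy that lets another AWS account assume a role.

    Both arguments are placeholders here; real values would come from the
    service that needs access to your account.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # The account that is trusted to assume this role.
                "Principal": {"AWS": f"arn:aws:iam::{trusted_account_id}:root"},
                "Action": "sts:AssumeRole",
                # An external ID guards against the "confused deputy" problem.
                "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
            }
        ],
    }


# Hypothetical values, for illustration only.
policy = build_trust_policy("123456789012", "example-external-id")
print(json.dumps(policy, indent=2))
```

The trust policy only says who may assume the role. What that role is allowed to do inside your account is governed by a separate permissions policy attached to it.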
In this simple example, we aggregate very large volumes of customer data to process revenue consolidation at the end of the month.
You need to develop two jobs in Talend that:
- Extract data from SAP and load it into Amazon S3
- Run a Spark job on Cloudera Altus to aggregate the SAP data
The first job pulls data from a local SAP instance and moves it to S3 in the cloud, which is a prerequisite for using Cloudera Altus. Note: the design and running of this ingestion job can be easily orchestrated using Talend Integration Cloud.
The second job takes advantage of Altus, by processing the data (stored previously in the Amazon S3 bucket) using an Altus-deployed Spark cluster.
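To make the aggregation step concrete, the following plain-Python sketch shows the shape of a month-end revenue consolidation: filter to one month, then sum amounts per customer. It is not the Spark code that Talend generates; the row layout, field names, and sample values are assumptions made for illustration. In the Spark job, the same filter-and-sum would be distributed across the cluster's worker nodes.

```python
from collections import defaultdict
from decimal import Decimal

# Hypothetical sample rows: (customer_id, invoice_month, amount).
# Real SAP extracts would have a different, richer layout.
rows = [
    ("C001", "2017-05", "1200.50"),
    ("C002", "2017-05", "300.00"),
    ("C001", "2017-05", "799.50"),
    ("C001", "2017-04", "50.00"),
]


def consolidate_revenue(records, month):
    """Sum invoice amounts per customer for a single month."""
    totals = defaultdict(Decimal)  # Decimal() starts each total at 0
    for customer_id, invoice_month, amount in records:
        if invoice_month == month:
            # Decimal avoids binary floating-point rounding on currency.
            totals[customer_id] += Decimal(amount)
    return dict(totals)


monthly = consolidate_revenue(rows, "2017-05")
for customer_id in sorted(monthly):
    print(customer_id, monthly[customer_id])
```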
From a technical standpoint, Talend provides a graphical set of components to enable the developer to design the job. The job gets converted into a Spark program that runs natively on the Altus cluster, via the Cloudera Altus API. The user stores the Spark program on S3, and Altus then executes it on the Spark cluster. When using Altus clusters, the Amazon instances that were spun up are terminated just after the processing, leaving only the data processed by the job in an S3 bucket.
Once the job is submitted, it is possible to monitor it in the Cloudera Altus console. You can also see the type of job running, the activity log and job status. Further monitoring capabilities are exposed on the read-only Cloudera Manager console.
Once the job completes, you can see if the output file was successfully created in the S3 bucket via the AWS Management Console.
Altus will then terminate the cluster and clean up any resources. Altus ensures that all AWS instances are shut down and terminated once the processing is over. For troubleshooting and auditing, technical logs are also persisted to S3, so that even short-lived clusters are auditable.
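The provision, run, terminate lifecycle described above can be pictured with a small sketch. Everything here is hypothetical: `FakeCluster` and `transient_cluster` are in-memory stand-ins, not Altus APIs, and the `audit_log` list merely plays the role of the logs that the article says are persisted to S3. The point is only that cleanup and logging happen whether or not the job succeeds.

```python
from contextlib import contextmanager


class FakeCluster:
    """In-memory stand-in for a transient cluster (not an Altus API)."""

    def __init__(self, name):
        self.name = name
        self.state = "RUNNING"

    def terminate(self):
        self.state = "TERMINATED"


@contextmanager
def transient_cluster(name, audit_log):
    """Create a cluster, hand it to the caller, and always terminate it."""
    cluster = FakeCluster(name)
    audit_log.append(f"created {name}")
    try:
        yield cluster
    finally:
        # Runs on success and on failure, so no instance is left running.
        cluster.terminate()
        audit_log.append(f"terminated {name}")


audit_log = []  # stand-in for logs persisted outside the cluster
try:
    with transient_cluster("demo", audit_log) as cluster:
        audit_log.append("job started")
        raise RuntimeError("simulated job failure")
except RuntimeError:
    audit_log.append("job failed")

print(cluster.state)
print(audit_log)
```

Because the log lives outside the cluster object, it remains available for troubleshooting after the cluster itself is gone.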
So why is Talend + Cloudera Altus good news for you?
- Talend and Cloudera Altus enable you to easily extend and automate big data and machine learning workloads to the cloud, as well as reduce costs (pay as you go)
- Developers can simply develop and test their jobs in Talend Studio and deploy directly to Altus, making it easy to build integration jobs quickly
- Data Scientists will love Altus and Talend, as the combination allows them to quickly spin up a cluster for ad-hoc projects with a pay-as-you-go model, from their desktop
To learn more, visit Talend’s Cloudera Altus Integration Solutions page and Cloudera’s Altus page.