Today, we’re excited to announce the latest innovation from the Cloudera and Informatica partnership. Companies are increasingly moving their data operations into the cloud. Both companies focus on helping customers derive business insights from vast amounts of data, and our new joint offering simplifies the use of cloud-native infrastructure for big data analytics.
Last May, Cloudera announced Cloudera Altus, a new platform-as-a-service (PaaS) offering in the cloud for big data analytics, backed by the enterprise-grade Cloudera platform. Today, we’re pleased to announce a collaborative integration between Informatica Big Data Management and the Cloudera Altus service for data engineers. This integrated solution will enable customers to easily deploy large-scale data workloads in the cloud with intuitive end user workflows and minimal cluster operational overhead.
Cloudera Altus is a PaaS offering that enables you to analyze and process large-scale data sets in the cloud. Altus provisions clusters quickly and manages clusters cost-effectively. The current Altus release focuses on allowing end users to easily run large-scale data engineering workloads on the Altus platform, using MR2, Hive, Spark, or Hive-on-Spark.
Informatica Big Data Management (BDM) provides the most advanced data integration platform for big data analytics.
With Big Data Management on Cloudera Altus, users can focus on building and running data pipelines without worrying about cluster management. For example, organizations that want better visibility into large amounts of data can use this approach to process and deliver data swiftly and reliably for data analytics. Implementing big data engineering and analytic workflows has never been easier.
Data Engineering in the Cloud with BDM and Cloudera Altus
Use Informatica Big Data Management and Cloudera Altus to quickly build and deploy data engineering workflows in the cloud on top of a data lake, so that teams spend less time on setup and more time processing and analyzing data.
The following illustration shows a typical big data analytics solution implementation using BDM on Altus:
Step 1. Offload infrequently used data from the enterprise data warehouse and load raw data in batches to a defined landing zone in Amazon S3. This frees up space in the enterprise data warehouse.
Step 2. Collect and stream data generated by machines and sensors, including application and weblog files, directly to Amazon S3. Note that staging the data in a temporary file system or the data warehouse is no longer required.
Step 3. Discover and profile data stored on Amazon S3. Profile the data to better understand its structure and context. Easily add requirements for enterprise accountability, control, and governance for compliance with corporate and governmental regulations and business service level agreements.
Step 4. Parse and prepare data from weblogs, application server logs, or sensor data. Typically, these data types are in multi-structured or unstructured format, which can be parsed to extract features and entities and to apply data quality techniques. This allows you to easily execute pre-built transformations as well as data quality and matching rules in Cloudera Altus to prepare data for analysis.
Step 5. After Cloudera Altus cleanses and transforms the data, it writes the high-value curated data to Amazon S3, where Apache Impala can access it directly for data analytics and Apache Spark can access it for data science workflows.
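As a rough illustration of the parsing and data quality work described in Step 4, here is a minimal Python sketch that turns weblog lines into structured records and rejects malformed ones. It assumes the logs follow the Apache Common Log Format, which the post does not specify, and the names parse_line and prepare are hypothetical. It is a stand-in for the idea only, not how Informatica BDM transformations are implemented.

```python
import re
from datetime import datetime
from typing import Dict, Iterable, List, Optional, Tuple

# Assumption: weblogs use the Apache Common Log Format, for example:
# 127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_line(line: str) -> Optional[Dict]:
    """Extract typed fields from one log line; return None if it does not match."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    record: Dict = match.groupdict()
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    # A "-" in the size field means no bytes were returned.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


def prepare(lines: Iterable[str]) -> Tuple[List[Dict], int]:
    """Parse every line and apply one quality rule: malformed lines are rejected and counted."""
    good: List[Dict] = []
    rejected = 0
    for line in lines:
        record = parse_line(line)
        if record is None:
            rejected += 1
        else:
            good.append(record)
    return good, rejected
```

Counting rejected lines rather than silently dropping them gives a simple quality metric for each batch, which is the kind of check a profiling step would surface.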
Prototyping a Data Engineering Solution in the Cloud
The prototype in this section illustrates how to deploy a data engineering solution using Cloudera Altus and Informatica Big Data Management. The example runs Cloudera Altus on Amazon Web Services and starts an on-demand Spark job with Altus.
Creating a Workflow
Create a workflow in Informatica Big Data Management to implement the data engineering pipeline. From within the workflow, you can create and terminate Altus clusters on demand. To create a cluster, specify the cluster configuration details, including the instance type and the number of cluster worker nodes.
Step 1. Informatica BDM creates an Altus cluster by using cluster configuration parameters specified by the user in the BDM workflow.
The following illustration shows an Altus cluster being created on the Altus cluster list page:
Step 2. Ingest data to Amazon S3.
Step 3. Prepare the data by cleansing and integrating with other datasets. This mapping task is fully integrated with Cloudera Altus and runs on the Altus cluster.
Step 4. Informatica BDM terminates the Altus cluster after the mapping finishes processing, which saves costs.
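The four steps above follow a create, run, terminate pattern. The Python sketch below shows that pattern with an in-memory stand-in for the cluster service. Every name in it (ClusterConfig, FakeClusterClient, ephemeral_cluster, run_workflow) is hypothetical and is not part of the Altus or Informatica BDM APIs. The only details taken from the post are that a cluster configuration includes an instance type and a worker count, and that the cluster is terminated once the mapping has run.

```python
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Callable, Iterator, List


@dataclass(frozen=True)
class ClusterConfig:
    """Hypothetical cluster settings: a name, an instance type, and a worker count."""
    name: str
    instance_type: str
    worker_count: int


@dataclass
class FakeClusterClient:
    """In-memory stand-in for a cluster service. This is NOT the Altus API."""
    running: List[str] = field(default_factory=list)
    log: List[str] = field(default_factory=list)

    def create(self, config: ClusterConfig) -> str:
        self.running.append(config.name)
        self.log.append(f"create {config.name}")
        return config.name

    def terminate(self, name: str) -> None:
        self.running.remove(name)
        self.log.append(f"terminate {name}")


@contextmanager
def ephemeral_cluster(client: FakeClusterClient, config: ClusterConfig) -> Iterator[str]:
    """Create a cluster, hand it to the caller, and always terminate it afterwards."""
    name = client.create(config)  # Step 1: create the cluster
    try:
        yield name
    finally:
        client.terminate(name)    # Step 4: terminate, even if a step failed


def run_workflow(
    client: FakeClusterClient,
    config: ClusterConfig,
    ingest: Callable[[], None],
    mapping: Callable[[str], None],
) -> None:
    """Run the ingest and mapping steps inside a short-lived cluster."""
    with ephemeral_cluster(client, config) as cluster:
        ingest()          # Step 2: ingest data
        mapping(cluster)  # Step 3: run the mapping on the cluster
```

Putting the terminate call in a finally block means a failed mapping does not leave a cluster running and accruing cost, which is the point of creating clusters on demand.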
Monitoring Spark Jobs
You can use the Informatica monitoring console to monitor Spark jobs that run on the Altus cluster. The following image shows Spark jobs running on Altus in the Informatica monitoring console:
The monitoring console provides a link to the Altus job history view, shown below, which lists all of the jobs that have run:
The following illustration is the job details page showing a completed Informatica Spark job on Altus:
As a strategic partner to Cloudera, Informatica is delighted to announce this new solution, which showcases Informatica Big Data Management working together with Cloudera Altus. Integrating Big Data Management with Altus will reduce the cost and complexity of managing Cloudera clusters in the cloud for data engineers and IT administrators alike.
To learn more, visit informatica.com/ready/big-data-ready