Learn how to use Cloudera to spin up Apache Hadoop clusters across multiple cloud providers to take advantage of competing prices and avoid infrastructure lock-in.
Why is a multi-cloud strategy important?
In the early days of Cloudera, it was a fair assumption that our software would be running on industry-standard servers that were purchased, owned, and operated by the client in their own data center. In the last few years, however, our clients have been increasingly turning to the cloud for agility, ease of scaling, faster time to provisioning, reduction in data-center footprint, and overall lower TCO of their applications – and this includes their Big Data platforms. As they begin to survey the cloud landscape, most begin by deciding which provider to select and trying to ascertain the costs of migration.
Most enterprises are choosing from among the three leading cloud providers: Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform. While all three vendors offer competitive pricing for cloud-based infrastructure, there are some small differences in their models. For example, Amazon offers spot pricing and Google offers preemptible instances, both of which can reduce the cost of quick, non-critical, or fault-tolerant workloads. On the other hand, Google and Microsoft charge by the minute while Amazon rounds up to the hour. The point is that it’s important to understand each vendor’s unique pricing model before making an infrastructure selection.
Our demo ran for approximately 20 minutes on each provider. Because Amazon bills by the hour while Google and Microsoft bill by the minute, Amazon turned out to be the most expensive. Given the short duration of the workload, spot or preemptible instances could also be used to greatly reduce the cost. This example shows that a cost analysis of your workload is needed to determine the most cost-effective infrastructure.
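To make the effect of the billing increment concrete, here is a minimal sketch of the arithmetic. The function name and the 1.00-per-hour rate are hypothetical and chosen only for illustration; real rates, minimum charges, and spot or preemptible discounts are deliberately ignored.

```python
import math


def billed_cost(runtime_minutes, hourly_rate, billing_increment_minutes):
    """Return the charge for one instance when usage is rounded up
    to the provider's billing increment (60 = hourly, 1 = per minute)."""
    increments = math.ceil(runtime_minutes / billing_increment_minutes)
    billed_minutes = increments * billing_increment_minutes
    return billed_minutes * hourly_rate / 60


if __name__ == "__main__":
    rate = 1.00  # hypothetical hourly rate, identical for both billing models
    hourly = billed_cost(20, rate, 60)      # rounded up to a full hour
    per_minute = billed_cost(20, rate, 1)   # billed for 20 minutes only
    print(f"hourly rounding: {hourly:.2f}, per-minute: {per_minute:.2f}")
```

At the same nominal rate, a 20-minute run billed in hourly increments is charged for 60 minutes, three times what per-minute billing would charge.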
Another consideration is the cloud provider’s reliability. How do you avoid becoming a victim of an error on the cloud provider’s part? On February 28th, 2017, Amazon S3 (which offers 99.999999999% durability [1]) was down for almost 4 hours [2]. If all your production data and applications were on AWS, that could have been a costly outage. How do you safeguard against such a disastrous scenario? Wouldn’t you feel safer knowing that within a few minutes you could move your production applications between different cloud infrastructure providers?
How to minimize cost and risk of selecting a cloud provider
Cloudera recognizes that tying yourself to only one cloud provider can be a costly commitment, and in some cases a conflict of interest. Without the flexibility to migrate workloads, a client is at the mercy of a given provider’s decision to raise prices or change its pricing model. Additionally, many of our clients have lines of business that intersect or even compete directly with some of the offerings from the cloud providers. Avoiding lock-in is key for these clients.
To ameliorate this risk, we have made every effort to ensure that running Cloudera in the cloud is a seamless and transparent experience, regardless of which provider you select. In most cases, we encourage clients to diversify their cloud investment across at least two of the three major platforms.
Cloudera Director Dashboard showing three clusters, one on each of the three leading cloud providers
Cloudera Director is the orchestration tool for provisioning instances and for deploying and managing Cloudera’s platform in any cloud environment. Cloudera Director provides reusable instance profiles and reusable cluster profiles via configuration files. One Cloudera Director instance can run in any cloud provider, or even on-premises, and launch a Cloudera cluster in any of the three leading cloud providers, provided it has the proper credentials for the relevant cloud environments. Cloudera Director provides a single pane of glass (that can be accessed programmatically as well) for cluster administrators to manage the lifecycle of long-lived or transient clusters.
Clients can leverage the capabilities of Cloudera Director combined with the infrastructure-agnostic design of CDH to spread their cloud deployment across all three of these platforms. Cloudera’s leading distribution, including Apache Hadoop, Apache Spark, and Apache Impala (incubating), remains consistent regardless of which environment is selected for the infrastructure. This means that any application executed on one cloud running Cloudera will behave the same when that code is executed on a different cloud, on-premises, or on a private cloud. As a result, the same job that you may be running today in your data center can be ported to any cloud infrastructure with minimal or zero additional development effort.
This removes the burden of cloud vendor lock-in and gives our clients the flexibility to select whichever provider is most appropriate for any big data application at the moment it needs to run. This is in stark contrast to the solutions offered by the respective cloud providers. It would require considerable manual effort (or, at the very least, be economically impractical) to take a job running on EMR, Kinesis, and Athena on AWS and port it to run on HDInsight on Azure. Once a client begins development on a cloud provider’s Hadoop distribution, there is little chance of moving that workload without significant cost and effort. Think of the effort of porting applications off the mainframe, but for the new generation!
Demonstration video of multi-cloud deployment
In the video below we have documented how to leverage one instance of Cloudera Director to deploy three different cloud clusters by using pre-built and saved templates. In this case, Cloudera Director is installed on an Azure instance. Each cluster runs the exact same code for a Hive ETL query. The only thing that changes from one run to the next is the Director configuration file used to launch the cluster, which tells Director how to talk to the respective cloud provider.
Example configuration file in AWS and GCP. The red sections are cloud provider-based information, such as connection details and instance types. The blue section is the cluster definition and stays identical – same components, same versions, same configurations.
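The split between provider-specific and shared sections can be sketched as follows. This is an illustration of the idea only: the dictionary keys, versions, regions, and instance types are hypothetical placeholders and do not reproduce Cloudera Director’s actual configuration schema, for which the sample templates remain the reference.

```python
import copy

# Shared cluster definition (the "blue" section): identical on every cloud.
# Component names and version are illustrative placeholders.
SHARED_CLUSTER = {
    "products": {"CDH": "5"},
    "services": ["HDFS", "YARN", "HIVE"],
}

# Provider-specific sections (the "red" sections): connection details and
# instance types. All keys and values here are hypothetical.
AWS_PROVIDER = {"type": "aws", "region": "us-east-1", "instance_type": "m4.xlarge"}
GCP_PROVIDER = {"type": "google", "region": "us-central1", "instance_type": "n1-standard-4"}


def build_config(provider_section, cluster_section):
    """Combine one provider section with the shared cluster definition.
    Deep copies keep each generated config independent of the templates."""
    return {
        "provider": copy.deepcopy(provider_section),
        "cluster": copy.deepcopy(cluster_section),
    }


if __name__ == "__main__":
    aws_config = build_config(AWS_PROVIDER, SHARED_CLUSTER)
    gcp_config = build_config(GCP_PROVIDER, SHARED_CLUSTER)
    # Only the provider section differs between the two deployments.
    print(aws_config["cluster"] == gcp_config["cluster"])
```

Keeping the cluster definition in one place means a change to components or versions is made once and picked up by every provider’s configuration.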
In this demo we also leverage Amazon S3 as the underlying object store for all three clouds. This means that only one copy of the data is stored (and paid for), while the cost savings come from running the compute on whichever provider is offering the most attractive price point at the time of execution. Note that any preferred type of storage can be used. We selected Amazon S3 for this demo for the sake of simplicity; in a production environment, however, a detailed cost analysis of compute and storage should be made to determine the best storage decision.
Logical diagram of the demo setup. The same hive_job.sh is launched on all three cloud providers. The Director instance orchestrates the deployments. Data is read from and written to an S3 bucket. The analyst can query the results with their favorite BI tool, leveraging an Impala analytic cluster.
Step-by-step instructions, as demonstrated in the video:
- Install Cloudera Director – in this demo we are using Azure to host the instance
- Create templates for each cloud provider – samples can be found at https://github.com/cloudera/director-scripts/tree/master/configs
- Leverage Cloudera Director’s API to launch a cluster using your templates
- Verify that the clusters are bootstrapping via the Director client interface
- Execute your workload on your cluster, without any code changes
- (OPTIONAL) Leverage a ‘dispatch’ script to fully automate cluster deployment, job execution, and cluster termination for transient workloads
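The optional ‘dispatch’ step above could be sketched roughly as follows. This is not the script used in the video. It assumes the Cloudera Director client’s bootstrap-remote and terminate-remote commands with the --lp.remote.hostAndPort option; authentication options are omitted, and the exact flags should be checked against your Director version. Running hive_job.sh over SSH on a gateway host is also an assumption, and all host names are hypothetical.

```python
import subprocess


def run_checked(cmd):
    """Default runner: execute a command and raise if it fails."""
    subprocess.run(list(cmd), check=True)


def dispatch(conf_path, director_host, gateway_host, run=run_checked):
    """Bootstrap a transient cluster, run the job, then always terminate.

    conf_path     -- Director configuration file for the chosen provider
    director_host -- host:port of the Director server (hypothetical value)
    gateway_host  -- host on which hive_job.sh is run (assumed access method)
    run           -- callable that executes one command; injectable for dry runs
    """
    remote = f"--lp.remote.hostAndPort={director_host}"
    # Assumed Director client commands; authentication flags are omitted.
    bootstrap = ["cloudera-director", "bootstrap-remote", conf_path, remote]
    terminate = ["cloudera-director", "terminate-remote", conf_path, remote]
    # Hypothetical way of launching the unchanged job script on the cluster.
    job = ["ssh", gateway_host, "bash", "hive_job.sh"]

    run(bootstrap)
    try:
        run(job)
    finally:
        # Terminate even if the job fails, so a transient cluster
        # does not keep accruing charges.
        run(terminate)


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    dispatch("aws.conf", "director.example.com:7189", "gateway.example.com", run=print)
```

Switching providers then amounts to passing a different configuration file, while the job command stays the same. The try/finally block is there so that a failed job does not leave a billable cluster running.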
At Cloudera we believe that our clients should have the ability to minimize their risk and cost for all projects. With Cloudera’s cloud offering it is now possible to run on the public cloud, on-premises, or on a private cloud – whichever infrastructure is most appropriate based on cost, needs, and use case. The client maintains control of their application, and ultimately has the leverage in negotiations with infrastructure providers. Lastly, Cloudera still brings the most integrated, tested, reliable, performant, secure, and easy-to-use distribution, along with top-of-the-line support, to the cloud, as it has on-premises for the last decade.
In this post, we’ve shown you how Cloudera Director can help your organization take advantage of price competition among the major cloud providers while running the same big data applications seamlessly in house or on the public cloud. Cloudera Altus, which was publicly announced on May 24, 2017, will facilitate workloads running in a Cloudera-managed platform, allowing even further price optimization for truly transient workloads. Please see Cloudera Altus posts such as Data Engineering with Cloudera Altus for further details on this offering.