There are many reasons to run a big data distribution, such as Cloudera's CDH and Hortonworks Data Platform (HDP), in the cloud with Infrastructure-as-a-Service (IaaS). The main reason is agility. When the business needs to onboard a new use case, a data admin can add virtual infrastructure to their cloud clusters in minutes or hours. With an on-prem cluster, it may take weeks or months to add the infrastructure capacity for that new use case.
Another key reason is isolation. Companies can create clusters for specific workloads, ensuring resources are available for those workloads to meet their SLAs. While this can be inefficient, since utilization may not be optimal, dedicating clusters to specific workloads does address the “noisy neighbor” problem.
But, as Arun Murthy pointed out in this blog post, big data distros were built for a time when a few key principles, such as co-location of compute and storage, were paramount (Gen-1). Here are some of the areas where this architecture falls short when running on cloud infrastructure.
- Elasticity: The ability to scale up and (very importantly) down as workload demands change. These distros can’t add and then release capacity automatically, so queries and workloads don’t get the performance needed to complete in a timely manner. And because they can only scale up, companies pay for peak demand even when average demand is much lower.
- Transience: The power to suspend a cluster or keep it active only for a short period of time. Compute demands can be transitory in nature, but the Gen-1 architecture can’t support this capability. With the cloud you pay only for what you use, but if your workloads are transient and your compute is not, you will waste money paying for compute that is under-utilized.
- Efficiency: These distros do not use cloud infrastructure cost-effectively, because they expect cluster resources to be “always-on”. This is fine for infrastructure you buy and operate, since increased utilization drives cost efficiency, but always-on infrastructure in the cloud is very expensive. In addition, running HDFS on block storage in the cloud is much more expensive than using object storage, which is a big motivator to move on from the Gen-1 architecture.
Some people have tried using a cluster architecture based on cloud principles (e.g., separation of compute and storage) for their data platform (Gen-2), but there are still capabilities missing, such as a layer that maintains metadata, security, and governance even when compute is transient.
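To make the separation of compute and storage concrete, here is a minimal sketch in PySpark. It assumes a Spark environment with the s3a connector configured; the bucket and paths are placeholders, not real locations.

```python
# Minimal sketch: the same Spark job can read from cluster-local HDFS (Gen-1)
# or from cloud object storage (Gen-2), which decouples compute from storage.
# Bucket names and paths are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("storage-separation-sketch").getOrCreate()

# Gen-1 style: data lives on HDFS, co-located with the cluster's own disks,
# so the cluster must stay up for the data to stay available.
events_hdfs = spark.read.parquet("hdfs:///data/events")

# Gen-2 style: data lives in object storage (here S3 via the s3a connector),
# so compute can be spun up, scaled, or torn down without losing the data.
events_s3 = spark.read.parquet("s3a://example-bucket/data/events")

# The downstream analysis is identical either way.
events_s3.groupBy("event_type").count().show()
```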
Join us for a webinar for more information on a better option (Gen-3) for big data workloads on cloud infrastructure. In this webinar, we cover some of the key differences of this next-generation data architecture, including:
- Analytic services that are easy to use and self-service, with elasticity to scale up and down automatically
- A consistent layer that provides security, governance, and metadata for all workloads and data, even when those workloads are transient
- Containers and Kubernetes that ensure every workload can be provisioned in isolation (see the sketch after this list)
- An integrated platform that connects streaming, analytics, and machine learning, dramatically accelerating the onboarding of new use cases
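As a rough illustration of the isolation point above, the sketch below carves out a dedicated namespace and resource quota for a single workload. It assumes the official `kubernetes` Python client and an already-configured kubeconfig; the namespace name and the CPU/memory limits are illustrative only.

```python
# Minimal sketch: per-workload isolation via a dedicated Kubernetes namespace
# and resource quota, using the official `kubernetes` Python client.
# The namespace name and the CPU/memory limits are illustrative placeholders.
from kubernetes import client, config

config.load_kube_config()  # reuse local kubeconfig credentials
core = client.CoreV1Api()

# Give the workload its own namespace...
namespace = "nightly-etl"
core.create_namespace(
    client.V1Namespace(metadata=client.V1ObjectMeta(name=namespace))
)

# ...and cap the CPU and memory it can claim, so a noisy workload cannot
# starve the others sharing the same cluster.
quota = client.V1ResourceQuota(
    metadata=client.V1ObjectMeta(name="nightly-etl-quota"),
    spec=client.V1ResourceQuotaSpec(
        hard={
            "requests.cpu": "16",
            "requests.memory": "64Gi",
            "limits.cpu": "32",
            "limits.memory": "128Gi",
        }
    ),
)
core.create_namespaced_resource_quota(namespace, quota)
```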
Get more details by attending the webinar – Enhance your CDH and HDP Clusters in the Cloud.
I think the main advantage of the cloud being discussed everywhere is scaling up and down on demand.
Having worked on Hadoop for more than 7 years, I have rarely seen use cases, apart from machine learning/data science ones, where demand can rapidly fluctuate. Most use cases are batch and analytical, requiring more or less the same compute in day-to-day runs. Also, about provisioning a cluster in hours on the cloud: it has never happened that a use case is discussed and a platform is needed within hours to execute it; a use case goes through a planning phase, which anyway takes a month or more.