The first thing that comes to mind when talking about synergy is how 2+2=5. Being the writer that he was, Mark Twain described it far more eloquently as “the bonus that is achieved when things work together harmoniously”. There are many product and business examples that illustrate the point, and I particularly like how car manufacturers can combine relatively small engines to do big things.
To provide supercar performance in a more environmentally friendly way for the i8, BMW stepped away from ever-bigger power plants. They paired the same 1.5-liter petrol engine you’ll find in a MINI (but tuned to 225bhp) with a 129bhp electric motor to achieve 0-62mph in 4.4 seconds and economy of up to 134.5mpg. The performance goes far beyond what either engine could have achieved individually.
The same opportunities exist when combining different analytics engines. In this blog post, we’ll look at the synergy that results when Amazon’s EMR is augmented with Cloudera Data Platform (CDP) Public Cloud to deliver lower overall TCO, increased efficiency, and improved security and governance, courtesy of CDP’s Shared Data Experience (SDX).
Challenges and impact
EMR provides a convenient way for organizations to run and scale Apache Spark, Hadoop, Hive, and other big data frameworks. Yet as use of EMR grows inside organizations, they come up against the following limiting factors:
- Without operational insight into workloads (e.g. query performance), it is difficult to diagnose poor performance and missed SLAs. Without a clear root cause, organizations have no alternative but to add more nodes, which increases cloud infrastructure costs.
- Multi-tenant data access and data privacy regulation demand strict security and governance. In order to add this level of sophistication to EMR, organizations would have to manually add Apache Atlas and Apache Ranger components. Companies, therefore, have to invest additional time and skills to configure and maintain these components or end up duplicating data and infrastructure to create individually secure silos.
- Both previous points result in high TCO. In an attempt to keep costs down, organizations end up restricting EMR to single-stage, non-secure data analytics. Some emulate multi-tenancy through a proliferation of single-tenant clusters with copied data, while others have no alternative but to take additional risks with insecure clusters or inadequate policies.
But what if those queries and workloads that are suffering from low performance could be improved?
Augmenting EMR with CDP
Just as BMW combined two very different engines in the i8 to deliver capabilities that far exceed what either could achieve alone, CDP augments EMR to deliver improved combined performance at a lower overall cost.
With Workload Manager, CDP gives organizations insight into how EMR queries and workloads perform, delivering complete visibility for performance tuning and migration. In fact, moving workloads between the two engines becomes a mere formality. Gone is the need to replicate data, since EMR and CDP can both work with the same S3 object store buckets. Metadata is easily shared with CDP through AWS Glue or the Hive Metastore, leaving just the queries or workloads to be moved to CDP.
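To make the shared-storage point a little more concrete, here is a minimal Python sketch, assuming the AWS Glue Data Catalog is the shared metastore and that the boto3 library and AWS credentials are available. It is not part of Cloudera's or Amazon's tooling for this integration; the helper names `s3_locations` and `list_glue_tables` are hypothetical and only illustrate how one might check which catalog tables already point at S3, and so could be queried in place rather than copied.

```python
from typing import Dict, Iterable, List


def s3_locations(tables: Iterable[dict]) -> Dict[str, str]:
    """Map table name -> storage location for tables whose data lives in S3.

    Expects table definitions shaped like the entries of "TableList" in a
    Glue get_tables response (keys "Name" and "StorageDescriptor").
    """
    locations = {}
    for table in tables:
        location = table.get("StorageDescriptor", {}).get("Location", "")
        # Only S3-backed tables can be read in place by both engines.
        if location.startswith(("s3://", "s3a://", "s3n://")):
            locations[table["Name"]] = location
    return locations


def list_glue_tables(database: str) -> List[dict]:
    """Fetch all table definitions for one Glue database (needs AWS credentials)."""
    import boto3  # imported lazily so s3_locations works without boto3 installed

    glue = boto3.client("glue")
    tables: List[dict] = []
    # get_tables is paginated, so walk every page of results.
    for page in glue.get_paginator("get_tables").paginate(DatabaseName=database):
        tables.extend(page["TableList"])
    return tables
```

Calling `s3_locations(list_glue_tables("sales_db"))`, where `sales_db` is a made-up database name, would return the tables whose data already sits in S3. Tables stored elsewhere are filtered out, since the no-replication argument above applies only to data in shared S3 buckets.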
There are further benefits to be had from CDP. Workload Manager’s insight also drives recommendations on how queries may be improved for more efficient use of cloud infrastructure, an effect that is further amplified through CDP’s more current and faster processing engines. CDP’s SDX virtually eliminates the operational impact of providing enterprise-grade security and governance for all deployments. This, in turn, further reduces infrastructure costs through the elimination of data duplication and one-cluster-one-policy approaches.
To find out more and see how you can augment EMR with CDP for overall lower TCO, improved combined performance, and safe and secure multi-tenancy, register for our Optimize AWS EMR with Cloudera Data Platform Webinar.