Over the past several years, data leaders have asked where they should keep their data and what architecture they should implement to serve an incredible breadth of analytic use cases. Vendors with proprietary formats and query engines made their pitches, the market listened, and data leaders made their decisions.
The most interesting thing about their choices is that, despite the millions of marketing dollars vendors spent trying to convince customers that they built the next greatest data platform, there has been no clear winner.
Many companies adopted the public cloud, but very few organizations will ever move everything to the cloud, or to a single cloud. The future for most data teams will be multi-cloud and hybrid. And although there is clear momentum behind the data lakehouse as the ideal architecture for multi-function analytics, the demand for open table formats including Apache Iceberg is a clear signal that data leaders value interoperability and engine freedom. It no longer matters where the data is. What matters is how we understand it and make it available to share and use.
The direction is clear. Proprietary formats and vendor lock-in are a thing of the past. Open data is the future. And for that future to be a reality, data teams must shift their attention to metadata, the new turf war for data.
The need for unified metadata
While open and distributed architectures offer many benefits, they come with their own set of challenges. As companies seek to deliver a unified view of their entire data estate for analytics and AI, data teams are under pressure to:
- Make data easily consumable, discoverable, and useful to a wide range of technical and non-technical data consumers
- Improve the accuracy, consistency, and quality of data
- Ensure the efficient querying of data, including high availability, high performance, and interoperability with multiple execution engines
- Apply consistent security and governance policies across their architecture
- Achieve high performance while managing costs
The answer to unifying the data has traditionally been to move or copy data from one source or system to another. The problem with that approach is that data copies and data movement actually undermine all five of the points above, increasing costs while making it more difficult to manage and trust the data as well as the insights derived from it.
This leads us to a new frontier of data management, which is especially critical for teams managing distributed architectures. Unifying the data isn’t enough. Data teams actually need to unify the metadata.
There are two types of metadata, and they both serve critical functions within the data lifecycle:
Operational metadata supports the data team’s goals of securing, governing, processing, and exposing the data to the right data consumers while also keeping queries against that data performant. Data teams manage this metadata with a metastore.
Business metadata supports data consumers who want to discover and leverage that data for a broad range of analytics. It provides context so users can easily find, access, and analyze the data they’re looking for. Data teams manage this metadata with a data catalog.
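To make the distinction concrete, here is a minimal sketch of what a unified record for one dataset might look like. The class names, fields, and the `find_by_tag` helper are illustrative assumptions, not the schema of any particular metastore or data catalog.

```python
from dataclasses import dataclass, field


@dataclass
class OperationalMetadata:
    # System-generated facts used for governance, processing, and query planning.
    schema: dict[str, str]          # column name -> type
    partition_columns: list[str]
    location: str                   # hypothetical storage path
    access_policies: list[str] = field(default_factory=list)


@dataclass
class BusinessMetadata:
    # Context supplied by domain experts so consumers can discover the data.
    description: str
    owner: str
    tags: set[str] = field(default_factory=set)


@dataclass
class DatasetEntry:
    # One unified record per dataset that joins both kinds of metadata.
    name: str
    operational: OperationalMetadata
    business: BusinessMetadata


def find_by_tag(entries: list[DatasetEntry], tag: str) -> list[str]:
    """Return the names of datasets whose business tags include `tag`."""
    return [e.name for e in entries if tag in e.business.tags]


orders = DatasetEntry(
    name="orders",
    operational=OperationalMetadata(
        schema={"order_id": "long", "order_date": "date"},
        partition_columns=["order_date"],
        location="s3://example-bucket/warehouse/orders",  # hypothetical
    ),
    business=BusinessMetadata(
        description="Customer orders, one row per order",
        owner="finance-data-team",
        tags={"finance", "pii"},
    ),
)
clicks = DatasetEntry(
    name="clicks",
    operational=OperationalMetadata(
        schema={"session_id": "string", "ts": "timestamp"},
        partition_columns=["ts"],
        location="s3://example-bucket/warehouse/clicks",  # hypothetical
    ),
    business=BusinessMetadata(
        description="Web clickstream events",
        owner="marketing-analytics",
        tags={"marketing"},
    ),
)
```

Keeping both kinds of metadata on one record means a discovery query such as `find_by_tag([orders, clicks], "finance")` and a governance check against `access_policies` read from the same source of truth.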
Many solutions manage at least one of these types of metadata well. A few solutions manage both. However, there are very few platforms that can unify and manage business and operational metadata from on-premises and cloud environments as well as metadata from multiple disparate tools and systems. Additionally, almost none of the available tools do all of that and also provide the automation required to scale these solutions for enterprise environments.
Cloudera is built on open metadata
Cloudera’s open data lakehouse is built on Apache Iceberg, which makes it easy to manage operational metadata. Iceberg maintains table metadata in files stored alongside the data itself, which reduces the need for costly metastore lookups and directory listings during query planning and simplifies formerly complex data management tasks like partition and schema evolution. With Cloudera’s open data lakehouse, data teams store and manage a single physical copy of their data, eliminating additional data movement and data copies and ensuring a consistent and accurate view of their data for every data consumer and analytic use case.
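To see why table-level metadata helps query planning, consider a deliberately simplified sketch. It assumes each data file is recorded with the minimum and maximum value of one column, so a planner can skip files that cannot match a filter without opening them. The `DataFileEntry` class and `plan_scan` function are hypothetical and do not reflect Iceberg's actual on-disk format or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFileEntry:
    # Simplified stand-in for a per-file record kept in table metadata.
    path: str
    lower: int  # smallest value of the filtered column in this file
    upper: int  # largest value of the filtered column in this file


def plan_scan(entries, lo, hi):
    """Keep only files whose [lower, upper] range overlaps the filter [lo, hi]."""
    return [e.path for e in entries if e.upper >= lo and e.lower <= hi]


# Hypothetical files, with order dates encoded as YYYYMMDD integers.
table_files = [
    DataFileEntry("data/2024-01.parquet", 20240101, 20240131),
    DataFileEntry("data/2024-02.parquet", 20240201, 20240229),
    DataFileEntry("data/2024-03.parquet", 20240301, 20240331),
]

# Only files that can contain rows from 15 Feb to 10 Mar are selected.
selected = plan_scan(table_files, 20240215, 20240310)
```

Because the pruning decision uses only metadata, any engine that can read the same table metadata can make the same decision.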
Cloudera also supports the REST catalog specification for Iceberg, ensuring that table metadata is always open and easily accessible by third-party execution engines and tools. While a lot of vendors are focused on locking in metadata, Cloudera remains cloud- and tool-agnostic to ensure customers continue to have the freedom to choose.
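As a rough illustration of what that openness means in practice, the Iceberg REST catalog specification defines a load-table route of the form `/v1/{prefix}/namespaces/{namespace}/tables/{table}`, with the levels of a multi-part namespace joined by the unit separator character (`%1F` once URL-encoded). The sketch below only builds that URL. The helper name and the catalog host are assumptions, and a real client would also handle authentication and the catalog's configuration endpoint.

```python
from urllib.parse import quote


def load_table_url(base_url, namespace, table, prefix=None):
    """Build the load-table URL for an Iceberg REST catalog.

    `namespace` is a list of levels, for example ["sales", "emea"].
    `prefix` is optional and, when used, is supplied by the catalog.
    """
    # Namespace levels are joined with the 0x1F unit separator, then encoded.
    encoded_namespace = quote("\x1f".join(namespace), safe="")
    parts = [base_url.rstrip("/"), "v1"]
    if prefix:
        parts.append(quote(prefix, safe=""))
    parts += ["namespaces", encoded_namespace, "tables", quote(table, safe="")]
    return "/".join(parts)


# catalog.example.com is a placeholder host, not a real endpoint.
url = load_table_url("https://catalog.example.com/", ["sales", "emea"], "orders")
```

Any engine that speaks this specification can resolve the same table from the same URL, which is what keeps the metadata portable across tools.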
Cloudera is also working on accessing and tracking metadata outside of the Cloudera ecosystem, so data teams will have visibility across their entire data estate, including data stored in a variety of other platforms and solutions.
Automating business metadata is the key to achieving scale
While operational metadata is often generated by a system and maintained within Iceberg tables, business metadata is often generated by domain experts or data teams. In an enterprise environment, which often features hundreds or even thousands of data sources, files, and tables, scaling the human effort required to ensure these datasets are easily discoverable is impossible.
Cloudera’s vision is to augment the data catalog experience and remove the manual effort of generating business metadata. Customers will be able to leverage Generative AI to ensure that every dataset is properly tagged and classified, and is easily discoverable. With an automated business metadata solution, data consumers and data teams can easily find the data they’re looking for, even with huge catalogs, and no dataset will fall through the cracks.
Unified security and governance
Data teams strive to balance the need for broad access to data for every data consumer with centralized security and governance. That task becomes much more complicated in distributed environments, and in situations where the data moves from its source to another destination.
Cloudera Shared Data Experience (SDX) is an integrated set of security and governance technologies for tracking metadata across distributed environments. It ensures that access control and security policies that are set once still apply wherever and however that data is accessed, so data teams know that only the right data consumers have access to the right datasets, and the most sensitive data is protected. Compared with decentralized and siloed data systems, a centralized and trusted security management layer makes it easier to democratize data with the confidence that nobody will have unauthorized access to it. From a governance perspective, data teams have control over and visibility into the health of their data pipelines, the quality of their data products, and the performance of their execution engines.
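The "set once, apply everywhere" idea can be sketched as a single policy store that every engine consults before returning data. This is an illustrative model only. `PolicyStore`, `run_query`, the tag names, and the roles are hypothetical and are not the SDX API.

```python
class PolicyStore:
    """Central place where tag-based access policies are defined once."""

    def __init__(self):
        self._allowed_roles = {}  # classification tag -> set of roles

    def set_policy(self, tag, roles):
        self._allowed_roles[tag] = set(roles)

    def is_allowed(self, role, dataset_tags):
        # The role must be permitted by every policy attached to the dataset's tags.
        return all(
            role in self._allowed_roles[tag]
            for tag in dataset_tags
            if tag in self._allowed_roles
        )


def run_query(engine, role, dataset_tags, store):
    """Every engine performs the same check against the shared store."""
    if not store.is_allowed(role, dataset_tags):
        raise PermissionError(f"{role} may not read this dataset via {engine}")
    return f"{engine}: query allowed for {role}"


store = PolicyStore()
store.set_policy("pii", ["compliance_analyst"])  # defined once

# The decision is identical no matter which engine asks.
results = [
    run_query(engine, "compliance_analyst", {"pii", "finance"}, store)
    for engine in ("sql_engine", "spark_job")
]
```

Because the engines hold no policy logic of their own, changing a rule in the store changes the outcome everywhere at once.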
The metadata turf wars have just begun
As data teams adopt hybrid, distributed data architectures, managing metadata is critical to providing a unified self-service view of the data, to delivering analytic insights that data consumers trust, and to ensuring security and governance across the entire data estate.
Chief Data Analytics Officers can take some important lessons from the data wars onto this new battlefield:
- Choose open metadata: Don’t lock your metadata into a single solution or platform. Iceberg is a great tool for ensuring openness and interoperability with a large commercial and open source software ecosystem.
- Unify metadata management: Invest in a metadata management solution that unifies operational and business metadata across all environments and systems, even third-party tools and platforms.
- Automate for scale: Leverage automation to handle the scale and complexity of creating and managing metadata in large, distributed environments.
- Centralize security and governance: Ensure that security and governance policies are consistently applied and enforced across the entire data landscape to protect sensitive data and ensure the health and performance of your data estate.
These are the guiding principles of Cloudera’s metadata management solutions, and why Cloudera is uniquely positioned to support an open metadata strategy across distributed enterprise environments.
Learn more about Cloudera’s metadata management solutions.