We are now well into 2022, and the megatrends that drove the last decade in data (the Apache Software Foundation as a primary innovation vehicle for big data, the arrival of cloud computing, and the debut of cheap distributed storage) have converged and offer clear patterns of competitive advantage for vendors and value for customers. Cloudera has been parlaying those patterns into clear wins for the community at large and, more importantly, delivering the benefits of that innovation to our customers.
At Cloudera, we have had the benefit of an early start, and as a result we have customers with large-scale deployments of mission-critical applications that have been in production for a number of years. We believe that, as one of the earliest pioneers of industrial-strength open source software, we have had the opportunity and the experience to help accelerate some very fundamental shifts in open source development.
What will we see in the decade ahead? Let’s discuss.
Open source in the next decade
Open source started out as a solution by developers to solve problems for other developers. Today, open source is widely recognized as a premier source for new innovations, and you can find its fingerprints in every company around the world.
As I look ahead to the next decade of transformation, I see innovation in open source accelerating along three dimensions: project, architectural, and system. This represents the next step in the industrialization of open source innovation for data management and data analytics.
Project innovation in data management engines, storage engines, ML engines, data formats, table formats, and workload orchestration engines was and is foundational to the open source movement. These are innovations by developers, for developers, and as adoption of OSS projects has grown, innovation at the project level has accelerated sharply.
Architectural innovation was the second wave of evolution. As project-level innovators proved their expertise in providing solutions to point problems, the need opened up for building best-in-class solutions that offer interoperability, security, and governance across the entire lifetime of data, both on-prem and in the cloud. We see this process gathering steam in the way projects like Apache Iceberg have evolved.
System innovation is the next evolutionary step for open source. As businesses see the value of using open source to run their companies, innovators are forced to consider capabilities such as backwards compatibility, upgrades, and infosec compliance as part of the package. The next decade will establish system innovation, what we all know as enterprise readiness, as one of the core tenets of open source development.
The project-level innovation that brought forth products like Apache Hadoop, Apache Spark, and Apache Kafka is engineering at its finest. Developers working in different companies banded together to form the communities that fostered and drove innovation, whether it was in data formats, table formats, querying engines, or running ETL workloads for the vast amounts of data that could be landed in HDFS. This innovation was anchored in a handful of “seed” use cases that sparked the creation of these projects. Built in a meritocratic society where committership (the license to commit code) was the ticket to the inner sanctum of innovation, these projects delivered enough variety and differentiation that, even with the challenges of adopting these products for industrial-scale applications, the value provided made it worth the effort. Today we see a number of new innovative projects solving different aspects of the big data ecosystem, including ones that Cloudera brought to life and has been championing very successfully, like Apache Ozone and Apache YuniKorn.
As events such as the zero-day Log4j exploit showed, communities need to lean in on securing the open source supply chain that powers these projects. Communities must ensure that the hundreds of essential libraries are free of CVEs, and that obsolete ones are dropped as a natural course of product evolution. One of the most critical decisions on any open source project going forward should be the decision to introduce a reputable third-party dependency into the product.
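To make the supply chain point concrete, here is a minimal sketch of the kind of check a community could run against its dependency list. It is an illustration under stated assumptions, not a real scanner: the `Advisory` record, the `audit` function, and the `example-lib` package are hypothetical names, versions are assumed to be purely numeric and dotted, and a real project would pull advisories from a maintained vulnerability database rather than a hand-written list.

```python
from dataclasses import dataclass


def parse_version(text: str) -> tuple[int, ...]:
    """Turn '2.14.1' into (2, 14, 1). Assumes purely numeric dotted versions."""
    return tuple(int(part) for part in text.split("."))


@dataclass(frozen=True)
class Advisory:
    """Hypothetical advisory record: one vulnerability in one package."""
    advisory_id: str
    package: str
    fixed_in: str  # first release that contains the fix


def audit(dependencies: dict[str, str], advisories: list[Advisory]) -> list[str]:
    """Report every dependency that is older than the fixed release of an advisory."""
    findings = []
    for adv in advisories:
        installed = dependencies.get(adv.package)
        if installed is None:
            continue  # the project does not use this package
        if parse_version(installed) < parse_version(adv.fixed_in):
            findings.append(
                f"{adv.package} {installed}: {adv.advisory_id}, upgrade to {adv.fixed_in}+"
            )
    return findings


# Illustrative input: the Log4j advisory is simplified to a single "fixed in" release.
advisories = [Advisory("CVE-2021-44228", "log4j-core", "2.15.0")]
dependencies = {"log4j-core": "2.14.1", "example-lib": "1.2.0"}

for finding in audit(dependencies, advisories):
    print(finding)
```

Running a check like this on every build, and treating each new dependency as a reviewed decision, is one simple way to keep obsolete or vulnerable libraries from lingering in a project.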
Architectural innovation is the use of open source as a vehicle for bringing standards and interoperability across independent products in order to further adoption, give companies more options, and facilitate continuous innovation. The ultimate goal of this exercise is to reduce inter-engine complexity and decrease TCO for practitioners and enterprises. This is a critical part of value creation that OSS communities will be called on to deliver consistently.
In the past, Cloudera has taken the lead in delivering innovations such as Parquet and ORC to build interoperability across systems. We’ve also seen products such as Apache Ranger and Apache Atlas being adopted as industry standards for security and governance. More recently, industry leaders have collaborated in furthering the adoption of Apache Iceberg as an industry standard for big data, adding support for it in engines such as Hive and Impala. We expect to drive convergence across a broad swathe of the community on capabilities that will essentially turn Apache Iceberg into the de facto table format for SQL workloads, both in the cloud and on-prem.
A recent example of architectural innovation in open source is the ability to use 100% open source components to build an open data lakehouse that is both secure and governed. This is extremely liberating for enterprises, which are then able to leverage different enterprise solutions based on this architecture.
Reducing time to value for enterprises, regardless of whether they are on-prem or in the cloud, is *the* value proposition for the ultimate IT buyer, the CIO. This is where system innovation steps in. Building products that have very clear and stable API contracts will allow third-party products to certify once, run anywhere, and address any backwards compatibility concerns. System innovation is about collaborating across projects and securing the open source supply chain so that the system as a whole is secure from the get-go and can be remediated completely and easily.
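One way to picture a stable API contract is as something a release process can check mechanically. The sketch below is a simplification under assumptions of my own: a contract is modeled as a mapping from operation name to the set of parameters a caller must supply, and the operation names (`read_table`, `write_table`, `list_tables`) and the `breaking_changes` helper are hypothetical, not taken from any real product.

```python
def breaking_changes(
    old: dict[str, set[str]], new: dict[str, set[str]]
) -> list[str]:
    """List changes in `new` that would break a client certified against `old`.

    A change is treated as breaking when an operation disappears or when it
    starts requiring parameters that existing callers do not send.
    """
    problems = []
    for operation, old_required in sorted(old.items()):
        if operation not in new:
            problems.append(f"removed operation: {operation}")
            continue
        added = new[operation] - old_required
        if added:
            problems.append(f"{operation}: new required parameters {sorted(added)}")
    return problems


# Hypothetical contracts for two releases of the same API.
v1 = {"read_table": {"name"}, "write_table": {"name", "rows"}}
v2 = {"read_table": {"name", "snapshot"}, "list_tables": set()}

for problem in breaking_changes(v1, v2):
    print(problem)
```

New operations such as `list_tables` are not reported, since adding them does not affect a client that was certified against the earlier release. A check of this shape is what lets a third-party product certify once and keep running across upgrades.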
An example of system innovation is the way the industry is approaching data mesh. To move data mesh beyond a buzzword, attention must move to the fundamental primitive that drives data meshes, i.e. the data set. It will take multiple open source projects to help define, curate, maintain, and provide secure access to a data set over its lifetime. This is an area where Cloudera has significant expertise and perspective to contribute to the open source community. We’re trusted by the world’s largest and most highly regulated companies and that expertise is a massive benefit as we evolve into a system innovation world.
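As a rough illustration of the data set as a primitive, the sketch below bundles definition, curation, and access control into one record. Everything here is hypothetical: the `DataSet` class, its fields, and the rule that only curated data sets are readable are assumptions made for the example, and in practice these responsibilities would be spread across several open source projects.

```python
from dataclasses import dataclass, field


@dataclass
class DataSet:
    """Hypothetical data set descriptor covering definition, curation, and access."""
    name: str
    owner: str                      # team accountable for the data set
    schema: dict[str, str]          # column name -> type (definition)
    reader_roles: set[str] = field(default_factory=set)  # secure access
    curated: bool = False           # set once quality checks have passed

    def can_read(self, role: str) -> bool:
        """Assumed policy: only curated data sets are readable, and only by granted roles."""
        return self.curated and role in self.reader_roles


orders = DataSet(
    name="orders",
    owner="sales-team",
    schema={"order_id": "bigint", "amount": "decimal"},
    reader_roles={"analyst"},
)
print(orders.can_read("analyst"))  # not yet curated
orders.curated = True
print(orders.can_read("analyst"))  # curated and role granted
```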
Competing in the new decade
For customers, open source facilitates industry-wide collaboration for continuous data innovation. Having seen the benefits of that, enterprises are unlikely to reward platforms that are closed source or quasi-closed source, performance-hobbled or ecosystem-hobbled, or built by a single vendor without a broad base of committers. Software enterprises that can harness multiple open source systems to deliver solutions that are hybrid, multi-cloud, and offer the most choice to customers will definitely have a continuous innovation advantage. As a wise stock trader once told me, “I think that the technology arms race is all about executing a faster trade. I have to play that game, but ultimately I want to create value because I executed a better trade fast.” Enterprises want to spend more time solving their business problems and less time worrying about the innards of the product, and vendors that address that need will be rewarded for their execution.
The last decade was an exciting time in software development. Software truly started to eat the world, and digital transformation changed industries big and small and created new winners and losers. The next decade promises to be even more exciting as open source software development gets industrialized on a mega scale with the advent of system innovation. Cloudera taught the world the value of big data and is using that expertise to be at the forefront of the next wave, leading a new generation of open source innovators on their bold adventures.