Using SQL to democratize streaming data

Streaming analytics is crucial to modern business – it opens up new product opportunities and creates massive operational efficiencies. In many cases, it’s the difference between creating an outstanding customer experience and a poor one – or losing the customer altogether.

However, in the typical enterprise, only a small team has the core skills needed to access streams of data and create value from them. This data engineering skillset typically consists of Java or Scala programming skills paired with deep DevOps acumen. A rare breed. The result is that streaming data tends to be “locked away” from everyone but a select few, and the data engineering team is overworked and backlogged.

Contrast that with the skills honed over decades of using structured query language (SQL) to access data, build data warehouses, perform ETL, and create reports and applications. The declarative nature of SQL makes it a powerful paradigm for getting data to the people who need it. In the typical enterprise, there are significantly more SQL skills than programming skills. It’s also worth noting that even those with Java skills will often prefer to work with SQL – if for no other reason than to share the workload with others in their organization who only know SQL. Many of the existing visual business intelligence and dashboard tools also use SQL as a standard language.

What do you mean by democratizing?

Democratizing data refers to providing a self-serve paradigm and culture that lets an ever-growing internal audience get the data they need to add value to the business. They no longer need to ask a small subset of the organization to provide them with information; rather, they have the tooling, systems, and capabilities to get the data themselves. Data democratization has been a topic of conversation for the last few years – but mostly centered around data warehousing and data lakes.

In the recent past, streaming data has started to become democratized as well, with self-service platforms encouraging users to gain access to data streams and make use of them. Large enterprises have built internal tooling and platforms that allow users to create stream processors and build streaming data applications. But these platforms generally offer only low-level language support and are proprietary – built with a specific organization in mind.

The difficulty with querying streams

Streaming data systems are a relatively new addition to enterprise data systems, and have evolved to fill business-critical roles. Thus, it’s no surprise in this era of rapid development that tooling for streaming systems hasn’t yet matured the way it has for more traditional batch systems. One large gap is the ability to inspect data from streams – it turns out it’s tricky to understand what data is in an ever-mutating stream, then filter, aggregate, and process the stream as you would in traditional systems. Today, that task is typically left to expert Java programmers. Many times, users are left to push the stream of data into a traditional database, data lake, or data warehouse just to perform these simple computations. This wastes valuable time, increases costs, and creates a time-to-information problem.

Worse, the tooling that does exist is extremely limited – missing important features that the enterprise has come to rely on, such as real-time feedback, schema management, rich grammar, user-defined functions, production-quality deployment and management frameworks, and more. These limitations explain why streaming data has not been democratized the way traditional database systems have.

So without the ability to easily query streams of data, how can organizations hope to democratize streams of data?

SQL as the democratization enabler

Structured Query Language (SQL) has enjoyed half a century of dominance as the de facto interface to data. For good reason – it’s easy to use, mature, powerful, and completely ubiquitous.

But even as data streaming technologies like Apache Kafka and Apache Flink have evolved, only recently have SQL interfaces become deeply integrated with them. Part of the problem is that in order to query streams of data, SQL itself had to evolve. Continuous SQL, or Streaming SQL, contains grammar and functions that give users rich control over time (imperative for streams), as well as aggregation and filtering. Materializing data into views (materialized views) has become an excellent mechanism for interfacing with an entire ecosystem of existing tooling – from dashboarding programs to notebooks for ML or AI to analytics applications. The rate of innovation in this area is high – truly the tip of the spear in computer science.
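To make the ideas of time control, filtering, aggregation, and materialized views a little more concrete, here is a minimal Python sketch of what a continuous query computes over a stream. It is an illustration only, not how any particular engine is implemented. The `Event` fields, the function names, and the SQL shown in the comment are all assumptions made for the example; actual streaming SQL syntax varies by engine.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, NamedTuple, Tuple


class Event(NamedTuple):
    # Hypothetical event shape, invented for this illustration.
    ts: int        # event time, in seconds
    user: str
    amount: float


def tumbling_sum(
    events: Iterable[Event], window_s: int, min_amount: float
) -> Iterator[Tuple[int, Dict[str, float]]]:
    """Yield (window_start, {user: total}) each time a window closes.

    Roughly the shape of a streaming SQL query such as (syntax assumed,
    it differs between engines):

        SELECT user, SUM(amount) FROM events
        WHERE amount >= :min_amount
        GROUP BY user, TUMBLE(ts, INTERVAL '1' MINUTE)

    Assumes events arrive in event-time order. Real engines deal with
    late and out-of-order data (e.g. with watermarks); this sketch does not.
    """
    current_start = None
    totals: Dict[str, float] = defaultdict(float)
    for ev in events:
        start = ev.ts - ev.ts % window_s      # window this event falls into
        if current_start is None:
            current_start = start
        if start != current_start:            # previous window is complete
            yield current_start, dict(totals)
            current_start, totals = start, defaultdict(float)
        if ev.amount >= min_amount:           # filtering (WHERE)
            totals[ev.user] += ev.amount      # aggregation (SUM ... GROUP BY)
    if current_start is not None:             # flush when a finite input ends
        yield current_start, dict(totals)


def materialize(
    results: Iterable[Tuple[int, Dict[str, float]]]
) -> Dict[Tuple[int, str], float]:
    """Keep the latest total per (window_start, user), like a small
    materialized view that dashboards or notebooks could read from."""
    view: Dict[Tuple[int, str], float] = {}
    for window_start, totals in results:
        for user, total in totals.items():
            view[(window_start, user)] = total
    return view


events = [
    Event(0, "a", 5.0),
    Event(30, "b", 1.0),    # dropped by the filter below
    Event(59, "a", 2.5),
    Event(61, "a", 10.0),   # lands in the next one-minute window
]
view = materialize(tumbling_sum(events, window_s=60, min_amount=2.0))
print(view)
```

The appeal of streaming SQL is that the user writes only the declarative query, while the engine takes care of the windowing, state, and incremental updates that this sketch spells out by hand.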

As the ecosystem grows, so does its ability to become the great enabler for democratization. Solutions need to expertly weave the ease of use of SQL with powerful back-end processing and scaling capabilities – and do it within an enterprise-wide governance, lineage, and scalability platform. Data engineering teams need to feel secure in knowing that users get the data they need while enterprise security, regulatory compliance, and overall scalability aren’t compromised in the process. Ultimately these capabilities form the solution for end-user self-service and the holy grail of streaming data democratization.

At Cloudera, we are laser-focused on doing just that – building a robust streaming platform that treats anyone in the organization who knows SQL as a first-class citizen with rich and robust self-service capabilities. It is with this same intent that we made a strategic acquisition of Eventador last October. We are actively integrating those capabilities into our Cloudera Data Platform. For more information, check out what we have been up to and what we are up to now, here.

Kenny Gorman

Product Owner - Stream Processing, Cloudera Inc.
