Apache Impala and Apache Kudu make a great combination for real-time analytics on streaming data for time series and real-time data warehousing use cases. More than 200 Cloudera customers have implemented Apache Kudu with Apache Spark for ingestion and Apache Impala for real-time BI use cases successfully over the last decade, with thousands of nodes […]
At the end of March, we released the first version of Cloudera SQL Stream Builder as part of Cloudera Streaming Analytics 1.3. It enabled users to easily write, run and manage real-time SQL queries on streams from Apache Kafka with an exceptionally smooth user experience. Since then, we have been working hard to expose the […]
When Kudu was first introduced as a part of CDH in 2017, it didn’t support any kind of authorization so only air-gapped and non-secure use cases were satisfied. Coarse-grained authorization was added along with authentication in CDH 5.11 (Kudu 1.3.0) which made it possible to restrict access only to Apache Impala where Apache Sentry policies […]
Introduction In database systems one of the most effective ways to improve performance is to avoid doing unnecessary work, such as network transfers and reading data from disk. One of the ways Apache Kudu achieves this is by supporting column predicates with scanners. Pushing down column predicate filters to Kudu allows for optimized execution by […]
This is part of our series of blog posts on recent enhancements to Impala. The entire collection is available here. Apache Impala is synonymous with high-performance processing of extremely large datasets, but what if our data isn’t huge? What if our queries are very selective? The reality is that data warehousing contains a large variety […]
Users today are asking ever more from their data warehouse. This is resulting in advancements of what is provided by the technology, and a resulting shift in the art of the possible. As an example of this, in this post we look at Real Time Data Warehousing (RTDW), which is a category of use cases […]
Editor’s Note, August 2020: CDP Data Center is now called CDP Private Cloud Base. You can learn more about it here. Cloudera Data Platform (CDP) Data Center(DC) is the on-premises release of Cloudera Data Platform. CDP DC combines the best services and components from Cloudera Enterprise Data Hub and Hortonworks Data Platform Enterprise along with […]
This post describes an architecture, and associated controls for privacy, to build a data platform for a nationwide proactive contact tracing solution. Background After calls for a way of using technology to facilitate the lifting of restrictions on freedom of movement for people not self isolating, whilst ensuring regulatory obligations such as the UK Human […]
Time Series as Fast Analytics on Fast Data Since the open-source introduction of Apache Kudu in 2015, it has billed itself as storage for fast analytics on fast data. This general mission encompasses many different workloads, but one of the fastest-growing use cases is that of time-series analytics. Time series has several key requirements: High-performance […]
This post was authored by guest author Cheng Xu, Senior Architect (Intel), as well as Adar Lieber-Dembo, Principal Engineer (Cloudera) and Greg Solovyev, Engineering Manager (Cloudera) Overview Intel Optane DC persistent memory (Optane DCPMM) has higher bandwidth and lower latency than SSD and HDD storage drives. These characteristics of Optane DCPMM provide a significant performance […]