Thanks to Richard Williamson of Silicon Valley Data Science for allowing us to republish the following post about his sample application based on Apache Spark, Apache Kudu (incubating), and Apache Impala (incubating).
Why should your infrastructure maintain a linear growth pattern when your business scales up and down during the day based on natural human cycles? There is an obvious need to maintain a steady baseline infrastructure to keep the lights on for your business,
Cloudera Engineering has developed (and recently open sourced) a distributed unit testing framework that cuts testing time from multiple hours to just 10 minutes.
Upstream unit tests are Cloudera’s first line of defense for finding and fixing software bugs, as part of a multidimensional process that also includes static/dynamic code analysis, fault injection, integration/scale/endurance testing, and validation on real workloads. However, running a full unit test suite for Apache Hadoop ecosystem components can take hours,
Using Apache Impala (incubating) on top of Apache Kudu (incubating) has significant performance benefits
Apache Kudu (incubating) is the newest addition to the set of storage engines that integrate with the Apache Hadoop ecosystem. The promise of Kudu is to deliver high-scan performance, targeting analytical workloads, while allowing users to concurrently insert, update, and delete records. With these properties, Kudu becomes a viable alternative to existing combinations of HDFS and/or Apache HBase to achieve similar results with less complicated ETL pipelines,
Engineers from across the Apache Hadoop community are collaborating to establish Arrow as a de-facto standard for columnar in-memory processing and interchange. Here’s how it works.
Apache Arrow is an in-memory data structure specification for use by engineers building data systems. It has several key benefits:
- A columnar memory-layout permitting O(1) random access. The layout is highly cache-efficient in analytics workloads and permits SIMD optimizations with modern processors.
The following post was originally published in the Ibis project blog. (Ibis is a data analysis framework incubating in Cloudera Labs that brings Apache Hadoop scale to Python development.)
The new Apache Kudu (incubating) columnar storage engine together with Apache Impala (incubating) interactive SQL engine enable a new fully open source big data architecture for data that is arriving and changing very quickly. By integrating Kudu and Impala with Ibis,