Category Archives: Spark

Bi-temporal data modeling with Envelope

Categories: CDH, Data Ingestion, Impala, Kudu, Spark

One of the most fundamental aspects a data model can convey is how something changes over time. This makes sense when considering that we build data models to capture what is happening in the real world, and the real world is constantly changing. The challenge is not just that new things occur; existing things change too. If our data models overwrite the old state of an entity with the new state, we lose information about the change.
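To make the point about overwriting concrete, here is a minimal sketch in plain Python. It is not taken from the post and does not use Envelope; the `AddressVersion` class, the `valid_from` field, and the sample addresses are all hypothetical names chosen for illustration. It tracks only one time dimension (when a state became true in the real world), whereas a bi-temporal model would also record when the system learned of each state.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class AddressVersion:
    """One recorded state of a customer's address (hypothetical example entity)."""
    customer_id: int
    address: str
    valid_from: date  # assumed field: when this state became true in the real world


def address_as_of(history: List[AddressVersion], customer_id: int, as_of: date) -> Optional[str]:
    """Return the address in effect on `as_of`, or None if no version applies yet."""
    candidates = [
        v for v in history
        if v.customer_id == customer_id and v.valid_from <= as_of
    ]
    if not candidates:
        return None
    # The version with the latest valid_from on or before `as_of` is the one in effect.
    return max(candidates, key=lambda v: v.valid_from).address


# Overwriting: only the latest state survives, so the change itself is lost.
current = {1: "12 Oak St"}
current[1] = "98 Elm Ave"

# Appending: each change is stored as a new version, so earlier states stay queryable.
history = [
    AddressVersion(1, "12 Oak St", date(2017, 1, 1)),
    AddressVersion(1, "98 Elm Ave", date(2017, 6, 1)),
]
```

With the append-only `history`, calling `address_as_of(history, 1, date(2017, 3, 1))` can still recover the earlier address, which the overwritten `current` dictionary cannot.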


Data Engineering with Cloudera Altus

Categories: Altus, Cloud, Hive, Spark

With modern businesses dealing with an ever-increasing volume of data, and an expanding set of data sources, the data engineering process that enables analysis, visualization, and reporting only becomes more important.

When running data engineering workloads in the public cloud, there are capabilities that enable operational models different from those of on-premises deployments. The key factors are the presence of a distinct storage layer within the cloud environment and the ability to provision compute resources on demand (e.g., Amazon’s S3 and EC2, respectively).


Reading data securely from Apache Kafka to Apache Spark

Categories: CDH, Kafka, Platform Security & Cybersecurity, Sentry, Spark


With an ever-increasing number of IoT use cases on the CDH platform, security for such workloads is of paramount importance. This blog post describes how to securely consume data from Kafka in Spark, two critical components for IoT use cases.

The Cloudera Distribution of Apache Kafka 2.0.0 (based on Apache Kafka 0.9.0) introduced a new Kafka consumer API that allowed consumers to read data from a secure Kafka cluster.


Create conda recipe to use C extended Python library on PySpark cluster with Cloudera Data Science Workbench

Categories: CDH, Data Science, How-to, Spark

Cloudera Data Science Workbench provides data scientists with secure access to enterprise data with Python, R, and Scala. In the previous article, we introduced how to use your favorite Python libraries on an Apache Spark cluster with PySpark. In the Python world, data scientists often want to use Python libraries, such as XGBoost, that include C/C++ extensions. This post shows how to solve this problem by creating a conda recipe with a C extension.


The Benefits of Migrating HPC Workloads To Apache Spark

Categories: CDH, Data Science, Hadoop, Spark


Recently we worked with a customer that needed to run a very large number of models each day to satisfy internal and government-regulated risk requirements. Several thousand model executions needed to be supported per hour. Total execution time was very important to this client. In the past, the customer used thousands of servers to meet the demand. They needed to run many derivations of this model with different economic factors to satisfy their requirements.
