Category Archives: Guest

Proactive Data Pipeline Alerting with Pulse

Categories: CDH Events Guest Search

In mid-2017, we were working with one of the world’s largest healthcare companies to put a new data application into production. The customer had grown through acquisition, and to maintain compliance with the FDA, they needed to aggregate data in real time from dozens of different divisions of the company. The consumers of this application, of course, did not care how we built the data pipeline. However, they cared greatly that if it broke,

Read more

Automated Provisioning of CDH in the Cloud with Cloudera Director and Ansible

Categories: CDH Cloud Cloudera Director Guest

This is a guest blog post from Jasper Pult, Technology Consultant at Lufthansa Industry Solutions, an international IT consultancy covering all aspects of Big Data, IoT and Cloud. The work below was implemented using Director’s API v9, and certain API details might change in future versions.

Cloud computing is quickly replacing traditional on-premises solutions in all kinds of industries. With Apache Hadoop workloads often varying in resource requirements over time,

Read more

Performance comparison of different file formats and storage engines in the Apache Hadoop ecosystem

Categories: Avro Guest Hadoop HBase Kudu Parquet

Zbigniew Baranowski is a database systems specialist and a member of a group which provides and supports central database and Hadoop-based services at CERN. This blog was originally released on CERN’s “Databases at CERN” blog, and is syndicated here with CERN’s permission.

 

TOPIC

This post presents a performance comparison of a few popular data formats and storage engines available in the Apache Hadoop ecosystem: Apache Avro,

Read more

Introducing sparklyr, an R Interface for Apache Spark

Categories: Data Science Guest Spark

Earlier this week, RStudio announced sparklyr, a new package that provides an interface between R and Apache Spark. We republish RStudio’s blog post below (see original) for your convenience.

Over the past couple of years we’ve heard time and time again that people want a native dplyr interface to Spark, so we built one! sparklyr also provides interfaces to Spark’s distributed machine learning algorithms and much more. 

Read more

Skool: An Open Source Data Integration Tool for Apache Hadoop from BT Group

Categories: Data Ingestion Guest Hadoop

In this guest post, Skool’s architects at BT Group explain its origins, design, and functionality.

With increased adoption of big data comes the challenge of integrating existing data sitting in various relational and file-based systems with Apache Hadoop infrastructure. Although open source connectors (such as Apache Sqoop) and utilities (such as HttpFS/curl on Linux) make it easy to exchange data, data engineering teams often spend an inordinate amount of time writing code for this purpose.

Read more