Category Archives: Guest

Performance comparison of different file formats and storage engines in the Apache Hadoop ecosystem

Categories: Avro Guest Hadoop HBase Kudu Parquet

Zbigniew Baranowski is a database systems specialist and a member of a group that provides and supports central database and Hadoop-based services at CERN. This post was originally published on CERN’s “Databases at CERN” blog, and is syndicated here with CERN’s permission.

 

TOPIC

This post presents a performance comparison of a few popular data formats and storage engines available in the Apache Hadoop ecosystem: Apache Avro,

Read more

Introducing sparklyr, an R Interface for Apache Spark

Categories: Data Science Guest Spark

Earlier this week, RStudio announced sparklyr, a new package that provides an interface between R and Apache Spark. We republish RStudio’s blog post below (see original) for your convenience.

Over the past couple of years we’ve heard time and time again that people want a native dplyr interface to Spark, so we built one! sparklyr also provides interfaces to Spark’s distributed machine learning algorithms and much more. Highlights include:

  • Interactively manipulate Spark data using both dplyr and SQL (via DBI).

Read more

Skool: An Open Source Data Integration Tool for Apache Hadoop from BT Group

Categories: Data Ingestion Guest Hadoop

In this guest post, Skool’s architects at BT Group explain its origins, design, and functionality.

With increased adoption of big data comes the challenge of integrating existing data sitting in various relational and file-based systems with Apache Hadoop infrastructure. Although open source connectors (such as Apache Sqoop) and utilities (such as HttpFS/curl on Linux) make it easy to exchange data, data engineering teams often spend an inordinate amount of time writing code for this purpose.

Read more

Microsoft Power BI Enables Connectivity to Apache Impala (Incubating)

Categories: Guest Impala

Microsoft recently announced a new Impala Connector for Power BI Desktop (currently in preview, with GA expected early in 2017). Cloudera is also working with Microsoft’s Power BI Engineering team to certify it against Impala to ensure it meets critical enterprise requirements such as security. The following Microsoft post about the new connector, by Power BI senior program manager Miguel Llopis, is republished below for your convenience.

In the Power BI Desktop July 2016 Update,

Read more

Securing Apache Spark Shuffle using Apache Commons Crypto

Categories: Guest Platform Security & Cybersecurity Spark

Learn how the performance advantages of the Crypto cryptographic library will provide an upgrade for Spark shuffle encryption over the current approach.

When running a big data computing job, the data being processed may contain sensitive information that users don’t want anyone else to access. Encrypting that sensitive data is becoming more and more important, especially for enterprise users.

For Apache Spark, which is the emerging standard for big data processing,

Read more