Zbigniew Baranowski is a database systems specialist and a member of a group that provides and supports central database and Hadoop-based services at CERN. This post was originally published on CERN’s “Databases at CERN” blog and is syndicated here with CERN’s permission.
This post presents a performance comparison of a few popular data formats and storage engines available in the Apache Hadoop ecosystem: Apache Avro,
Earlier this week, RStudio announced sparklyr, a new package that provides an interface between R and Apache Spark. We republish RStudio’s blog post below (see original) for your convenience.
Over the past couple of years we’ve heard time and time again that people want a native dplyr interface to Spark, so we built one! sparklyr also provides interfaces to Spark’s distributed machine learning algorithms and much more.
In this guest post, Skool’s architects at BT Group explain its origins, design, and functionality.
With increased adoption of big data comes the challenge of integrating existing data sitting in various relational and file-based systems with Apache Hadoop infrastructure. Although open source connectors (such as Apache Sqoop) and utilities (such as HttpFS/curl on Linux) make it easy to exchange data, data engineering teams often spend an inordinate amount of time writing code for this purpose.
Microsoft recently announced a new Impala Connector for Power BI Desktop (currently in preview, with GA expected early in 2017). Cloudera is also working with Microsoft’s Power BI Engineering team to certify it against Impala to ensure it meets critical enterprise requirements such as security. The following Microsoft post about the new connector, by Power BI senior program manager Miguel Llopis, is republished below for your convenience.
In the Power BI Desktop July 2016 Update,
Learn how the performance advantages of the Crypto cryptographic library will provide an upgrade for Spark shuffle encryption over the current approach.
When running a big data computing job, the data being processed may contain sensitive information that users don’t want anyone else to access. Encrypting that sensitive data is becoming more and more important, especially for enterprise users.
For Apache Spark, which is the emerging standard for big data processing,