Category Archives: Hadoop

Performance comparison of different file formats and storage engines in the Apache Hadoop ecosystem

Categories: Avro Guest Hadoop HBase Kudu Parquet

Zbigniew Baranowski is a database systems specialist and a member of a group which provides and supports central database and Hadoop-based services at CERN. This blog was originally released on CERN’s “Databases at CERN” blog, and is syndicated here with CERN’s permission.

 

TOPIC

This post presents a performance comparison of few popular data formats and storage engines available in the Apache Hadoop ecosystem: Apache Avro,

Read More

New in Cloudera Enterprise 5.10: Hue SQL Editor and Security Improvements

Categories: Hadoop Hue Oozie

Cloudera Enterprise 5.10 includes the latest updates of Hue, the intelligent editor for SQL Developers and Analysts.

As part of Cloudera’s continuing investments in user experience and productivity, Cloudera Enterprise 5.10 includes an updated version of Hue. We provide a summary of the main enhancements in the following part of this blog post. (Hue from C5.10 is also available for a quick try in one click on demo.gethue.com.)

SQL Improvements

The Hue editor keeps getting better with these major improvements:

Row Count

The number of rows returned is displayed so you can quickly see the size of the dataset.

Read More

Analyzing US flight data on Amazon S3 with sparklyr and Apache Spark 2.0

Categories: CDH Data Science Hadoop Spark Use Case

We posted several blog posts about sparklyr (introduction, automation), which enables you to analyze big data leveraging Apache Spark seamlessly with R. sparklyr, developed by RStudio, is an R interface to Spark that allows users to use Spark as the backend for dplyr, which is the popular data manipulation package for R.

If you are interested in sparklyr, you can learn how to use it with the official document,

Read More

Working with UDFs in Apache Spark

Categories: Hadoop Spark

User-defined functions (UDFs) are a key feature of most SQL environments to extend the system’s built-in functionality.  UDFs allow developers to enable new functions in higher level languages such as SQL by abstracting their lower level language implementations.  Apache Spark is no exception, and offers a wide range of options for integrating UDFs with Spark SQL workflows.

In this blog post, we’ll review simple examples of Apache Spark UDF and UDAF (user-defined aggregate function) implementations in Python,

Read More

Up and running with Apache Spark on Apache Kudu

Categories: CDH Data Ingestion Data Science General Hadoop How-to Impala Kudu Spark Training Use Case

After the GA of Apache Kudu in Cloudera CDH 5.10, we take a look at the Apache Spark on Kudu integration, share code snippets, and explain how to get up and running quickly, as Kudu is already a first-class citizen in Spark’s ecosystem.

 

As the Apache Kudu development team celebrates the initial 1.0 release launched on September 19, and the most recent 1.2.0 version now GA as part of Cloudera’s CDH 5.10 release,

Read More