Author Archives: Ryan Blue

New in CDH 5.5: Apache Parquet Usability Improvements

Categories: CDH HDFS Hive Impala Parquet Performance

Fixes in CDH 5.5 make writing Parquet data for Apache Impala (incubating) much easier.

Over the last few months, several Cloudera customers have told us that Parquet is too hard to configure, with the main problem being finding the right layout for great performance in Impala. For that reason, CDH 5.5 contains new features that make those configuration problems go away.

Auto-Detection of HDFS Block Size

For example,

Read More

Progress Report: Community Contributions to Parquet

Categories: Community Parquet

Community contributions to Parquet are increasing in parallel with its adoption. Here are some of the highlights.

Apache Parquet (incubating), the open source, general-purpose columnar storage format for Apache Hadoop, was co-founded only 18 months ago by Cloudera and Twitter. Since that time, its rapid adoption by multiple platform vendors and communities has made it a de facto standard for this purpose.


Most of Cloudera’s recent contributions to Parquet have focused on fixing bugs reported by its growing number of users.

Read More

What’s New in Kite SDK 0.15.0?

Categories: Kite SDK Tools

Kite SDK’s new release contains improvements that make working with data easier.

Recently, Kite SDK, the open source toolset that helps developers build systems on the Apache Hadoop ecosystem, reached version 0.15.0. In this post, you’ll get an overview of several new features and bug fixes.

Working with Datasets by URI

The new Datasets class lets you work with datasets based on individual dataset URIs.

Read More

How-to: Use Kite SDK to Easily Store and Configure Data in Apache Hadoop

Categories: HBase HDFS How-to Kite SDK

Organizing your data inside Hadoop doesn’t have to be hard — Kite SDK helps you try out new data configurations quickly in either HDFS or HBase.

Kite SDK is a Cloudera-sponsored open source project that makes it easier for you to build applications on top of Apache Hadoop. Its premise is that you shouldn’t need to know how Hadoop works to build your application on it, even though that’s an unfortunately common requirement today (because the Hadoop APIs are low-level;

Read More