Fixes in CDH 5.5 make writing Parquet data for Apache Impala (incubating) much easier.
Over the last few months, several Cloudera customers have reported that Parquet is too hard to configure, the main problem being finding the right layout for good performance in Impala. For that reason, CDH 5.5 contains new features that make those configuration problems go away.
Auto-Detection of HDFS Block Size
For example, you may have seen this warning: Read <some-big-number>
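Before this auto-detection, keeping Parquet's row-group size aligned with the HDFS block size was a manual step. As a rough sketch of that earlier workaround (the 256MB value is an illustrative choice, not a recommendation; `dfs.blocksize` is the standard HDFS property and `parquet.block.size` is parquet-mr's row-group size property), you would set both to the same value:

```xml
<!-- Illustrative only: align the HDFS block size and the Parquet
     row-group size so each row group fits in a single HDFS block. -->
<property>
  <name>dfs.blocksize</name>
  <value>268435456</value> <!-- 256 MB -->
</property>
<property>
  <name>parquet.block.size</name>
  <value>268435456</value> <!-- 256 MB, matching dfs.blocksize -->
</property>
```

With the CDH 5.5 auto-detection, this kind of manual alignment is no longer necessary.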
The Kite project recently released a stable 1.0!
This milestone means that Kite’s data API and command-line tools are ready for long-term use.
The 1.0 data modules and API are no longer rapidly changing. From 1.0 on, Kite will be strict about breaking compatibility and will use semantic versioning to signal what compatibility guarantees you can expect from a given release.
Community contributions to Parquet are increasing in parallel with its adoption. Here are some of the highlights.
Apache Parquet (incubating), the open source, general-purpose columnar storage format for Apache Hadoop, was co-founded only 18 months ago by Cloudera and Twitter. Since that time, its rapid adoption by multiple platform vendors and communities has made it a de facto standard for this purpose.
Most of Cloudera’s recent contributions to Parquet have focused on fixing bugs reported by its growing number of users.
Kite SDK’s latest release contains improvements that make working with data easier.
Recently, Kite SDK, the open source toolset that helps developers build systems on the Apache Hadoop ecosystem, released version 0.15.0. In this post, you’ll get an overview of several new features and bug fixes.
Working with Datasets by URI
The new Datasets class lets you work with datasets based on individual dataset URIs.
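To make the URI idea concrete, here is a minimal, self-contained sketch of how a dataset URI decomposes. In Kite itself you would pass the full string to the `Datasets` class (e.g. `Datasets.load("dataset:hive:default/events")`); the `DatasetUriDemo` class and its `parse` helper below are hypothetical illustrations of the URI layout, not part of the Kite API, and the example URI is made up.

```java
import java.net.URI;

public class DatasetUriDemo {
    // Split a "dataset:<scheme>:<path>" URI into its storage scheme
    // and path parts. Illustration only; Kite's Datasets class does
    // the real parsing and loading.
    static String[] parse(String datasetUri) {
        URI outer = URI.create(datasetUri);
        if (!"dataset".equals(outer.getScheme())) {
            throw new IllegalArgumentException("not a dataset URI: " + datasetUri);
        }
        URI inner = URI.create(outer.getSchemeSpecificPart());
        return new String[] { inner.getScheme(), inner.getSchemeSpecificPart() };
    }

    public static void main(String[] args) {
        String[] parts = parse("dataset:hive:default/events");
        System.out.println("storage scheme: " + parts[0]); // hive
        System.out.println("dataset path:   " + parts[1]); // default/events
    }
}
```

The point of the scheme is that one string names both where the data lives (HDFS, Hive, HBase) and which dataset it is, so tools and applications can refer to datasets without extra configuration.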
Organizing your data inside Hadoop doesn’t have to be hard — Kite SDK helps you try out new data configurations quickly in either HDFS or HBase.
Kite SDK is a Cloudera-sponsored open source project that makes it easier for you to build applications on top of Apache Hadoop. Its premise is that you shouldn’t need to know how Hadoop works to build your application on it, even though that’s an unfortunately common requirement today (because the Hadoop APIs are low-level).