Category Archives: CDH

The Cloudera Developer Program: The Low-cost, Low-risk Way to Develop on Cloudera

Categories: CDH Cloudera Manager General

The Cloudera Developer Program is kind of amazing. Here’s why.

For those with a desire to build new applications on Cloudera’s platform, historically there’s been a gap to cross between pure bootstrapping on CDH (whether via a small on-premise cluster, in the public cloud, or using Cloudera Live) and obtaining full-blown support for a complete enterprise data hub with all the fixings (including Cloudera Navigator). For individuals who have moved beyond self-learning and are getting “serious,”

Read More

Apache Hive 2.0 is Released

Categories: CDH Hive

The recently released Apache Hive 2.0 contains some exciting improvements, many of which are already available in CDH 5.x.

Recently, the Apache Hive community announced Hive 2.0.0. This release is larger than the previous one (covered here), with a lengthy list of new features (many experimental), enhancements, and bug fixes. Cloudera’s Hive team has been working with the community for months to drive toward this significant release.

Read More

Making Python on Apache Hadoop Easier with Anaconda and CDH

Categories: CDH Cloudera Manager Data Science Spark

Enabling Python development on CDH clusters (for PySpark, for example) is now much easier thanks to new integration with Continuum Analytics’ Python platform (Anaconda).

Python has become an increasingly popular tool for data analysis, including data processing, feature engineering, machine learning, and visualization. Data scientists and data engineers enjoy Python’s rich numerical and analytical libraries—such as NumPy, pandas, and scikit-learn—and have long wanted to apply them to large datasets stored in Apache Hadoop clusters.

Read More

New in CDH 5.5: Apache Parquet Usability Improvements

Categories: CDH HDFS Hive Impala Parquet Performance

Fixes in CDH 5.5 make writing Parquet data for Apache Impala (incubating) much easier.

Over the last few months, several Cloudera customers have told us that Parquet is too hard to configure, with the main problem being finding the right layout for great performance in Impala. For that reason, CDH 5.5 contains new features that make those configuration problems go away.

Auto-Detection of HDFS Block Size

For example,

Read More