Cloudera Engineering Blog · How-to Posts
Unlike any alternative, Cloudera Manager makes it easy to carry out what would otherwise be a disruptive operation for operators and users.
For the increasing number of customers that rely on enterprise data hubs (EDHs) for business-critical applications, it is imperative to minimize or eliminate downtime. Cloudera has therefore focused intently on making software upgrades a routine, non-disruptive operation for EDH administrators and users.
Organizing your data inside Hadoop doesn’t have to be hard — Kite SDK helps you try out new data configurations quickly in either HDFS or HBase.
Kite SDK is a Cloudera-sponsored open source project that makes it easier for you to build applications on top of Apache Hadoop. Its premise is that you shouldn’t need to know how Hadoop works to build your application on it, even though that’s an unfortunately common requirement today (because the Hadoop APIs are low-level; all you get is a filesystem and whatever else you can dream up — well, code up).
Learn how HiveServer, Apache Sentry, and Impala help make Hadoop play nicely with BI tools when Kerberos is involved.
In 2010, I wrote a simple pair of blog entries outlining the general considerations behind using Apache Hadoop with BI tools. The Cloudera partner ecosystem has positively exploded since then, and the technology has matured as well. Today, if JDBC is involved, all the pieces needed to expose Hadoop data through familiar BI tools are available:
Did you know that using the Crunch API is a powerful option for doing time-series analysis?
Apache Crunch is a Java library for building data pipelines on top of Apache Hadoop. (The Crunch project was originally founded by Cloudera data scientist Josh Wills.) Developers can spend more time focused on their use case by using the Crunch API to handle common tasks such as joining data sets and chaining jobs together in a pipeline. At Cloudera, we are so enthusiastic about Crunch that we have included it in CDH 5! (You can get started with Apache Crunch here and here.)
The internals of Oozie’s ShareLib have changed recently (reflected in CDH 5.0.0). Here’s what you need to know.
In a blog post about one year ago, I explained how to use the Apache Oozie ShareLib in CDH 4. Since then, the ShareLib has changed in CDH 5 (particularly its directory structure), so some of that information is now obsolete. (These changes went upstream under OOZIE-1619.)
Getting started with Spark (now shipping inside CDH 5) is easy using this simple example.
(Editor’s note – this post has been updated to reflect CDH 5.1/Spark 1.0)
Improved scheduling capabilities via Oozie in CDH 5 make for far fewer headaches.
One of the best new Apache Oozie features in CDH 5, Cloudera’s software distribution, is the ability to use cron-like syntax for coordinator frequencies. Previously, frequencies had to be fixed intervals (every hour or every two days, for example), which made anything more complicated (such as every hour from 9am to 5pm on weekdays, or the second-to-last day of every month) difficult to schedule.
The conclusion to this series covers how to use scans, and considerations for choosing the Thrift or REST APIs.
In this series of how-tos, you have learned how to use Apache HBase’s Thrift interface. Part 1 covered the basics of the API, working with Thrift, and some boilerplate code for connecting to Thrift. Part 2 showed how to insert and to get multiple rows at a time. In this third and final post, you will learn how to use scans and some considerations when choosing between REST and Thrift.
Scanning with Thrift
The CDH software stack lets you use your tool of choice with the Parquet file format, offering the benefits of columnar storage at each phase of data processing.
An open source project co-founded by Twitter and Cloudera, Parquet was designed from the ground up as a state-of-the-art, general-purpose, columnar file format for the Apache Hadoop ecosystem. In particular, Parquet has several features that make it highly suited to use with Cloudera Impala for data warehouse-style operations:
This quick demo illustrates how easy it is to implement role-based access control in Impala using Sentry.
Apache Sentry (incubating) is the Apache Hadoop ecosystem tool for role-based access control (RBAC). In this how-to, I will demonstrate how to implement Sentry for RBAC in Impala. I feel this introduction is best motivated by a use case.