Cloudera Engineering Blog · How-to Posts

How-to: Make Hadoop Accessible via LDAP

Integrating Hue with LDAP can help make your secure Hadoop apps available to as many users as possible.

Hue, the open source Web UI that makes Apache Hadoop easier to use, integrates readily with your corporation’s existing identity management systems and provides authentication mechanisms for SSO providers. So, by changing a few configuration parameters, your employees can start analyzing Big Data in their own browsers under an existing security policy.
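For example, switching Hue’s authentication over to an LDAP directory is mostly a matter of editing hue.ini. A minimal sketch, with placeholder hostnames and credentials you would replace with your own directory’s values:

    [desktop]
      [[auth]]
      # Authenticate against the directory instead of Hue's own user database.
      backend=desktop.auth.backend.LdapBackend

      [[ldap]]
      ldap_url=ldap://ldap.example.com
      base_dn="dc=example,dc=com"
      bind_dn="cn=hue-service,dc=example,dc=com"
      bind_password=changeme

      [[[users]]]
      user_filter="objectclass=person"
      user_name_attr=uid   # typically sAMAccountName for Active Directory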

How-to: Get Started Writing Impala UDFs

Cloudera provides docs and a sample build environment to help you easily get started writing your own Impala UDFs.

User-defined functions (UDFs) let you code your own application logic for processing column values during a Cloudera Impala query. For example, a UDF could perform calculations using an external math library, combine several column values into one, do geospatial calculations, or apply other kinds of tests and transformations that are outside the scope of the built-in SQL operators and functions.
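As a taste of what the docs and sample build environment cover, here is a minimal sketch of a scalar UDF written against the headers from the Impala UDF development package (the file and function names are illustrative):

    // add_ints.cc -- a sketch of a scalar Impala UDF (names are illustrative).
    #include <impala_udf/udf.h>   // header from the Impala UDF development package

    using namespace impala_udf;

    // Adds two INT columns, propagating SQL NULLs.
    IntVal AddInts(FunctionContext* context, const IntVal& a, const IntVal& b) {
      if (a.is_null || b.is_null) return IntVal::null();
      return IntVal(a.val + b.val);
    }

Once compiled into a shared library and copied to HDFS, a function like this would be registered from impala-shell with something along the lines of CREATE FUNCTION add_ints(int, int) RETURNS int LOCATION '/user/impala/udfs/libaddints.so' SYMBOL='AddInts';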

How-to: Use Impala on Amazon EMR

Developers, rejoice: Impala is now available on EMR for testing and evaluation.

Amazon Web Services recently announced support for running Cloudera Impala queries on its Elastic MapReduce (EMR) service. This is very good news for EMR users, as well as for users of other platforms interested in kicking Impala’s tires in a friction-free way. It’s also yet another sign that Impala is rapidly being adopted across the ecosystem as the gold standard for interactive SQL and BI queries on Apache Hadoop.

How-to: Do Statistical Analysis with Impala and R

The new RImpala package brings the speed and interactivity of Impala to queries from R.

Our thanks to Austin Chungath, Sachin Sudarshana, and Vikas Raguttahalli of Mu Sigma, a Decision Sciences and Big Data analytics company, for the guest post below.
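To give a flavor of the workflow the post describes, querying Impala from R looks roughly like this (the library path, hostname, port, and table names are placeholders; see the post for the exact setup):

    library(RImpala)
    rimpala.init(libs = "/opt/impala-jars")              # path to the Impala JDBC jars
    rimpala.connect("impala-host.example.com", "21050")  # Impala daemon host and port
    sales <- rimpala.query("SELECT category, COUNT(*) AS n FROM sales GROUP BY category")
    rimpala.close()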

How-to: Get Started with Sentry in Hive

A quick on-ramp (and demo) for using the new Sentry module for role-based access control (RBAC) in conjunction with Hive

One attribute of the Enterprise Data Hub is fine-grained access to data by users and apps. This post about supporting infrastructure for that goal was originally published at blogs.apache.org. We republish it here for your convenience.
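As a preview of what the post walks through, Sentry’s policy file maps groups to roles and roles to privileges. A minimal sketch (the group, role, database, and table names are illustrative):

    # sentry-provider.ini
    [groups]
    # Members of the "analysts" group are granted the analyst_role.
    analysts = analyst_role

    [roles]
    # analyst_role may only SELECT from the events table in the analytics db.
    analyst_role = server=server1->db=analytics->table=events->action=select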

How-to: Use the HBase Thrift Interface, Part 2: Inserting/Getting Rows

The second how-to in a series about using the Apache HBase Thrift API

Last time, we covered the fundamentals of connecting to Thrift via Python. This time, you’ll learn how to insert and get multiple rows at a time.
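As a preview, inserting a row and reading it back looks roughly like this, reusing the connection boilerplate from Part 1 (the table, column family, and host are illustrative, and depending on your HBase version the trailing attributes argument to mutateRow/getRow may not exist):

    from thrift.transport import TSocket, TTransport
    from thrift.protocol import TBinaryProtocol
    from hbase import Hbase                    # Thrift-generated bindings from Part 1
    from hbase.ttypes import Mutation

    transport = TTransport.TBufferedTransport(TSocket.TSocket('localhost', 9090))
    client = Hbase.Client(TBinaryProtocol.TBinaryProtocol(transport))
    transport.open()

    # Each Mutation sets one column; mutateRow applies them atomically to one row.
    client.mutateRow('users', 'row1', [Mutation(column='cf:name', value='Alice')], None)

    # getRow returns a list of TRowResult objects (empty if the row does not exist).
    for result in client.getRow('users', 'row1', None):
        print(result.columns['cf:name'].value)

    transport.close()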


How-to: Index and Search Data with Hue’s Search App

You can use Hue and Cloudera Search to build your own integrated Big Data search app.

In a previous post, you learned how to analyze data using Apache Hive via Hue’s Beeswax and Catalog apps. This time, you’ll see how to make Yelp Dataset Challenge data searchable by indexing it and building a customizable UI with the Hue Search app.
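Before the Hue side of things, the data has to land in a Solr collection. With Cloudera Search, creating one typically starts with solrctl; a minimal sketch (the collection name and config path are illustrative):

    # Generate a template Solr config, edit schema.xml for the Yelp fields,
    # then register the config and create the collection it backs.
    $ solrctl instancedir --generate $HOME/yelp_config
    $ solrctl instancedir --create yelp_demo $HOME/yelp_config
    $ solrctl collection --create yelp_demo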


How-to: Shorten Your Oozie Workflow Definitions

While XML is very good for standardizing the way Apache Oozie workflows are written, it’s also notoriously verbose. That means that for workflows with many actions, your workflow.xml can easily become long and difficult to read and manage. Cloudera is constantly making improvements to address this issue, and in this how-to, you’ll get a quick look at some of the current features and tricks that you can use to shorten your Oozie workflow definitions.

The Sub-Workflow Action

One of the more interesting action types in Oozie is the Sub-Workflow Action, which lets you run another workflow from within your workflow. Suppose you’d like to use the same action multiple times in a workflow; that’s not normally allowed because Oozie workflows are directed acyclic graphs (DAGs), so an action cannot be executed more than once as part of a workflow. However, if you put that action into its own workflow, you can call it multiple times from the same parent workflow via the Sub-Workflow Action. So, instead of copying and pasting the same action (and taking up a lot of extra space), you can just use the Sub-Workflow Action, which is typically shorter. It’s also easier to maintain: if you ever want to change that action, you only have to do it in one place, and you gain the ability to reuse it from other workflows. Of course, you can still put multiple actions in your sub-workflow.
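A sketch of what calling out to a shared workflow looks like (the application path and action names are illustrative):

    <action name="shared-step">
        <sub-workflow>
            <app-path>${nameNode}/user/oozie/apps/shared-step-wf</app-path>
            <!-- Hand the parent workflow's configuration down to the child. -->
            <propagate-configuration/>
        </sub-workflow>
        <ok to="next-step"/>
        <error to="fail"/>
    </action>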

We’re always looking for new ways to improve the usability of Oozie and of the workflow format.

How-to: Add Cloudera Search to Your Cluster using Cloudera Manager

Cloudera Manager 4.7 added support for managing Cloudera Search 1.0, which means Cloudera Manager users can easily deploy all components of Cloudera Search (including Apache Solr) and manage the related services just like every other service included in CDH (Cloudera’s distribution of Apache Hadoop and related projects).

In this how-to, you will learn the steps involved in adding Cloudera Search to a Cloudera Enterprise (CDH + Cloudera Manager) cluster.


How-to: Use HBase Bulk Loading, and Why

Apache HBase is all about giving you random, real-time, read/write access to your Big Data, but how do you efficiently get that data into HBase in the first place? Intuitively, a new user will try to do that via the client APIs or by using a MapReduce job with TableOutputFormat, but those approaches are problematic, as you will learn below. Instead, the HBase bulk loading feature is much easier to use and can insert the same amount of data more quickly.

This blog post will introduce the basic concepts of the bulk loading feature, present two use cases, and walk through two examples.
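As a preview of those examples, a typical bulk load of delimited text uses the stock ImportTsv tool to write HFiles, then hands them to the table with the completebulkload step (the table name, columns, and paths below are illustrative):

    # MapReduce job that writes the TSV input out as HFiles under /tmp/hfiles.
    $ hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
        -Dimporttsv.columns=HBASE_ROW_KEY,cf:name,cf:email \
        -Dimporttsv.bulk.output=/tmp/hfiles mytable /input/users.tsv

    # Hand the finished HFiles to the RegionServers, bypassing the write path.
    $ hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /tmp/hfiles mytable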

