Cloudera Engineering Blog · How-to Posts
Cloudera provides docs and a sample build environment to help you get easily started writing your own Impala UDFs.
User-defined functions (UDFs) let you code your own application logic for processing column values during a Cloudera Impala query. For example, a UDF could perform calculations using an external math library, combine several column values into one, do geospatial calculations, or other kinds of tests and transformations that are outside the scope of the built-in SQL operators and functions.
Developers, rejoice: Impala is now available on EMR for testing and evaluation.
Very recently, Amazon Web Services announced support for running Cloudera Impala queries on its Elastic MapReduce (EMR) service. This is very good news for EMR users — as well as for users of other platforms interested in kicking Impala’s tires in a friction-free way. It’s also yet another sign that Impala is rapidly being adopted across the ecosystem as the gold standard for interactive SQL and BI queries on Apache Hadoop.
The new RImpala package brings the speed and interactivity of Impala to queries from R.
Our thanks to Austin Chungath, Sachin Sudarshana, and Vikas Raguttahalli of Mu Sigma, a Decision Sciences and Big Data analytics company, for the guest post below.
A quick on-ramp (and demo) for using the new Sentry module for RBAC in conjunction with Hive
One attribute of the Enterprise Data Hub is fine-grained access to data by users and apps. This post about supporting infrastructure for that goal was originally published at blogs.apache.org. We republish it here for your convenience.
The second how-to in a series about using the Apache HBase Thrift API
Last time, we covered the fundamentals about connecting to Thrift via Python. This time, you’ll learn how to insert and get multiple rows at a time.
Working with Tables
You can use Hue and Cloudera Search to build your own integrated Big Data search app.
In a previous post, you learned how to analyze data using Apache Hive via Hue’s Beeswax and Catalog apps. This time, you’ll see how to make Yelp Dataset Challenge data searchable by indexing it and building a customizable UI with the Hue Search app.
Indexing Data in Cloudera Search
While XML is very good for standardizing the way Apache Oozie workflows are written, it’s also known for being very verbose. Unfortunately, that means that for workflows that have many actions, your workflow.xml can easily become quite long and difficult to manage and read. Cloudera is constantly making improvements to address this issue, and in this how-to, you’ll get a quick look at some of the current features and tricks that you can use to help shorten your Oozie workflow definitions.
The Sub-Workflow Action
One of the more interesting action types that Oozie has is the Sub-Workflow Action; it allows you to run another workflow from your workflow. Suppose you have a workflow where you’d like to use the same action multiple times; this is not usually allowed because Oozie workflows are Direct Acyclic Graphs (DAG) and so actions cannot be executed more than once as part of a workflow. However, if you put that action into its own workflow, you can actually call it multiple times from within the same workflow by using the Sub-Workflow Action. So, instead of copying and pasting the same action to be able to use it multiple times (and taking up a lot of extra space), you can just use the Sub-Workflow Action, which could be shorter; it is also easier to maintain because if you ever want to change that action, you only have to do it in one place. You also get the advantage of being able to use that action in other workflows. Of course, you can still put multiple actions in your sub-workflow.
We’re always looking for new ways to improve the usability of Oozie and of the workflow format.
Cloudera Manager 4.7 added support for managing Cloudera Search 1.0. Thus Cloudera Manager users can easily deploy all components of Cloudera Search (including Apache Solr) and manage all related services, just like every other service included in CDH (Cloudera’s distribution of Apache Hadoop and related projects).
In this how-to, you will learn the steps involved in adding Cloudera Search to a Cloudera Enterprise (CDH + Cloudera Manager) cluster.
Installing the SOLR Parcel
Apache HBase is all about giving you random, real-time, read/write access to your Big Data, but how do you efficiently get that data into HBase in the first place? Intuitively, a new user will try to do that via the client APIs or by using a MapReduce job with TableOutputFormat, but those approaches are problematic, as you will learn below. Instead, the HBase bulk loading feature is much easier to use and can insert the same amount of data more quickly.
This blog post will introduce the basic concepts of the bulk loading feature, present two use cases, and propose two examples.
Overview of Bulk Loading
Those people have two main options: One is the Thrift interface (the more lightweight and hence faster of the two options), and the other is the REST interface (aka Stargate). A REST interface uses HTTP verbs to perform an action. By using HTTP, a REST interface offers a much wider array of languages and programs that can access the interface. (If you’d like more information about the REST interface, you can go to my series of how-to’s about it.)