Cloudera Developer Blog
Big Data best practices, how-to's, and internals from Cloudera Engineering and the community
A quick on-ramp (and demo) for using the new Sentry module for RBAC in conjunction with Hive
One attribute of the Enterprise Data Hub is fine-grained access to data by users and apps. This post about supporting infrastructure for that goal was originally published at blogs.apache.org. We republish it here for your convenience.
Thanks to Marshall Bockrath-Vandegrift of advanced threat detection/malware company (and CDH user) Damballa for the following post about his Parkour project, which offers libraries for writing MapReduce jobs in Clojure. Parkour has been tested (but is not supported) on CDH 3 and CDH 4.
Clojure is Lisp-family functional programming language which targets the JVM. On the Damballa R&D team, Clojure has become the language of choice for implementing everything from web services to machine learning systems. One of Clojure’s key features for us is that it was designed from the start as an explicitly hosted language, building on rather than replacing the semantics of its underlying platform. Clojure’s mapping from language features to JVM implementation is frequently simpler and clearer even than Java’s.
The second how-to in a series about using the Apache HBase Thrift API
Last time, we covered the fundamentals about connecting to Thrift via Python. This time, you’ll learn how to insert and get multiple rows at a time.
Working with Tables
An overview of some of Cloudera’s contributions to YARN that help support management of multiple resources, from multi resource scheduling in the Fair Schedule to node-level enforcement
As Apache Hadoop become ubiquitous, it is becoming more common for users to run diverse sets of workloads on Hadoop, and these jobs are more likely to have different resource profiles. For example, a MapReduce distcp job or Cloudera Impala query that does a simple scan on a large table may be heavily disk-bound and require little memory. Or, an Apache Spark (incubating) job executing an iterative machine-learning algorithm with complex updates may wish to store the entire dataset in memory and use spurts of CPU to perform complex computation on it.
Some things for which we are thankful, the 2013 edition (not listed in order):
1. The entire Apache Hadoop community for its constant and hard work to Make the Platform Better,
You can use Hue and Cloudera Search to build your own integrated Big Data search app.
In a previous post, you learned how to analyze data using Apache Hive via Hue’s Beeswax and Catalog apps. This time, you’ll see how to make Yelp Dataset Challenge data searchable by indexing it and building a customizable UI with the Hue Search app.
Indexing Data in Cloudera Search
Get an overview of the available mechanisms for backing up data stored in Apache HBase, and how to restore that data in the event of various data recovery/failover scenarios
With increased adoption and integration of HBase into critical business systems, many enterprises need to protect this important business asset by building out robust backup and disaster recovery (BDR) strategies for their HBase clusters. As daunting as it may sound to quickly and easily backup and restore potentially petabytes of data, HBase and the Apache Hadoop ecosystem provide many built-in mechanisms to accomplish just that.
Our thanks to Databricks, the company behind Apache Spark (incubating), for providing the guest post below. Cloudera and Databricks recently announced that Cloudera will distribute and support Spark in CDH. Look for more posts describing Spark internals and Spark + CDH use cases in the near future.
While XML is very good for standardizing the way Apache Oozie workflows are written, it’s also known for being very verbose. Unfortunately, that means that for workflows that have many actions, your workflow.xml can easily become quite long and difficult to manage and read. Cloudera is constantly making improvements to address this issue, and in this how-to, you’ll get a quick look at some of the current features and tricks that you can use to help shorten your Oozie workflow definitions.
The Sub-Workflow Action
One of the more interesting action types that Oozie has is the Sub-Workflow Action; it allows you to run another workflow from your workflow. Suppose you have a workflow where you’d like to use the same action multiple times; this is not usually allowed because Oozie workflows are Direct Acyclic Graphs (DAG) and so actions cannot be executed more than once as part of a workflow. However, if you put that action into its own workflow, you can actually call it multiple times from within the same workflow by using the Sub-Workflow Action. So, instead of copying and pasting the same action to be able to use it multiple times (and taking up a lot of extra space), you can just use the Sub-Workflow Action, which could be shorter; it is also easier to maintain because if you ever want to change that action, you only have to do it in one place. You also get the advantage of being able to use that action in other workflows. Of course, you can still put multiple actions in your sub-workflow.
We’re always looking for new ways to improve the usability of Oozie and of the workflow format.
Cloudera Manager 4.7 added support for managing Cloudera Search 1.0. Thus Cloudera Manager users can easily deploy all components of Cloudera Search (including Apache Solr) and manage all related services, just like every other service included in CDH (Cloudera’s distribution of Apache Hadoop and related projects).
In this how-to, you will learn the steps involved in adding Cloudera Search to a Cloudera Enterprise (CDH + Cloudera Manager) cluster.