Cloudera Developer Blog
Big Data best practices, how-to's, and internals from Cloudera Engineering and the community
Flavio Junqueira (PMC Chair of the Apache ZooKeeper project and a member of the Systems and Networking Group at Microsoft Research) and Benjamin Reed (PMC Member and Software Engineer at Facebook) are the co-authors of the new O’Reilly Media book ZooKeeper: Distributed Process Coordination. We had a chat with Flavio and Ben recently about the rationale for writing the book, and what it will add to the distributed systems conversation.
Why did you decide to write this book?
Learn the new features and enhancements in Cloudera Manager 5, including support for YARN, management of third-party apps and frameworks, and more.
The response to the Oct. 2013 release of Cloudera Enterprise 5 Beta has been overwhelming, and Cloudera is working closely with several customers to incorporate their feedback.
Cloudera Manager 5 is a key part of this release, and in this post, I will provide a brief overview of some key features in Beta 1 as well as introduce some of those planned for Beta 2 (to be released in early 2014).
Workload and Resource Management
The compactions model is changing drastically with CDH 5/HBase 0.96. Here’s what you need to know.
Apache HBase is a distributed data store based upon a log-structured merge tree, so optimal read performance would come from having only one file per store (Column Family). However, that ideal isn’t possible during periods of heavy incoming writes. Instead, HBase will try to combine HFiles to reduce the maximum number of disk seeks needed for a read. This process is called compaction.
Compactions choose some files from a single store in a region and combine them. This process involves reading the KeyValues in the input files and writing out any KeyValues that are not deleted, are within the time-to-live (TTL), and don't exceed the allowed number of versions. The newly created combined file then replaces the input files in the region.
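To make that selection logic concrete, here is a simplified, illustrative sketch in Python. HBase itself does this in Java inside the region server, and real compactions also handle delete markers and major-versus-minor behavior; the record layout and function below are hypothetical, not HBase APIs.

```python
import time
from collections import defaultdict

# Hypothetical KeyValue record: (row, qualifier, timestamp, is_delete, value)
def compact(input_keyvalues, ttl_seconds, max_versions):
    """Illustrative only: keep KeyValues that are not deleted, are within
    the TTL, and do not exceed the allowed number of versions per cell."""
    now = time.time()
    versions_seen = defaultdict(int)
    output = []
    # Walk cells newest-first, since HBase orders cells by descending timestamp
    for kv in sorted(input_keyvalues, key=lambda kv: kv[2], reverse=True):
        row, qualifier, timestamp, is_delete, value = kv
        if is_delete:
            continue  # simplified: drop delete markers (real compactions are subtler here)
        if now - timestamp > ttl_seconds:
            continue  # expired: outside the column family's TTL
        if versions_seen[(row, qualifier)] >= max_versions:
            continue  # already kept the maximum number of versions for this cell
        versions_seen[(row, qualifier)] += 1
        output.append(kv)
    return output  # conceptually, the contents of the single new HFile that replaces the inputs
```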
Welcome to our fifth edition of “This Month in the Ecosystem,” a digest of highlights from November 2013 (never intended to be comprehensive; for completeness, see the excellent Hadoop Weekly).
With the holidays upon us, the news in November was sparse. Even so, the ecosystem never stops churning!
A quick on-ramp (and demo) for using the new Sentry module for RBAC in conjunction with Hive
One attribute of the Enterprise Data Hub is fine-grained access to data by users and apps. This post about supporting infrastructure for that goal was originally published at blogs.apache.org. We republish it here for your convenience.
Apache Sentry (incubating) is a highly modular system for providing fine-grained role-based authorization to both data and metadata stored on an Apache Hadoop cluster. It currently works out of the box with Apache Hive and Cloudera Impala. In this blog post, you will learn how to use Sentry with Hive.
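As a small preview of the shape of Sentry's SQL-based grants issued through HiveServer2 (not taken from the post itself), the sketch below assumes the PyHive client and uses purely illustrative role, group, table, and host names; see the full post for the supported setup.

```python
from pyhive import hive  # assumes PyHive is installed and HiveServer2 has Sentry enabled

# Hypothetical connection details
conn = hive.connect(host='hiveserver2.example.com', port=10000, username='hive_admin')
cursor = conn.cursor()

# Define a role, grant it to a group, and give it read access to one table.
# Role, group, and table names here are illustrative only.
for stmt in [
    "CREATE ROLE analyst_role",
    "GRANT ROLE analyst_role TO GROUP analysts",
    "GRANT SELECT ON TABLE sample_07 TO ROLE analyst_role",
]:
    cursor.execute(stmt)

cursor.close()
conn.close()
```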
Thanks to Marshall Bockrath-Vandegrift of Damballa, an advanced threat detection/anti-malware company (and CDH user), for the following post about his Parkour project, which offers libraries for writing MapReduce jobs in Clojure. Parkour has been tested (but is not supported) on CDH 3 and CDH 4.
Clojure is a Lisp-family functional programming language that targets the JVM. On the Damballa R&D team, Clojure has become the language of choice for implementing everything from web services to machine-learning systems. One of Clojure’s key features for us is that it was designed from the start as an explicitly hosted language, building on rather than replacing the semantics of its underlying platform. Clojure’s mapping from language features to JVM implementation is often even simpler and clearer than Java’s.
Parkour is our new Clojure library that carries this philosophy to Apache Hadoop’s MapReduce platform. Instead of hiding the underlying MapReduce model behind new framework abstractions, Parkour exposes that model through a clear, direct interface. Everything possible in raw Java MapReduce is possible with Parkour, but usually with a fraction of the code.
The second how-to in a series about using the Apache HBase Thrift API
Last time, we covered the fundamentals about connecting to Thrift via Python. This time, you’ll learn how to insert and get multiple rows at a time.
Working with Tables
Using the Thrift interface, you can create or delete tables. Let’s take a look at the Python code that creates a table:
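A minimal sketch of such code, assuming the Python bindings generated from Hbase.thrift and an HBase Thrift server on the default port (the table and column-family names are illustrative):

```python
from thrift.transport import TSocket, TTransport
from thrift.protocol import TBinaryProtocol

# Modules generated from Hbase.thrift; package names can differ per build
from hbase import Hbase
from hbase.ttypes import ColumnDescriptor

# Connect to the HBase Thrift server (default port 9090)
transport = TTransport.TBufferedTransport(TSocket.TSocket('localhost', 9090))
protocol = TBinaryProtocol.TBinaryProtocol(transport)
client = Hbase.Client(protocol)
transport.open()

# Create a table named "mytable" with one column family, "cf"
client.createTable('mytable', [ColumnDescriptor(name='cf:')])
print(client.getTableNames())

transport.close()
```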
An overview of some of Cloudera’s contributions to YARN that help support management of multiple resources, from multi-resource scheduling in the Fair Scheduler to node-level enforcement
As Apache Hadoop becomes ubiquitous, it is becoming more common for users to run diverse sets of workloads on Hadoop, and these jobs are more likely to have different resource profiles. For example, a MapReduce distcp job or Cloudera Impala query that does a simple scan on a large table may be heavily disk-bound and require little memory. Or, an Apache Spark (incubating) job executing an iterative machine-learning algorithm with complex updates may need to store the entire dataset in memory and use spurts of CPU to perform complex computation on it.
For that reason, the new YARN framework in Hadoop 2 allows workloads to share cluster resources dynamically between a variety of processing frameworks, including MapReduce, Impala, and Spark. YARN currently handles memory and CPU and will coordinate additional resources like disk and network I/O in the future.
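As a quick illustration of YARN tracking both resources, the hedged sketch below polls the ResourceManager’s cluster-metrics REST endpoint and prints memory and vcore usage; the host name is hypothetical, and the available fields can vary by Hadoop version.

```python
import requests  # third-party HTTP client

# Hypothetical ResourceManager address; the default web UI port is 8088
RM_METRICS_URL = 'http://resourcemanager.example.com:8088/ws/v1/cluster/metrics'

metrics = requests.get(RM_METRICS_URL).json()['clusterMetrics']

# YARN schedules both memory (MB) and CPU (virtual cores)
print('Memory: %d MB allocated of %d MB total' %
      (metrics['allocatedMB'], metrics['totalMB']))
print('CPU: %d vcores allocated of %d vcores total' %
      (metrics['allocatedVirtualCores'], metrics['totalVirtualCores']))
```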
Some things for which we are thankful, the 2013 edition (not listed in order):
1. The entire Apache Hadoop community for its constant, hard work to Make the Platform Better,
2. Cloudera’s users, customers, and partners for their continual and helpful feedback to help guide us through #1,
You can use Hue and Cloudera Search to build your own integrated Big Data search app.
In a previous post, you learned how to analyze data using Apache Hive via Hue’s Beeswax and Catalog apps. This time, you’ll see how to make Yelp Dataset Challenge data searchable by indexing it and building a customizable UI with the Hue Search app.
Indexing Data in Cloudera Search
Indexing data in Cloudera Search involves: