Cloudera Engineering Blog · HBase Posts
Get an overview of the available mechanisms for backing up data stored in Apache HBase, and how to restore that data in the event of various data recovery/failover scenarios
With increased adoption and integration of HBase into critical business systems, many enterprises need to protect this important business asset by building out robust backup and disaster recovery (BDR) strategies for their HBase clusters. As daunting as it may sound to quickly and easily backup and restore potentially petabytes of data, HBase and the Apache Hadoop ecosystem provide many built-in mechanisms to accomplish just that.
Cloudera Manager 4.7 added support for managing Cloudera Search 1.0. Thus Cloudera Manager users can easily deploy all components of Cloudera Search (including Apache Solr) and manage all related services, just like every other service included in CDH (Cloudera’s distribution of Apache Hadoop and related projects).
In this how-to, you will learn the steps involved in adding Cloudera Search to a Cloudera Enterprise (CDH + Cloudera Manager) cluster.
Installing the SOLR Parcel
We at Cloudera University have been busy lately, building and expanding our courses to help data professionals succeed. We’ve expanded the Hadoop Administrator course and created a new Data Analyst course. Now we’ve updated and relaunched our course on Apache HBase to help more organizations adopt Hadoop’s real-time Big Data store as a competitive advantage.
The course is designed to make sure developers and administrators with an HBase use case can start realizing value from day one. We doubled the length of the curriculum to four days, allowing a deep dive into HBase operations as well as development.
In my previous post you learned how to index email messages in batch mode, and in near real time, using Apache Flume with MorphlineSolrSink. In this post, you will learn how to index emails using Cloudera Search with Apache HBase and Lily HBase Indexer, maintained by NGDATA and Cloudera. (If you have not read the previous post, I recommend you do so for background before reading on.)
Which near-real-time method to choose, HBase Indexer or Flume MorphlineSolrSink, will depend entirely on your use case, but below are some things to consider when making that decision:
Apache ZooKeeper is a client/server system for distributed coordination that exposes an interface similar to a filesystem, where each node (called a znode) may contain data and a set of children. Each znode has a name and can be identified using a filesystem-like path (for example, /root-znode/sub-znode/my-znode).
In Apache HBase, ZooKeeper coordinates, communicates, and shares state between the Masters and RegionServers. HBase has a design policy of using ZooKeeper only for transient data (that is, for coordination and state communication). Thus if the HBase’s ZooKeeper data is removed, only the transient operations are affected – data can continue to be written and read to/from HBase.
The following post, by Apache HBase 0.96 Release Manager/Cloudera Software Engineer Michael Stack, was published originally at blogs.apache.org and is provided below for your convenience. Our thanks to the release’s numerous contributors!
Note: HBase 0.96 will be packaged in the next release of CDH (CDH 5).
The following guest post is provided by Artur Barseghyan, a web developer currently employed by Goldmund, Wyldebeast & Wunderliebe in The Netherlands.
Python is my personal (and primary) programming language of choice and also happens to be the primary programming language at my company. So, when starting to work with a new technology, I prefer to use a clean and easy (Pythonic!) API.
In December 2012, we described how an internal application built on CDH called Cloudera Support Interface (CSI), which drastically improves Cloudera’s ability to optimally support our customers, is a unique and instructive use case for Apache Hadoop. In this post, we’ll follow up by describing two new differentiating CSI capabilities that have made Cloudera Support yet more responsive for customers:
Apache HBase is all about giving you random, real-time, read/write access to your Big Data, but how do you efficiently get that data into HBase in the first place? Intuitively, a new user will try to do that via the client APIs or by using a MapReduce job with TableOutputFormat, but those approaches are problematic, as you will learn below. Instead, the HBase bulk loading feature is much easier to use and can insert the same amount of data more quickly.
This blog post will introduce the basic concepts of the bulk loading feature, present two use cases, and propose two examples.
Overview of Bulk Loading
Those people have two main options: One is the Thrift interface (the more lightweight and hence faster of the two options), and the other is the REST interface (aka Stargate). A REST interface uses HTTP verbs to perform an action. By using HTTP, a REST interface offers a much wider array of languages and programs that can access the interface. (If you’d like more information about the REST interface, you can go to my series of how-to’s about it.)