Cloudera Developer Blog · HBase Posts
The compactions model is changing drastically with CDH 5/HBase 0.96. Here’s what you need to know.
Apache HBase is a distributed data store based upon a log-structured merge tree, so optimal read performance would come from having only one file per store (Column Family). However, that ideal isn’t possible during periods of heavy incoming writes. Instead, HBase will try to combine HFiles to reduce the maximum number of disk seeks needed for a read. This process is called compaction.
Compactions choose some files from a single store in a region and combine them. This process involves reading KeyValues in the input files and writing out any KeyValues that are not deleted, are inside of the time to live (TTL), and don’t violate the number of versions. The newly created combined file then replaces the input files in the region.
The second how-to in a series about using the Apache HBase Thrift API
Last time, we covered the fundamentals about connecting to Thrift via Python. This time, you’ll learn how to insert and get multiple rows at a time.
Working with Tables
Using the Thrift interface, you can create or delete tables. Let’s take a look at the Python code that creates a table:
Get an overview of the available mechanisms for backing up data stored in Apache HBase, and how to restore that data in the event of various data recovery/failover scenarios
With increased adoption and integration of HBase into critical business systems, many enterprises need to protect this important business asset by building out robust backup and disaster recovery (BDR) strategies for their HBase clusters. As daunting as it may sound to quickly and easily backup and restore potentially petabytes of data, HBase and the Apache Hadoop ecosystem provide many built-in mechanisms to accomplish just that.
In this post, you will get a high-level overview of the available mechanisms for backing up data stored in HBase, and how to restore that data in the event of various data recovery/failover scenarios. After reading this post, you should be able to make an educated decision on which BDR strategy is best for your business needs. You should also understand the pros, cons, and performance implications of each mechanism. (The details herein apply to CDH 4.3.0/HBase 0.94.6 and later.)
Cloudera Manager 4.7 added support for managing Cloudera Search 1.0. Thus Cloudera Manager users can easily deploy all components of Cloudera Search (including Apache Solr) and manage all related services, just like every other service included in CDH (Cloudera’s distribution of Apache Hadoop and related projects).
In this how-to, you will learn the steps involved in adding Cloudera Search to a Cloudera Enterprise (CDH + Cloudera Manager) cluster.
Installing the SOLR Parcel
In our example, the cluster uses a CDH 4.4 parcel and is running Apache ZooKeeper, HDFS, and Apache HBase services. (Parcels are a really useful way to deploy new software and do painless upgrades via Cloudera Manager.)
We at Cloudera University have been busy lately, building and expanding our courses to help data professionals succeed. We’ve expanded the Hadoop Administrator course and created a new Data Analyst course. Now we’ve updated and relaunched our course on Apache HBase to help more organizations adopt Hadoop’s real-time Big Data store as a competitive advantage.
The course is designed to make sure developers and administrators with an HBase use case can start realizing value from day one. We doubled the length of the curriculum to four days, allowing a deep dive into HBase operations as well as development.
As the primary course author, I had the pleasure of interviewing some of the most notable members of the HBase community. People like Michael Stack, Lars George, and Amandeep Khurana have written the books, contributed code, and deployed and supported huge clusters in production. I also tried to capture many of the key insights that otherwise only exist in HBase’s tribal knowledge, some of which I discuss in my recent blog posts on the REST Interface and the Thrift Interface, as well as in the Simple User Access chapter of the Apache HBase Reference Guide.
Beyond the Tribe
In my previous post you learned how to index email messages in batch mode, and in near real time, using Apache Flume with MorphlineSolrSink. In this post, you will learn how to index emails using Cloudera Search with Apache HBase and Lily HBase Indexer, maintained by NGDATA and Cloudera. (If you have not read the previous post, I recommend you do so for background before reading on.)
Which near-real-time method to choose, HBase Indexer or Flume MorphlineSolrSink, will depend entirely on your use case, but below are some things to consider when making that decision:
Apache ZooKeeper is a client/server system for distributed coordination that exposes an interface similar to a filesystem, where each node (called a znode) may contain data and a set of children. Each znode has a name and can be identified using a filesystem-like path (for example, /root-znode/sub-znode/my-znode).
In Apache HBase, ZooKeeper coordinates, communicates, and shares state between the Masters and RegionServers. HBase has a design policy of using ZooKeeper only for transient data (that is, for coordination and state communication). Thus if the HBase’s ZooKeeper data is removed, only the transient operations are affected – data can continue to be written and read to/from HBase.
In this blog post, you will get a short tour of HBase znodes usage. The version of HBase used for reference here is 0.94 (shipped inside CDH 4.2 and CDH 4.3), but most of the znodes are present in previous versions and also likely to be so in future versions.
The following post, by Apache HBase 0.96 Release Manager/Cloudera Software Engineer Michael Stack, was published originally at blogs.apache.org and is provided below for your convenience. Our thanks to the release’s numerous contributors!
Note: HBase 0.96 will be packaged in the next release of CDH (CDH 5).
The following guest post is provided by Artur Barseghyan, a web developer currently employed by Goldmund, Wyldebeast & Wunderliebe in The Netherlands.
Python is my personal (and primary) programming language of choice and also happens to be the primary programming language at my company. So, when starting to work with a new technology, I prefer to use a clean and easy (Pythonic!) API.
After studying tons of articles on the web, reading (and writing) white papers, and doing basic performance tests (sometimes hard if you’re on a tight schedule), my company recently selected Cloudera for our Big Data platform (including using Apache HBase as our data store for Apache Hadoop), with Cloudera Manager serving a role as “one console to rule them all.”
In December 2012, we described how an internal application built on CDH called Cloudera Support Interface (CSI), which drastically improves Cloudera’s ability to optimally support our customers, is a unique and instructive use case for Apache Hadoop. In this post, we’ll follow up by describing two new differentiating CSI capabilities that have made Cloudera Support yet more responsive for customers: