Cloudera Engineering Blog · HBase Posts
This new open source complement to HDFS and Apache HBase is designed to fill gaps in Hadoop’s storage layer that have given rise to stitched-together, hybrid architectures.
The set of data storage and processing technologies that define the Apache Hadoop ecosystem are expansive and ever-improving, covering a very diverse set of customer use cases used in mission-critical enterprise applications. At Cloudera, we’re constantly pushing the boundaries of what’s possible with Hadoop—making it faster, easier to work with, and more secure.
YCSB, the open standard for comparative performance evaluation of data stores, is now available to CDH users for their Apache HBase deployments via new packages from Cloudera Labs.
Many factors go into deciding which data store should be used for production applications, including basic features, data model, and the performance characteristics for a given type of workload. It’s critical to have the ability to compare multiple data stores intelligently and objectively so that you can make sound architectural decisions.
The SparkOnHBase project in Cloudera Labs was recently merged into the Apache HBase trunk. In this post, learn the project’s history and what the future looks like for the new HBase-Spark module.
SparkOnHBase was first pushed to Github on July 2014, just six months after Spark Summit 2013 and five months after Apache Spark first shipped in CDH. That conference was a big turning point for me, because for the first time I realized that the MapReduce engine had a very strong competitor. Spark was about to enter an exciting new phase in its open source life cycle, and just one year later, it’s used at massive scale at 100s if not 1000s of companies (with 200+ of them doing so on Cloudera’s platform).
Learn about the near real-time data ingest architecture for transforming and enriching data streams using Apache Flume, Apache Kafka, and RocksDB at Santander UK.
Cloudera Professional Services has been working with Santander UK to build a near real-time (NRT) transactional analytics system on Apache Hadoop. The objective is to capture, transform, enrich, count, and store a transaction within a few seconds of a card purchase taking place. The system receives the bank’s retail customer card transactions and calculates the associated trend information aggregated by account holder and over a number of dimensions and taxonomies. This information is then served securely to Santander’s “Spendlytics” app (see below) to enable customers to analyze their latest spending patterns.
To design effective fraud-detection architecture, look no further than the human brain (with some help from Spark Streaming and Apache Kafka).
At its core, fraud detection is about detection whether people are behaving “as they should,” otherwise known as catching anomalies in a stream of events. This goal is reflected in diverse applications such as detecting credit-card fraud, flagging patients who are doctor shopping to obtain a supply of prescription drugs, or identifying bullies in online gaming communities.
Thrift client authentication and
doAs impersonation, introduced in HBase 1.0, provides more flexibility for your HBase installation.
In the two-part blog series “How-to: Use the HBase Thrift Interface” (Part 1 and Part 2), Jesse Anderson explained the Thrift interface in detail, and demonstrated how to use it. He didn’t cover running Thrift in a secure Apache HBase cluster, however, because there was no difference in the client configuration with the HBase releases available at that time.
Thanks to Pengyu Wang, software developer at FINRA, for permission to republish this post.
Salted Apache HBase tables with pre-split is a proven effective HBase solution to provide uniform workload distribution across RegionServers and prevent hot spots during bulk writes. In this design, a row key is made with a logical key plus salt at the beginning. One way of generating salt is by calculating n (number of regions) modulo on the hash code of the logical row key (date, etc).
Salting Row Keys
Learn about the design decisions behind HBase’s new support for MOBs.
Apache HBase is a distributed, scalable, performant, consistent key value database that can store a variety of binary data types. It excels at storing many relatively small values (<10K), and providing low-latency reads and writes.
Evaluating which streaming architectural pattern is the best match to your use case is a precondition for a successful production deployment.
The Apache Hadoop ecosystem has become a preferred platform for enterprises seeking to process and understand large-scale data in real time. Technologies like Apache Kafka, Apache Flume, Apache Spark, Apache Storm, and Apache Samza are increasingly pushing the envelope on what is possible. It is often tempting to bucket large-scale streaming use cases together but in reality they tend to break down into a few different architectural patterns, with different components of the ecosystem better suited for different problems.
The following post about the new request throttling feature in HBase 1.1 (now shipping in CDH 5.4) originally published in the ASF blog. We re-publish it here for your convenience.
Running multiple workloads on HBase has always been challenging, especially when trying to execute real-time workloads while concurrently running analytical jobs. One possible way to address this issue is to throttle analytical MR jobs so that real-time workloads are less affected.
The following post, from Cloudera intern Jonathan Lawlor, originally appeared in the Apache Software Foundation’s blog.
Over the past few months there have a been a variety of nice changes made to scanners in Apache HBase. This post focuses on two such changes, namely RPC chunking (HBASE-11544) and scanner heartbeat messages (HBASE-13090). Both of these changes address long standing issues in the client-server scan protocol. Specifically, RPC chunking deals with how a server handles the scanning of very large rows and scanner heartbeat messages allow scan operations to progress even when aggressive server-side filtering makes infrequent result returns.
We are happy to announce the inclusion of Apache Phoenix in Cloudera Labs.
This year’s HBaseCon Use Cases track includes war stories about some of the world’s best examples of running Apache HBase in production.
As a final sneak preview leading up to the show next week, in this post, I’ll give you a window into the HBaseCon 2015′s (May 7 in San Francisco) Use Cases track.
This year’s HBaseCon Ecosystem track covers projects that are complementary to HBase (with a focus on SQL) such as Apache Phoenix, Apache Kylin, and Trafodion.
In this post, I’ll give you a window into the HBaseCon 2015′s (May 7 in San Francisco) Ecosystem track.
This year’s HBaseCon Development & Internals track covers new features in HBase 1.0, what’s to come in 2.0, best practices for tuning, and more.
In this post, I’ll give you a window into the HBaseCon 2015′s (May 7 in San Francisco) Development & Internals track.
This year’s HBaseCon Operations track features some of the world’s largest and most impressive operators.
In this post, I’ll give you a window into the HBaseCon 2015′s (May 7 in San Francisco) Operations track.
As is its tradition, this year’s HBaseCon General Session includes keynotes about the world’s most awesome HBase deployments.
It’s Spring, which also means that it’s HBaseCon season—the time when the Apache HBase community gathers for its annual ritual.
The Cloudera HBase Team are proud to be members of Apache HBase’s model community and are currently AWOL, busy celebrating the release of the milestone Apache HBase 1.0. The following, from release manager Enis Soztutar, was published today in the ASF’s blog.
HBaseCon 2015 is ON, people! Book Thursday, May 7, in your calendars.
If you’re a developer in Silicon Valley, you probably already know that since its debut in 2012, HBaseCon has been one of the best developer community conferences out there. If you’re not, this is a great opportunity to learn that for yourself: HBaseCon 2015 will occur on Thurs., May 7, 2015, at the Westin St. Francis on Union Square in San Francisco.
As we progressively move from MapReduce to Spark, we shouldn’t have to give up good HBase integration. Hence the newest Cloudera Labs project, SparkOnHBase!
[Ed. Note: In Aug. 2015, SparkOnHBase was committed to the Apache HBase trunk in the form of a new HBase-Spark module.]
These new Apache HBase features in CDH 5.2 make multi-tenant environments easier to manage.
Historically, Apache HBase treats all tables, users, and workloads with equal weight. This approach is sufficient for a single workload, but when multiple users and multiple workloads were applied on the same cluster or table, conflicts can arise. Fortunately, starting with HBase in CDH 5.2 (HBase 0.98 + backports), workloads and users can now be prioritized.
This guest post from Intel Java performance architect Eric Kaczmarek (originally published here) explores how to tune Java garbage collection (GC) for Apache HBase focusing on 100% YCSB reads.
Apache HBase is an Apache open source project offering NoSQL data storage. Often used together with HDFS, HBase is widely used across the world. Well-known users include Facebook, Twitter, Yahoo, and more. From the developer’s perspective, HBase is a “distributed, versioned, non-relational database modeled after Google’s Bigtable, a distributed storage system for structured data”. HBase can easily handle very high throughput by either scaling up (i.e., deployment on a larger server) or scaling out (i.e., deployment on more servers).
The number of powerful data query tools in the Apache Hadoop ecosystem can be confusing, but understanding a few simple things about your needs usually makes the choice easy.
Ah, the good old days. I recall vividly that in 2007, I was faced to store 1 billion XML documents and make them accessible as well as searchable. I had few choices on a given shoestring budget: build something one my own (it was the rage back then—and still is), use an existing open source database like PostgreSQL or MySQL, or try this thing that Google built successfully and that was now implemented in open source under the Apache umbrella: Hadoop.
An update on community efforts to bring at-rest encryption to HDFS — a major theme of Project Rhino.
Encryption is a key requirement for many privacy and security-sensitive industries, including healthcare (HIPAA regulations), card payments (PCI DSS regulations), and the US government (FISMA regulations).
Organizing your data inside Hadoop doesn’t have to be hard — Kite SDK helps you try out new data configurations quickly in either HDFS or HBase.
Kite SDK is a Cloudera-sponsored open source project that makes it easier for you to build applications on top of Apache Hadoop. Its premise is that you shouldn’t need to know how Hadoop works to build your application on it, even though that’s an unfortunately common requirement today (because the Hadoop APIs are low-level; all you get is a filesystem and whatever else you can dream up — well, code up).
HBaseCon 2014 is in the books. Thanks to all attendees, speakers, and sponsors!
Thanks to Jonathan Natkins of WibiData for the post below about how his company extended Cloudera Manager to manage Kiji. Learn more about Kiji and the organizations using it to build real-time HBase applications at Kiji Sessions, happening on May 6, 2014, the day after HBaseCon.
As a partner of Cloudera, WibiData sees Cloudera Manager’s new extensibility framework as one of the most exciting parts of Cloudera Enterprise 5. Cloudera Manager 5.0.0 provides the single-pane view that Apache Hadoop administrators and operators want to effectively manage a cluster of machines. Additionally, Cloudera Manager now offers tight integration for partners to plug into the CDH ecosystem, which benefits Cloudera as well as WibiData.
The HBaseCon 2014 “Case Studies” track surfaces some of the most interesting (and diverse) use cases in the HBase ecosystem — and in the world of NoSQL overall — today.
The HBaseCon 2014 (May 5, 2014 in San Francisco) is not just about internals and best practices — it’s also a place to explore use cases that you not have even considered before.
The HBaseCon 2014 “Ecosystem” track offers a cross-section view of the most interesting projects emerging on top of, or alongside, HBase.
The HBaseCon 2014 “Features & Internals” track covers the newest developments in Apache HBase functionality.
The conclusion to this series covers how to use scans, and considerations for choosing the Thrift or REST APIs.
In this series of how-tos, you have learned how to use Apache HBase’s Thrift interface. Part 1 covered the basics of the API, working with Thrift, and some boilerplate code for connecting to Thrift. Part 2 showed how to insert and to get multiple rows at a time. In this third and final post, you will learn how to use scans and some considerations when choosing between REST and Thrift.
Scanning with Thrift
HBaseCon 2014 “Operations” track reveals best practices used by some of the world’s largest production-cluster operators.
The HBaseCon 2014 General Session – with keynotes by Facebook, Google, and Salesforce.com engineers – is arguably the best ever.
HBaseCon 2014 (May 5, 2014 in San Francisco) is coming very, very soon. Over the next few weeks, as I did for last year’s conference, I’ll be bringing you sneak previews of session content (across Operations, Features & Internals, Ecosystem, and Case Studies tracks) accepted by the Program Committee.
Users of diverse, real-world HBase deployments around the world present at this year’s event.
This year’s agenda for HBaseCon, the conference for the Apache HBase community (developers, operators, contributors), looks “Stack-ed” with can’t-miss keynotes and breakouts. Program committee, you really came through (again).
Cloudera’s own enterprise data hub is yielding great results for providing world-class customer support.
Here at Cloudera, we are constantly pushing the envelope to give our customers world-class support. One of the cornerstones of this effort is the Cloudera Support Interface (CSI), which we’ve described in prior blog posts (here and here). Through CSI, our support team is able to quickly reason about a customer’s environment, search for information related to a case currently being worked, and much more.
These suggestions from the Program Committee offer an inside track to getting your talk accepted!
With HBaseCon 2014 (in San Francisco on May 5) Call for Papers closing in just over three weeks (on Feb. 14 — sooner than you think), there’s no better time than “now” to start thinking about your proposal.
The third-annual HBaseCon is now open for business. Submit your paper or register today for early bird savings!
Seems like only yesterday that droves of Apache HBase developers, committers/contributors, operators, and other enthusiasts converged in San Francisco for HBaseCon 2013 — nearly 800 of them, in fact.
With the close of 2013, we also thought it appropriate to include some high points from across the year (not listed in any particular order):
The compactions model is changing drastically with CDH 5/HBase 0.96. Here’s what you need to know.
Apache HBase is a distributed data store based upon a log-structured merge tree, so optimal read performance would come from having only one file per store (Column Family). However, that ideal isn’t possible during periods of heavy incoming writes. Instead, HBase will try to combine HFiles to reduce the maximum number of disk seeks needed for a read. This process is called compaction.
The second how-to in a series about using the Apache HBase Thrift API
Last time, we covered the fundamentals about connecting to Thrift via Python. This time, you’ll learn how to insert and get multiple rows at a time.
Working with Tables
Get an overview of the available mechanisms for backing up data stored in Apache HBase, and how to restore that data in the event of various data recovery/failover scenarios
With increased adoption and integration of HBase into critical business systems, many enterprises need to protect this important business asset by building out robust backup and disaster recovery (BDR) strategies for their HBase clusters. As daunting as it may sound to quickly and easily backup and restore potentially petabytes of data, HBase and the Apache Hadoop ecosystem provide many built-in mechanisms to accomplish just that.
Cloudera Manager 4.7 added support for managing Cloudera Search 1.0. Thus Cloudera Manager users can easily deploy all components of Cloudera Search (including Apache Solr) and manage all related services, just like every other service included in CDH (Cloudera’s distribution of Apache Hadoop and related projects).
In this how-to, you will learn the steps involved in adding Cloudera Search to a Cloudera Enterprise (CDH + Cloudera Manager) cluster.
Installing the SOLR Parcel
We at Cloudera University have been busy lately, building and expanding our courses to help data professionals succeed. We’ve expanded the Hadoop Administrator course and created a new Data Analyst course. Now we’ve updated and relaunched our course on Apache HBase to help more organizations adopt Hadoop’s real-time Big Data store as a competitive advantage.
The course is designed to make sure developers and administrators with an HBase use case can start realizing value from day one. We doubled the length of the curriculum to four days, allowing a deep dive into HBase operations as well as development.
In my previous post you learned how to index email messages in batch mode, and in near real time, using Apache Flume with MorphlineSolrSink. In this post, you will learn how to index emails using Cloudera Search with Apache HBase and Lily HBase Indexer, maintained by NGDATA and Cloudera. (If you have not read the previous post, I recommend you do so for background before reading on.)
Which near-real-time method to choose, HBase Indexer or Flume MorphlineSolrSink, will depend entirely on your use case, but below are some things to consider when making that decision:
Apache ZooKeeper is a client/server system for distributed coordination that exposes an interface similar to a filesystem, where each node (called a znode) may contain data and a set of children. Each znode has a name and can be identified using a filesystem-like path (for example, /root-znode/sub-znode/my-znode).
In Apache HBase, ZooKeeper coordinates, communicates, and shares state between the Masters and RegionServers. HBase has a design policy of using ZooKeeper only for transient data (that is, for coordination and state communication). Thus if the HBase’s ZooKeeper data is removed, only the transient operations are affected – data can continue to be written and read to/from HBase.
The following post, by Apache HBase 0.96 Release Manager/Cloudera Software Engineer Michael Stack, was published originally at blogs.apache.org and is provided below for your convenience. Our thanks to the release’s numerous contributors!
Note: HBase 0.96 will be packaged in the next release of CDH (CDH 5).
The following guest post is provided by Artur Barseghyan, a web developer currently employed by Goldmund, Wyldebeast & Wunderliebe in The Netherlands.
Python is my personal (and primary) programming language of choice and also happens to be the primary programming language at my company. So, when starting to work with a new technology, I prefer to use a clean and easy (Pythonic!) API.
In December 2012, we described how an internal application built on CDH called Cloudera Support Interface (CSI), which drastically improves Cloudera’s ability to optimally support our customers, is a unique and instructive use case for Apache Hadoop. In this post, we’ll follow up by describing two new differentiating CSI capabilities that have made Cloudera Support yet more responsive for customers:
Apache HBase is all about giving you random, real-time, read/write access to your Big Data, but how do you efficiently get that data into HBase in the first place? Intuitively, a new user will try to do that via the client APIs or by using a MapReduce job with TableOutputFormat, but those approaches are problematic, as you will learn below. Instead, the HBase bulk loading feature is much easier to use and can insert the same amount of data more quickly.
This blog post will introduce the basic concepts of the bulk loading feature, present two use cases, and propose two examples.
Overview of Bulk Loading
Those people have two main options: One is the Thrift interface (the more lightweight and hence faster of the two options), and the other is the REST interface (aka Stargate). A REST interface uses HTTP verbs to perform an action. By using HTTP, a REST interface offers a much wider array of languages and programs that can access the interface. (If you’d like more information about the REST interface, you can go to my series of how-to’s about it.)