Cloudera Engineering Blog · HBase Posts
To design effective fraud-detection architecture, look no further than the human brain (with some help from Spark Streaming and Apache Kafka).
At its core, fraud detection is about detecting whether people are behaving “as they should,” otherwise known as catching anomalies in a stream of events. This goal is reflected in diverse applications such as detecting credit-card fraud, flagging patients who are doctor shopping to obtain a supply of prescription drugs, or identifying bullies in online gaming communities.
Thrift client authentication and doAs impersonation, introduced in HBase 1.0, provide more flexibility for your HBase installation.
In the two-part blog series “How-to: Use the HBase Thrift Interface” (Part 1 and Part 2), Jesse Anderson explained the Thrift interface in detail, and demonstrated how to use it. He didn’t cover running Thrift in a secure Apache HBase cluster, however, because there was no difference in the client configuration with the HBase releases available at that time.
Thanks to Pengyu Wang, software developer at FINRA, for permission to republish this post.
Salted, pre-split Apache HBase tables are a proven, effective way to distribute workload uniformly across RegionServers and prevent hot spots during bulk writes. In this design, a row key is made up of a logical key prefixed with a salt. One way of generating the salt is to take the hash code of the logical row key (date, etc.) modulo n, where n is the number of regions.
Salting Row Keys
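As a rough illustration of the salting scheme described above, the following Python sketch derives a salt from a hash of the logical key modulo the number of regions and prepends it to the key. The function names (`salted_row_key`, `split_points`), the use of MD5 as the hash, and the zero-padded decimal prefix are assumptions made for this example; they are not taken from the original post.

```python
import hashlib


def salted_row_key(logical_key: str, num_regions: int) -> bytes:
    """Return the logical key prefixed with a salt in [0, num_regions)."""
    # Use a stable hash: Python's built-in hash() for strings varies per process.
    digest = hashlib.md5(logical_key.encode("utf-8")).digest()
    salt = int.from_bytes(digest[:4], "big") % num_regions
    # Zero-pad so every salt has the same width and sorts lexicographically.
    width = len(str(num_regions - 1))
    return f"{salt:0{width}d}-{logical_key}".encode("utf-8")


def split_points(num_regions: int) -> list:
    """Hypothetical pre-split boundaries: one region per salt value."""
    width = len(str(num_regions - 1))
    return [f"{i:0{width}d}".encode("utf-8") for i in range(1, num_regions)]


if __name__ == "__main__":
    for day in ("2015-05-05", "2015-05-06", "2015-05-07"):
        print(salted_row_key(day, 16))
    print(split_points(4))
```

Because the salt is computed from the logical key itself, a reader who knows the logical key can recompute the full row key for a point lookup. A range scan over logical keys, by contrast, has to be issued once per salt value.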
Learn about the design decisions behind HBase’s new support for MOBs.
Apache HBase is a distributed, scalable, performant, consistent key value database that can store a variety of binary data types. It excels at storing many relatively small values (<10K), and providing low-latency reads and writes.
Evaluating which streaming architectural pattern is the best match to your use case is a precondition for a successful production deployment.
The Apache Hadoop ecosystem has become a preferred platform for enterprises seeking to process and understand large-scale data in real time. Technologies like Apache Kafka, Apache Flume, Apache Spark, Apache Storm, and Apache Samza are increasingly pushing the envelope on what is possible. It is often tempting to bucket large-scale streaming use cases together but in reality they tend to break down into a few different architectural patterns, with different components of the ecosystem better suited for different problems.
The following post about the new request throttling feature in HBase 1.1 (now shipping in CDH 5.4) was originally published on the ASF blog. We republish it here for your convenience.
Running multiple workloads on HBase has always been challenging, especially when trying to execute real-time workloads while concurrently running analytical jobs. One possible way to address this issue is to throttle analytical MR jobs so that real-time workloads are less affected.
The following post, from Cloudera intern Jonathan Lawlor, originally appeared in the Apache Software Foundation’s blog.
Over the past few months there have been a variety of nice changes made to scanners in Apache HBase. This post focuses on two such changes, namely RPC chunking (HBASE-11544) and scanner heartbeat messages (HBASE-13090). Both of these changes address long-standing issues in the client-server scan protocol. Specifically, RPC chunking deals with how a server handles the scanning of very large rows, and scanner heartbeat messages allow scan operations to progress even when aggressive server-side filtering means results are returned infrequently.
We are happy to announce the inclusion of Apache Phoenix in Cloudera Labs.
This year’s HBaseCon Use Cases track includes war stories about some of the world’s best examples of running Apache HBase in production.
As a final sneak preview leading up to the show next week, in this post, I’ll give you a window into the HBaseCon 2015 (May 7 in San Francisco) Use Cases track.
This year’s HBaseCon Ecosystem track covers projects that are complementary to HBase (with a focus on SQL) such as Apache Phoenix, Apache Kylin, and Trafodion.
In this post, I’ll give you a window into the HBaseCon 2015 (May 7 in San Francisco) Ecosystem track.
This year’s HBaseCon Development & Internals track covers new features in HBase 1.0, what’s to come in 2.0, best practices for tuning, and more.
In this post, I’ll give you a window into the HBaseCon 2015 (May 7 in San Francisco) Development & Internals track.
This year’s HBaseCon Operations track features some of the world’s largest and most impressive operators.
In this post, I’ll give you a window into the HBaseCon 2015 (May 7 in San Francisco) Operations track.
As is its tradition, this year’s HBaseCon General Session includes keynotes about the world’s most awesome HBase deployments.
It’s Spring, which also means that it’s HBaseCon season—the time when the Apache HBase community gathers for its annual ritual.
The Cloudera HBase Team are proud to be members of Apache HBase’s model community and are currently AWOL, busy celebrating the release of the milestone Apache HBase 1.0. The following, from release manager Enis Soztutar, was published today in the ASF’s blog.
HBaseCon 2015 is ON, people! Book Thursday, May 7, in your calendars.
If you’re a developer in Silicon Valley, you probably already know that since its debut in 2012, HBaseCon has been one of the best developer community conferences out there. If you’re not, this is a great opportunity to learn that for yourself: HBaseCon 2015 will occur on Thurs., May 7, 2015, at the Westin St. Francis on Union Square in San Francisco.
As we progressively move from MapReduce to Spark, we shouldn’t have to give up good HBase integration. Hence the newest Cloudera Labs project, SparkOnHBase!
Apache Spark is making a huge impact across our industry, changing the way we think about batch processing and stream processing. However, as we progressively migrate from MapReduce toward Spark, we shouldn’t have to “give up” anything. One of those capabilities we need to retain is the ability to interact with Apache HBase.
These new Apache HBase features in CDH 5.2 make multi-tenant environments easier to manage.
Historically, Apache HBase has treated all tables, users, and workloads with equal weight. This approach is sufficient for a single workload, but when multiple users and multiple workloads share the same cluster or table, conflicts can arise. Fortunately, starting with HBase in CDH 5.2 (HBase 0.98 + backports), workloads and users can now be prioritized.
This guest post from Intel Java performance architect Eric Kaczmarek (originally published here) explores how to tune Java garbage collection (GC) for Apache HBase focusing on 100% YCSB reads.
Apache HBase is an Apache open source project offering NoSQL data storage. Often used together with HDFS, HBase is widely used across the world. Well-known users include Facebook, Twitter, Yahoo, and more. From the developer’s perspective, HBase is a “distributed, versioned, non-relational database modeled after Google’s Bigtable, a distributed storage system for structured data”. HBase can easily handle very high throughput by either scaling up (i.e., deployment on a larger server) or scaling out (i.e., deployment on more servers).
The number of powerful data query tools in the Apache Hadoop ecosystem can be confusing, but understanding a few simple things about your needs usually makes the choice easy.
Ah, the good old days. I recall vividly that in 2007, I was faced with storing 1 billion XML documents and making them accessible as well as searchable. I had few choices on a shoestring budget: build something on my own (it was the rage back then, and still is), use an existing open source database like PostgreSQL or MySQL, or try this thing that Google built successfully and that was now implemented in open source under the Apache umbrella: Hadoop.
An update on community efforts to bring at-rest encryption to HDFS — a major theme of Project Rhino.
Encryption is a key requirement for many privacy and security-sensitive industries, including healthcare (HIPAA regulations), card payments (PCI DSS regulations), and the US government (FISMA regulations).
Organizing your data inside Hadoop doesn’t have to be hard — Kite SDK helps you try out new data configurations quickly in either HDFS or HBase.
Kite SDK is a Cloudera-sponsored open source project that makes it easier for you to build applications on top of Apache Hadoop. Its premise is that you shouldn’t need to know how Hadoop works to build your application on it, even though that’s an unfortunately common requirement today (because the Hadoop APIs are low-level; all you get is a filesystem and whatever else you can dream up — well, code up).
HBaseCon 2014 is in the books. Thanks to all attendees, speakers, and sponsors!
Thanks to Jonathan Natkins of WibiData for the post below about how his company extended Cloudera Manager to manage Kiji. Learn more about Kiji and the organizations using it to build real-time HBase applications at Kiji Sessions, happening on May 6, 2014, the day after HBaseCon.
As a partner of Cloudera, WibiData sees Cloudera Manager’s new extensibility framework as one of the most exciting parts of Cloudera Enterprise 5. Cloudera Manager 5.0.0 provides the single-pane view that Apache Hadoop administrators and operators want to effectively manage a cluster of machines. Additionally, Cloudera Manager now offers tight integration for partners to plug into the CDH ecosystem, which benefits Cloudera as well as WibiData.
The HBaseCon 2014 “Case Studies” track surfaces some of the most interesting (and diverse) use cases in the HBase ecosystem — and in the world of NoSQL overall — today.
HBaseCon 2014 (May 5, 2014 in San Francisco) is not just about internals and best practices; it’s also a place to explore use cases that you may not have even considered before.
The HBaseCon 2014 “Ecosystem” track offers a cross-section view of the most interesting projects emerging on top of, or alongside, HBase.
The HBaseCon 2014 “Features & Internals” track covers the newest developments in Apache HBase functionality.
The conclusion to this series covers how to use scans, and considerations for choosing the Thrift or REST APIs.
In this series of how-tos, you have learned how to use Apache HBase’s Thrift interface. Part 1 covered the basics of the API, working with Thrift, and some boilerplate code for connecting to Thrift. Part 2 showed how to insert and to get multiple rows at a time. In this third and final post, you will learn how to use scans and some considerations when choosing between REST and Thrift.
Scanning with Thrift
HBaseCon 2014 “Operations” track reveals best practices used by some of the world’s largest production-cluster operators.
The HBaseCon 2014 General Session – with keynotes by Facebook, Google, and Salesforce.com engineers – is arguably the best ever.
HBaseCon 2014 (May 5, 2014 in San Francisco) is coming very, very soon. Over the next few weeks, as I did for last year’s conference, I’ll be bringing you sneak previews of session content (across Operations, Features & Internals, Ecosystem, and Case Studies tracks) accepted by the Program Committee.
Users of diverse, real-world HBase deployments around the world present at this year’s event.
This year’s agenda for HBaseCon, the conference for the Apache HBase community (developers, operators, contributors), looks “Stack-ed” with can’t-miss keynotes and breakouts. Program committee, you really came through (again).
Cloudera’s own enterprise data hub is yielding great results for providing world-class customer support.
Here at Cloudera, we are constantly pushing the envelope to give our customers world-class support. One of the cornerstones of this effort is the Cloudera Support Interface (CSI), which we’ve described in prior blog posts (here and here). Through CSI, our support team is able to quickly reason about a customer’s environment, search for information related to a case currently being worked, and much more.
These suggestions from the Program Committee offer an inside track to getting your talk accepted!
With HBaseCon 2014 (in San Francisco on May 5) Call for Papers closing in just over three weeks (on Feb. 14 — sooner than you think), there’s no better time than “now” to start thinking about your proposal.
The third-annual HBaseCon is now open for business. Submit your paper or register today for early bird savings!
Seems like only yesterday that droves of Apache HBase developers, committers/contributors, operators, and other enthusiasts converged in San Francisco for HBaseCon 2013 — nearly 800 of them, in fact.
With the close of 2013, we also thought it appropriate to include some high points from across the year (not listed in any particular order):
The compactions model is changing drastically with CDH 5/HBase 0.96. Here’s what you need to know.
Apache HBase is a distributed data store based upon a log-structured merge tree, so optimal read performance would come from having only one file per store (Column Family). However, that ideal isn’t possible during periods of heavy incoming writes. Instead, HBase will try to combine HFiles to reduce the maximum number of disk seeks needed for a read. This process is called compaction.
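To make the idea of combining files concrete, here is a toy Python model of a compaction, offered as a sketch under stated assumptions rather than as HBase's actual algorithm. Each "store file" is assumed to be a list of `(row_key, timestamp, value)` tuples already sorted by key and then by newest timestamp first. The `compact` function and this tuple layout are hypothetical. Unlike real HBase, the sketch keeps only the newest value per key and ignores delete markers and file-selection policy.

```python
import heapq
from itertools import groupby


def compact(store_files):
    """Merge several sorted runs into one, keeping the newest cell per row key."""
    # Merge lazily by (key ascending, timestamp descending) without re-sorting.
    merged = heapq.merge(*store_files, key=lambda cell: (cell[0], -cell[1]))
    # Within each key group the first cell is the newest; keep only that one.
    return [next(cells) for _, cells in groupby(merged, key=lambda cell: cell[0])]


if __name__ == "__main__":
    file_a = [("row1", 1, "a"), ("row3", 1, "c")]
    file_b = [("row1", 2, "A"), ("row2", 1, "b")]
    # A read before compaction may need to consult both files;
    # afterwards a single sorted file holds every row.
    print(compact([file_a, file_b]))
```

The merge is a single sequential pass over inputs that are already sorted, which is why this kind of rewrite can be done without loading and re-sorting all the data in memory.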
The second how-to in a series about using the Apache HBase Thrift API
Last time, we covered the fundamentals about connecting to Thrift via Python. This time, you’ll learn how to insert and get multiple rows at a time.
Working with Tables
Get an overview of the available mechanisms for backing up data stored in Apache HBase, and how to restore that data in the event of various data recovery/failover scenarios
With increased adoption and integration of HBase into critical business systems, many enterprises need to protect this important business asset by building out robust backup and disaster recovery (BDR) strategies for their HBase clusters. As daunting as it may sound to quickly and easily back up and restore potentially petabytes of data, HBase and the Apache Hadoop ecosystem provide many built-in mechanisms to accomplish just that.
Cloudera Manager 4.7 added support for managing Cloudera Search 1.0. Thus Cloudera Manager users can easily deploy all components of Cloudera Search (including Apache Solr) and manage all related services, just like every other service included in CDH (Cloudera’s distribution of Apache Hadoop and related projects).
In this how-to, you will learn the steps involved in adding Cloudera Search to a Cloudera Enterprise (CDH + Cloudera Manager) cluster.
Installing the SOLR Parcel
We at Cloudera University have been busy lately, building and expanding our courses to help data professionals succeed. We’ve expanded the Hadoop Administrator course and created a new Data Analyst course. Now we’ve updated and relaunched our course on Apache HBase to help more organizations adopt Hadoop’s real-time Big Data store as a competitive advantage.
The course is designed to make sure developers and administrators with an HBase use case can start realizing value from day one. We doubled the length of the curriculum to four days, allowing a deep dive into HBase operations as well as development.
In my previous post you learned how to index email messages in batch mode, and in near real time, using Apache Flume with MorphlineSolrSink. In this post, you will learn how to index emails using Cloudera Search with Apache HBase and Lily HBase Indexer, maintained by NGDATA and Cloudera. (If you have not read the previous post, I recommend you do so for background before reading on.)
Which near-real-time method to choose, HBase Indexer or Flume MorphlineSolrSink, will depend entirely on your use case, but below are some things to consider when making that decision:
Apache ZooKeeper is a client/server system for distributed coordination that exposes an interface similar to a filesystem, where each node (called a znode) may contain data and a set of children. Each znode has a name and can be identified using a filesystem-like path (for example, /root-znode/sub-znode/my-znode).
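The filesystem-like data model described above can be pictured with a small in-memory Python sketch. This is only an illustration of the tree of znodes and their paths; it is not a ZooKeeper client, and the `ZNode` and `ZTree` classes and their methods are hypothetical names invented for this example.

```python
class ZNode:
    """One node in the tree: a data payload plus a set of named children."""

    def __init__(self, data=b""):
        self.data = data
        self.children = {}


class ZTree:
    """A toy, single-process stand-in for the znode hierarchy."""

    def __init__(self):
        self.root = ZNode()

    def _walk(self, path):
        # Follow each path component from the root; KeyError if one is missing.
        node = self.root
        for name in path.strip("/").split("/"):
            if name:
                node = node.children[name]
        return node

    def create(self, path, data=b""):
        # The parent must already exist, mirroring the filesystem analogy.
        parent_path, _, name = path.rstrip("/").rpartition("/")
        parent = self._walk(parent_path)
        if name in parent.children:
            raise ValueError(f"znode already exists: {path}")
        parent.children[name] = ZNode(data)

    def get(self, path):
        return self._walk(path).data

    def get_children(self, path):
        return sorted(self._walk(path).children)


if __name__ == "__main__":
    tree = ZTree()
    tree.create("/root-znode")
    tree.create("/root-znode/sub-znode")
    tree.create("/root-znode/sub-znode/my-znode", b"hello")
    print(tree.get("/root-znode/sub-znode/my-znode"))
    print(tree.get_children("/root-znode"))
```

Unlike a plain filesystem, where a directory holds no payload of its own, every node here can carry data and have children at the same time, which is the property the paragraph above highlights.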
In Apache HBase, ZooKeeper coordinates, communicates, and shares state between the Masters and RegionServers. HBase has a design policy of using ZooKeeper only for transient data (that is, for coordination and state communication). Thus if HBase’s ZooKeeper data is removed, only the transient operations are affected – data can continue to be written to and read from HBase.
The following post, by Apache HBase 0.96 Release Manager/Cloudera Software Engineer Michael Stack, was published originally at blogs.apache.org and is provided below for your convenience. Our thanks to the release’s numerous contributors!
Note: HBase 0.96 will be packaged in the next release of CDH (CDH 5).
The following guest post is provided by Artur Barseghyan, a web developer currently employed by Goldmund, Wyldebeast & Wunderliebe in The Netherlands.
Python is my personal (and primary) programming language of choice and also happens to be the primary programming language at my company. So, when starting to work with a new technology, I prefer to use a clean and easy (Pythonic!) API.
In December 2012, we described how an internal application built on CDH called Cloudera Support Interface (CSI), which drastically improves Cloudera’s ability to optimally support our customers, is a unique and instructive use case for Apache Hadoop. In this post, we’ll follow up by describing two new differentiating CSI capabilities that have made Cloudera Support yet more responsive for customers:
Apache HBase is all about giving you random, real-time, read/write access to your Big Data, but how do you efficiently get that data into HBase in the first place? Intuitively, a new user will try to do that via the client APIs or by using a MapReduce job with TableOutputFormat, but those approaches are problematic, as you will learn below. Instead, the HBase bulk loading feature is much easier to use and can insert the same amount of data more quickly.
This blog post will introduce the basic concepts of the bulk loading feature, present two use cases, and propose two examples.
Overview of Bulk Loading
Those people have two main options: One is the Thrift interface (the more lightweight and hence faster of the two options), and the other is the REST interface (aka Stargate). A REST interface uses HTTP verbs to perform an action. By using HTTP, a REST interface offers a much wider array of languages and programs that can access the interface. (If you’d like more information about the REST interface, you can go to my series of how-to’s about it.)
The following post was originally published by the Hue Team at the Hue blog in a slightly different form.
In this post, we’ll take a look at the new Apache HBase Browser App, which was added in Hue 2.5 and has improved significantly since then. To get the Hue HBase browser, grab Hue via CDH 4.4 packages, via Cloudera Manager, or build it directly from GitHub.
While Apache HBase adoption for building end-user applications has skyrocketed, many of those applications (and many apps generally) have not been well-tested. In this post, you’ll learn some of the ways this testing can easily be done.
We will start with unit testing via JUnit, then move on to using Mockito and Apache MRUnit, and then to using an HBase mini-cluster for integration testing. (The HBase codebase itself is tested via a mini-cluster, so why not tap into that for upstream applications, as well?)
Apache HBase supports three primary client APIs that developers can use to bind applications with HBase: the Java API, the REST API, and the Thrift API. Therefore, as developers build apps against HBase, it’s very important for them to be aware of the compatibility guidelines with respect to CDH.
This blog post will describe the efforts that go into protecting the experience of a developer using the Java API. Through its testing work, Cloudera allows developers to write code and sleep well at night, knowing that their code will remain compatible through supported upgrade paths.
The ecosystem is evolving at a rapid pace – so rapidly, that important developments are often passing through the public attention zone too quickly. Thus, we think it might be helpful to bring you a digest (by no means complete!) of our favorite highlights on a regular basis. (This effort, by the way, has different goals than the fine Hadoop Weekly newsletter, which has a more expansive view – and which you should subscribe to immediately, as far as we’re concerned.)
Find the first installment below. Although the time period reflected here is obviously more than a month long, we have some catching up to do before we can move to a truly monthly cadence.