Cloudera Blog · HBase Posts
Apache HBase AssignmentManager Improvements
AssignmentManager is a module in the Apache HBase Master that manages regions to RegionServers assignment. (See HBase architecture for more information.) It ensures that all regions are assigned and each region is assigned to just one RegionServer.
Although the AssignmentManager generally does a good job, the existing implementation does not handle assignments as well as it could. For example, if a region was assigned to two or more RegionServers, some regions were stuck in transition and never got assigned, or unknown region exceptions were thrown in moving a region from one RegionServer to another.
In the past we tried to fix these bugs without changing the underlying design. Consequently, the AssignmentManager ended up having many band-aids, and the code base became hard to understand/maintain. Furthermore, the underlying issues had not been completely fixed.
Streaming Data into Apache HBase using Apache Flume
The following post was originally published via blog.apache.org; we are re-publishing it here.
Apache Flume was conceived as a fault-tolerant ingest system for the Apache Hadoop ecosystem. Flume comes packaged with an HDFS Sink which can be used to write events into HDFS, and two different implementations of HBase sinks to write events into Apache HBase. You can read about the basic architecture of Apache Flume 1.x in this blog post. You can also read about how Flume’s File Channel persists events and still provides extremely high performance in an earlier blog post. In this article, we will explore how to configure Flume to write events into HBase, and write custom serializers to write events into HBase in a format of the user’s choice.
Data is stored in HBase as tables. Each table has one or more column families, and each column family has one or more columns. HBase stores all columns in a column family in physical proximity. Each row is identified by a key known as the row key. To insert data into HBase, the table name, column family, column name and row key have to be specified. More details on the HBase data model can be found in the HBase documentation.
Announcing the Kiji Project: An Open Source Framework for Building Big Data Applications with Apache HBase
- by Aaron Kimball
- November 14, 2012
- no comments
The following is a guest post from Aaron Kimball, who was Cloudera’s first engineer and the creator of the Apache Sqoop project. He is the Founder and CTO at WibiData, a San Francisco-based company building big data applications.
Our team at WibiData has been developing applications on Hadoop since 2010 and we’ve helped many organizations transform how they use data by deploying Hadoop. HBase in particular has allowed companies of all types to drive their business using scalable, high performance storage. Organizations have started to leverage these capabilities for various big data applications, including targeted content, personalized recommendations, enhanced customer experience and social network analysis.
While building many of these applications, we have seen emerging tools, design patterns and best practices repeated across projects. One of the clear lessons learned is that Hadoop and HBase provide very low-level interfaces. Each large-scale application we have built on top of Hadoop has required a great deal of scaffolding and data management code. This repetitive programming is tedious, error-prone, and makes application interoperability more challenging in the long run.
Cloudera Impala: Real-Time Queries in Apache Hadoop, For Real
- by Marcel Kornacker & Justin Erickson
- October 24, 2012
- 16 comments
After a long period of intense engineering effort and user feedback, we are very pleased, and proud, to announce the Cloudera Impala project. This technology is a revolutionary one for Hadoop users, and we do not take that claim lightly.
When Google published its Dremel paper in 2010, we were as inspired as the rest of the community by the technical vision to bring real-time, ad hoc query capability to Apache Hadoop, complementing traditional MapReduce batch processing. Today, we are announcing a fully functional, open-sourced codebase that delivers on that vision – and, we believe, a bit more – which we call Cloudera Impala. An Impala binary is now available in public beta form, but if you would prefer to test-drive Impala via a pre-baked VM, we have one of those for you, too. (Links to all downloads and documentation are here.) You can also review the source code and testing harness at Github right now.
Impala raises the bar for query performance while retaining a familiar user experience. With Impala, you can query data, whether stored in HDFS or Apache HBase – including SELECT, JOIN, and aggregate functions – in real time. Furthermore, it uses the same metadata, SQL syntax (Hive SQL), ODBC driver and user interface (Hue Beeswax) as Apache Hive, providing a familiar and unified platform for batch-oriented or real-time queries. (For that reason, Hive users can utilize Impala with little setup overhead.) The first beta drop includes support for text files and SequenceFiles; SequenceFiles can be compressed as Snappy, GZIP, and BZIP (with Snappy recommended for maximum performance). Support for additional formats including Avro, RCFile, LZO text files, and the Parquet columnar format is planned for the production drop.
HBase at ApacheCon Europe 2012
Apache HBase will have a notable profile at ApacheCon Europenext month. Clouderan and HBase committer Lars George has two sessions on the schedule:
Meet the Engineer: Todd Lipcon
In this installment of “Meet the Engineers”, meet Todd Lipcon (@tlipcon), PMC member/committer for the Hadoop, HBase, and Thrift projects.
What do you do at Cloudera, and in which Apache project are you involved?
New Additions to the Apache HBase Team
- by St.Ack
- October 21, 2012
- no comments
StumbleUpon (SU) and Cloudera have signed a technology collaboration agreement. Cloudera will support the SU clusters, and in exchange, Cloudera will have access to a variety of production deploys on which to study and try out beta software.
As part of the agreement, the StumbleUpon Apache HBase+Apache Hadoop team — Jean-Daniel Cryans, Elliott Clark and I — have joined Cloudera. From our new perch up in the Cloudera San Francisco office — 10 blocks north and 11 floors up — we will continue as first-level support for SU clusters, tending and optimizing them as we have always done. The rest of our time will be spent helping develop Apache HBase as the newest additions to Cloudera’s HBase team.
We do not foresee this transition disrupting our roles as contributors to HBase. If anything, we look forward to contributing even more than in the past.
CDH4.1 Now Released!
Update time! As a reminder, Cloudera releases major versions of CDH, our 100% open source distribution of Apache Hadoop and related projects, annually and then updates to CDH every three months. Updates primarily comprise bug fixes but we will also add enhancements. We only include fixes or enhancements in updates that maintain compatibility, improve system stability and still allow customers and users to skip updates as they see fit.
We’re pleased to announce the availability of CDH4.1. We’ve seen excellent adoption of CDH4.0 since it went GA at the end of June and a number of exciting use cases have moved to production. CDH4.1 is an update that has a number of fixes but also a number of useful enhancements. Among them:
How-to: Enable User Authentication and Authorization in Apache HBase
With the default Apache HBase configuration, everyone is allowed to read from and write to all tables available in the system. For many enterprise setups, this kind of policy is unacceptable.
Administrators can set up firewalls that decide which machines are allowed to communicate with HBase. However, machines that can pass the firewall are still allowed to read from and write to all tables. This kind of mechanism is effective but insufficient because HBase still cannot differentiate between multiple users that use the same client machines, and there is still no granularity with regard to HBase table, column family, or column qualifier access.
In this post, we will discuss how Kerberos is used with Hadoop and HBase to provide User Authentication, and how HBase implements User Authorization to grant users permissions for particular actions on a specified set of data.
Secure HBase: Authentication & Authorization
Schedule This! Strata + Hadoop World Speakers from Cloudera
We’re getting really close to Strata Conference + Hadoop World 2012 (just over a month away), schedule planning-wise. So you may want to consider adding the tutorials, sessions, and keynotes below to your calendar! (Start times are always subject to change of course.)
The ones listed below are led or co-led by Clouderans, but there is certainly a wide range of attractive choices beyond what you see here. We just want to ensure that you put these particular ones high on your consideration list.
If you’re interested in community meetups as well, refer to my post from last week on that subject – several are planned.
| Session/Tutorial/Keynote Name | Presenter(s) | Date | Time |
| An Introduction to Hadoop | Mark Fei | Tues., Oct. 23 | 9am |
| Using HBase | Amandeep Khurana, Matteo Bertozzi | Tues., Oct. 23 | 9am |
| Testing Hadoop Applications | Tom Wheeler | Tues., Oct. 23 | 9am |
| Building a Large-scale Data Collection System Using Flume NG | Hari Shreedharan, Will McQueen, Arvind Prabhakar, Prasad Mujumdar, Mike Percy | Tues., Oct. 23 | 1:30pm |
| Given Enough Monkeys – Some Thoughts on Randomness | Jesse Anderson | Tues., Oct. 23 | 3:20pm |
| Keynote: Big Answers | Mike Olson | Weds., Oct. 24 | 8:55am |
| Large Scale ETL with Hadoop | Eric Sammer | Weds., Oct. 24 | 11:40am |
| HDFS – What is New and Future | Todd Lipcon (co-presenter) | Weds., Oct. 24 | 4:10pm |
| High Availability for the HDFS NameNode: Phase 2 | Aaron Myers, Todd Lipcon | Weds., Oct. 24 | 5pm |
| Plenary Session: Beyond Batch | Doug Cutting | Thurs., Oct. 25 | 9:20am |
| Upcoming Enterprise Features in Apache HBase 0.96 | Jon Hsieh | Thurs., Oct. 25 | 11:40am |
| Data Science on Hadoop: What’s There and What’s Missing | Justin Erickson | Thurs., Oct. 25 | 1:40pm |
| Taming the Elephant – Learn How Monsanto Manages Their Hadoop Cluster to Enable Genome/Sequence Processing | Bala Venkatrao, Aparna Ramani (with others) | Thurs., Oct. 25 | 4:10pm |
| Knitting Boar | Josh Patterson, Michael Katzenellenbogen | Thurs., Oct. 25 | 4:10pm |