Cloudera Engineering Blog · HBase Posts
In June 2012, Eli Collins (@elicollins), from Cloudera’s Platforms team, led a session at QCon New York 2012 on the subject “Introducing Apache Hadoop: The Modern Data Operating System.” During the conference, the QCon team had an opportunity to interview Eli about several topics, including important things to know about CDH4, main differences between MapReduce 1.0 and 2.0, Hadoop use cases, and more. It’s a great primer for people who are relatively new to Hadoop.
You can catch the full interview (video and transcript versions) here.
This is the second blogpost about Apache HBase replication. The previous blogpost, HBase Replication Overview, discussed use cases, architecture and different modes supported in HBase replication. This blogpost is from an operational perspective and will touch upon HBase replication configuration, and key concepts for using it — such as bootstrapping, schema change, and fault tolerance.
As mentioned in HBase Replication Overview, the master cluster sends shipment of WALEdits to one or more slave clusters. This section describes the steps needed to configure replication in a master-slave mode.
- All tables/column families that are to be replicated must exist on both the clusters.
- Add the following property in $HBASE_HOME/conf/hbase-site.xml on all nodes on both clusters; set it to true.
We are happy to announce the general availability of CDH3 update 5. This update is a maintenance release of CDH3 platform and provides a considerable amount of bug-fixes and stability enhancements. Alongside these fixes, we have also included a few new features, most notable of which are the following:
Apache HBase Replication is a way of copying data from one HBase cluster to a different and possibly distant HBase cluster. It works on the principle that the transactions from the originating cluster are pushed to another cluster. In HBase jargon, the cluster doing the push is called the master, and the one receiving the transactions is called the slave. This push of transactions is done asynchronously, and these transactions are batched in a configurable size (default is 64MB). Asynchronous mode incurs minimal overhead on the master, and shipping edits in a batch increases the overall throughput.
This blogpost discusses the possible use cases, underlying architecture and modes of HBase replication as supported in CDH4 (which is based on 0.92). We will discuss Replication configuration, bootstrapping, and fault tolerance in a follow up blogpost.
In the recent blog post about the Apache HBase Write Path, we talked about the write-ahead-log (WAL), which plays an important role in preventing data loss should a HBase region server failure occur. This blog post describes how HBase prevents data loss after a region server crashes, using an especially critical process for recovering lost updates called log splitting.
As we mentioned in the write path blog post, HBase data updates are stored in a place in memory called memstore for fast write. In the event of a region server failure, the contents of the memstore are lost because they have not been saved to disk yet. To prevent data loss in such a scenario, the updates are persisted in a WAL file before they are stored in the memstore. In the event of a region server failure, the lost contents in the memstore can be regenerated by replaying the updates (also called edits) from the WAL file.
Apache Flume is a scalable, reliable, fault-tolerant, distributed system designed to collect, transfer, and store massive amounts of event data into HDFS. Apache Flume recently graduated from the Apache Incubator as a Top Level Project at Apache. Flume is designed to send data over multiple hops from the initial source(s) to the final destination(s). Click here for details of the basic architecture of Flume. In this article, we will discuss in detail some new components in Flume 1.x (also known as Flume NG), which is currently on the trunk branch, techniques and components that can be be used to route the data, configuration validation, and finally support for serializing events.
In the past several months, contributors have been busy adding several new sources, sinks and channels to Flume. Flume now supports Syslog as a source, where sources have been added to support Syslog over TCP and UDP.
Apache HBase is the Hadoop open-source, distributed, versioned storage manager well suited for random, realtime read/write access.
Wait wait? random, realtime read/write access?
How is that possible? Is not Hadoop just a sequential read/write, batch processing system?
HBaseCon 2012 summation provided by Michael Stack, PMC Chair of the Apache HBase Project. HBase Hack-a-thon summation provided by David Wang, Engineering Manager for the Cloudera HBase team.
HBaseCon 2012 Summation
Apache HBase is the Hadoop database, and is based on the Hadoop Distributed File System (HDFS). HBase makes it possible to randomly access and update data stored in HDFS, but files in HDFS can only be appended to and are immutable after they are created. So you may ask, how does HBase provide low-latency reads and writes? In this blog post, we explain this by describing the write path of HBase — how data is updated in HBase.
The write path is how an HBase completes put or delete operations. This path begins at a client, moves to a region server, and ends when data eventually is written to an HBase data file called an HFile. Included in the design of the write path are features that HBase uses to prevent data loss in the event of a region server failure. Therefore understanding the write path can provide insight into HBase’s native data loss prevention mechanism.
One of the major features of the upcoming Apache HBase 0.96 release is improved support for compatibility and extensibility across different HBase versions. This includes support for the following:
CopyTable is a simple Apache HBase utility that, unsurprisingly, can be used for copying individual tables within an HBase cluster or from one HBase cluster to another. In this blog post, we’ll talk about what this tool is, why you would want to use it, how to use it, and some common configuration caveats.
CopyTable is at its core an Apache Hadoop MapReduce job that uses the standard HBase Scan read-path interface to read records from an individual table and writes them to another table (possibly on a separate cluster) using the standard HBase Put write-path interface. It can be used for many purposes:
Apache HBase 0.94.0 has been released! This is the first major release since the January 22nd HBase 0.92 release. In the HBase 0.94.0 release the main focuses were on performance enhancements and the addition of new features (Also, several major bug fixes).
Performance Related JIRAs
Below are a few of the important performance related JIRAs:
This is a guest post by Assaf Yardeni, Head of R&D for Treato, an online social healthcare solution, headquartered in Israel.
Three years ago I joined Treato, a social healthcare analysis firm to help treato.com scale up to its present capability. Treato is a new source for healthcare information where health-related user generated content (UGC) from the Internet is aggregated and organized into usable insights for patients, physicians and other healthcare professionals. With oceans of patient-written health-related information available on the Web, and more being published each day, Treato needs to be able to collect and process vast amounts of data – Treato is Big Data par excellence, and my job has been to bring Treato to this stage.
Before the Hadoop era
HBaseCon 2012 is only a month away! The conference takes place May 22 in San Francisco, California and the event is poised to sell out.
Apache HBase is an open source software project that provides users with the ability to do real-time random read/write access to their data in Apache Hadoop. This means that when you want to use Hadoop for real-time data processing, HBase is the project you are looking for. The HBase developer community includes contributors from many organizations such as StumbleUpon, Facebook, Salesforce.com, TrendMicro, eBay, Explorys, Huawei and Cloudera. In fact, the HBaseCon Program Committee, constructors of the HBaseCon 2012 agenda, are all committers and PMC members of the Apache HBase project.
Cloudera will be hosting an Apache HBase hackathon on May 23rd, 2012, the day after HBaseCon 2012. The overall theme of the event will be 0.96 stabilization. If you are in the area for HBaseCon, please come down to our offices in Palo Alto the next day to attend the hackathon. This is a great opportunity to contribute some code towards the project and hang out with other HBasers.
More details are on the hackathon’s Meetup page. Please RSVP so we can better plan lunch, room size, and other logistics for the event. See you there!
The Bay Area HBase User Group March 2012 meetup was held at the StumbleUpon offices in San Francisco, California. 80 interested Apache HBasers were in attendance to mingle and listen to the scheduled presentations.
Michael Stack started the meetup by reminding folks to register for HBaseCon 2012 in San Francisco on May 22nd. Nick Dimiduk and Cloudera’s Amandeep Khurana then announced an early access program for their upcoming book, HBase In Action. Interested folks can get a discount for the program by using the code “hbase38.”
Apache HBase 0.92.1 is now available. This release is a marked improvement in system correctness, availability, and ease of use. It’s also backwards compatible with 0.92.0 — except for the removal of the rarely-used transform functionality from the REST interface in HBASE-5228.
Apache HBase 0.92.1 is a bug fix release covering 61 issues – including 6 blockers and 6 critical issues, such as:
Some of the configuration properties found in Apache Hadoop have a direct effect on clients, such as Apache HBase. One of those properties is called “dfs.datanode.max.xcievers”, and belongs to the HDFS subproject. It defines the number of server side threads and – to some extent – sockets used for data connections. Setting this number too low can cause problems as you grow or increase utilization of your cluster. This post will help you to understand what happens between the client and server, and how to determine a reasonable number for this property.
Since HBase is storing everything it needs inside HDFS, the hard upper boundary imposed by the ”dfs.datanode.max.xcievers” configuration property can result in too few resources being available to HBase, manifesting itself as IOExceptions on either side of the connection. Here is an example from the HBase mailing list , where the following messages were initially logged on the RegionServer side:
We’re excited to host the first ever HBaseCon this May 22, 2012 in San Francisco – the industry’s first Apache HBase community conference. The theme of this first HBaseCon is “Real-Time Your Hadoop,” and the conference will be a culmination of the best of the HBase community, including speakers and sponsors from across the HBase landscape.
As Michael Stack, the champion of the HBase community and engineer at StumbleUpon, put it, “It’s going to be a great day out for the community. Anyone with any kind of an HBase itch at all whether apps, ops, or dev should be sure to drop by.”
More than 150 people attended the San Francisco Bay Area HBase User Group meetup last Thursday, January 19th, at eBay headquarters in San Jose, California. Presenters from StumbleUpon, Facebook, eBay and MapR shared a wealth of information about Apache HBase operations and optimizations, gleaned from their experience running HBase in production environments.
One special item of note: Michael Stack announced HBaseCon 2012, taking place this spring in the Bay Area. This inaugural conference will focus on the growth and education of the HBase community. While details of the event are not yet published, the call for speakers is currently open. Submit your abstract here.
Today the Apache HBase community has proudly released Apache HBase 0.92.0, a major new version of the scalable distributed data store inspired by Google’s BigTable. Over 670 issues were addressed, so in this post I’ll highlight some of the major features and enhancements and describe what they mean for HBase users, admins, and developers.
While the most visible change to the project is the new project logo, the most important changes for users are the performance and robustness improvements to HBase’s core functionality. On the performance side, there are a few major highlights:
This was my summer internship project at Cloudera, and I’m very thankful for the level of support and mentorship I’ve received from the Apache HBase community. I started off in June with a very limited knowledge of both HBase and distributed systems in general, and by September, managed to get this patch committed to HBase trunk. I couldn’t have done this without a phenomenal amount of help from Cloudera and the greater HBase community.
The amount of memory available on a commodity server has increased drastically in tune with Moore’s law. Today, its very feasible to have up to 96 gigabytes of RAM on a mid-end, commodity server. This extra memory is good for databases such as HBase which rely on in memory caching to boost read performance.
Apache HBase 0.90.5 is now available. This release of the scalable distributed data store inspired by Google’s BigTable is a fix release that covers 81 issues, including 5 considered blockers, and 11 considered critical. The release addresses several robustness and resource leakage issues, fixes rare data-loss scenarios having to do with splits and replication, and improves the atomicity of bulk loads. This version includes some new supporting features including improvements to hbck and an offline meta-rebuild disaster recovery mechanism.
The 0.90.5 release is backward compatible with 0.90.4. Many of the fixes in this release will be included as part of CDH3u3.
This guest blog post is from Alex Loddengaard, creator of FoneDoktor, an Android app that monitors phone usage and recommends performance and battery life improvements. FoneDoktor uses WibiData, a data platform built on Apache HBase from Cloudera’s Distribution including Apache Hadoop, to store and analyze Android usage data. In this post, Alex will discuss FoneDoktor’s implementation and discuss why WibiData was a good data solution. A version of this post originally appeared at the WibiData blog.
At last month’s Hadoop World, one of the sessions spotlighted FoneDoktor, an Android app that collects data about device performance and app resource usage to offer personalized battery and performance improvement recommendations directly to users. In this post, I’ll talk about how I used WibiData — a system built on Apache HBase from CDH — as FoneDoktor’s primary data storage, access, and analysis system.
San Francisco, Salesforce.com HQ - Recently there was an Apache HBase Pow-wow where project contributors gathered to discuss the directions of future releases of HBase in person. This group included a quorum of the core committers from Facebook, StumbleUpon, Salesforce, eBay, and Cloudera as well as many contributors and users from other companies. This was an open discussion, and in compliance with Apache Software Foundation policies, the agenda and detailed minutes were shared with the community at large so that everyone can chime in before any final decisions are made.
We summarize some of the high-level discussion topics:
The Development track at Hadoop World is a technical deep dive dedicated to discussion about Apache Hadoop and application development for Apache Hadoop. You will hear committers, contributors and expert users from various Hadoop projects discuss the finer points of building applications with Hadoop and the related ecosystem. The sessions will touch on foundational topics such as HDFS, HBase, Pig, Hive, Flume and other related technologies. In addition, speakers will address key development areas including tools, performance, bringing the stack together and testing the stack. Sessions in this track are for developers of all levels who want to learn more about upcoming features and enhancements, new tools, advanced techniques and best practices.
This blog was originally posted on the Apache Blog: https://blogs.apache.org/sqoop/entry/apache_sqoop_overview
Using Hadoop for analytics and data processing requires loading data into clusters and processing it in conjunction with other data that often resides in production databases across the enterprise. Loading bulk data into Hadoop from production systems or accessing it from map reduce applications running on large clusters can be a challenging task. Users must consider details like ensuring consistency of data, the consumption of production system resources, data preparation for provisioning downstream pipeline. Transferring data using scripts is inefficient and time consuming. Directly accessing data residing on external systems from within the map reduce applications complicates applications and exposes the production system to the risk of excessive load originating from cluster nodes.
Continuing with our practice from Cloudera’s Distribution Including Apache Hadoop v2 (CDH2), our goal is to provide regular (quarterly), predictable updates to the generally available release of our open source distribution. For CDH3 the first such update is available today, approximately 3 months from when CDH3 went GA.
For those of you who are recent Cloudera users, here is a refresh on our update policy:
Klout’s goal is to be the standard for influence. The advent of social media has created a huge number of measurable relationships. On Facebook, people have an average of 130 friends. On Twitter, the average number of followers range from 300+ to 1000+. With each relationship comes a different source of data. This has created A LOT of noise and an attention economy. Influence has the power to drive this attention.
When a company, brand, or person creates content, our goal is to measure the actions on that content. We want to measure every view, click, like, share, comment, retweet, mention, vote, check-in, recommendation, and so on. We want to know how influential the person who *acted* on that content is. We want to know the actual meaning of that content. And we want to know all of this, over time.
I recently gave a talk at the LA Hadoop User Group about Apache HBase Do’s and Don’ts. The audience was excellent and had very informed and well articulated questions. Jody from Shopzilla was an excellent host and I owe him a big thanks for giving the opportunity to speak with over 60 LA Hadoopers. Since not everyone lives in LA or could make it to the meetup, I’ve summarized some of the salient points here. For those of you with a busy day, here’s the tl;dr:
This is the third and final post in a series detailing a recent improvement in Apache HBase that helps to reduce the frequency of garbage collection pauses. Be sure you’ve read part 1 and part 2 before continuing on to this post.
It’s been a few days since the first two posts, so let’s start with a quick refresher. In the first post we discussed Java garbage collection algorithms in general and explained that the problem of lengthy pauses in HBase has only gotten worse over time as heap sizes have grown. In the second post we ran an experiment showing that write workloads in HBase cause memory fragmentation as all newly inserted data is spread out into several MemStores which are freed at different points in time.
Arena Allocators and TLABs
This is the second post in a series detailing a recent improvement in Apache HBase that helps to reduce the frequency of garbage collection pauses. Be sure you’ve read part 1 before continuing on to this post.
Recap from Part 1
In last week’s post, we noted that HBase has had problems coping with long garbage collection pauses, and we summarized the different garbage collection algorithms commonly used for HBase on the Sun/Oracle Java 6 JVM. Then, we hypothesized that the long garbage collection pauses are due to memory fragmentation, and devised an experiment to both confirm this hypothesis and investigate which workloads are most prone to this problem.
Today, rather than discussing new projects or use cases built on top of CDH, I’d like to switch gears a bit and share some details about the engineering that goes into our products. In this post, I’ll explain the MemStore-Local Allocation Buffer, a new component in the guts of Apache HBase which dramatically reduces the frequency of long garbage collection pauses. While you won’t need to understand these details to use Apache HBase, I hope it will provide an interesting view into the kind of work that engineers at Cloudera do.
This post was authored by Dmitry Chechik, a software engineer at TellApart, the leading Customer Data platform for large online retailers.
Apache Hadoop is widely used for log processing at scale. The ability to ingest, process, and analyze terabytes of log data has led to myriad applications and insights. As applications grow in sophistication, so does the amount and variety of the log data being produced. At TellApart, we track tens of millions of user events per day, and have built a flexible system atop HBase for storing and analyzing these types of logs offline.
This post is courtesy of Kumanan Rajamanikkam, Lead Engineer at Wordnik.
Wordnik’s Processing Challenge
At Wordnik, our goal is to build the most comprehensive, high-quality understanding of English text. We make our findings available through a robust REST api and www.wordnik.com. Our corpus grows quickly—up to 8,000 words per second. Performing deep lexical analysis on data at this rate is challenging to say the least.
“My library is in the classpath but I still get a Class Not Found exception in a MapReduce job” – If you have this problem this blog is for you.
Java requires third-party and user-defined classes to be on the command line’s “-classpath” option when the JVM is launched. The
hadoop wrapper shell script does exactly this for you by building the classpath from the core libraries located in /usr/lib/hadoop-0.20/ and /usr/lib/hadoop-0.20/lib/ directories. However, with MapReduce you job’s task attempts are executed on remote nodes. How do you tell a remote machine to include third-party and user-defined classes?
Neil Kodner, an independent consultant, is the guest author of this post. Neil found inspiration, which spurred innovation at Hadoop World 2010 from a moments decision to capture the #hw2010 streaming Twitter feed.
During the Hadoop World 2010 keynote, a majority of attendees were typing away on their laptops as Mike Olson and Tim O’Reilly dazzled the audience. Many of these laptop-users appeared to be tweeting as the keynote was taking place. Since I have more than a passing interest in twitter, Hadoop, and text mining, I thought it would be a great idea to track and store everyone’s Hadoop World tweets.
This post was contributed by Friso van Vollenhoven from Xebia. Friso is a consultant at Xebia and he is currently working at the RIPE NCC’s Information Services department on migrating an existing MySQL based solution to use Hadoop and HBase. Xebia performs large scale development projects, consultancy in architecture & auditing and helps to bring your middleware under control. Guiding clients in how to leverage agile is Xebia’s passion.
The RIPE NCC is one of five Regional Internet Registries (RIRs) providing Internet resource allocations, registration services and co-ordination activities that support the operation of the Internet globally. The RIPE NCC also provides services for the benefit of the Internet community at large. Amongst these is the Routing Information Service, which collects and stores Internet routing data from several locations around the globe. At RIPE NCC, we are in the progress of migrating an existing MySQL based system to use Apache HBase as storage backend and Hadoop MapReduce as framework for processing import jobs. In this post we will provide some background on our efforts, how we implemented it on top of Apache Hadoop and HBase and our experiences with using Hadoop and HBase in a real life scenario.
Our vision for Hadoop World is a conference where both newcomers and experienced Hadoop users can learn and be part of the growing Hadoop community.
We are also offering training sessions for newcomers and experienced Hadoop users alike. Whether you are looking for an Introduction to Hadoop, Hadoop Certification, or you want to learn more about related Hadoop projects we have the training you are looking for.
Apache Hadoop and Apache HBase are gaining popularity due to their flexibility and tremendous work that has been done to simplify their installation and use. This blog is to provide guidance in sizing your first Hadoop/HBase cluster. First, there are significant differences in Hadoop and HBase usage. Hadoop MapReduce is primarily an analytic tool to run analytic and data extraction queries over all of your data, or at least a significant portion of them (data is a plural of datum). HBase is much better for real-time read/write/modify access to tabular data. Both applications are designed for high concurrency and large data sizes. For a general discussions about Hadoop/HBase architecture and differences please refer to Cloudera, Inc. [https://wiki.cloudera.com/display/DOC/Hadoop+Installation+Documentation+for+Cloudera+Enterprise, http://blog.cloudera.com/blog/2010/07/whats-new-in-cdh3-b2-hbase], or Lars George blogs [http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html]. We expect a new edition of the Tom White’s Hadoop book [http://www.hadoopbook.com] and a new HBase book in the near future as well.
With the recent release of CDH3b2, many users are more interested than ever to try out Cloudera’s Distribution for Hadoop (CDH). One of the questions we often hear is, “what does it take to migrate?”.
If you’re not familiar with CDH3b2, here’s what you need to know.
Announcing Two New Training Classes from Cloudera: Introduction to HBase and Analyzing Data with Hive and Pig
Cloudera is pleased to announce two new training courses: a one-day Introduction to HBase and a two-day session on Analyzing Data with Hive and Pig. These join a recently-expanded two-day Hadoop for Administrators course and our popular three-day Hadoop for Developers offering, any of which can be combined to provide extensive, customized training for your organization. Please contact firstname.lastname@example.org for more information regarding on-site training, or visit www.cloudera.com/hadoop-training to view our public course schedule.
Cloudera’s HBase course discusses use-cases for HBase, and covers the HBase architecture, schema modeling, access patterns, and performance considerations. During hands-on exercises, students write code to access HBase from Java applications, and use the HBase shell to manipulate data. Introduction to HBase also covers deployment and advanced features.
This post was contributed by John Sichi, a committer on the Apache Hive project and a member of the Data Infrastructure team at Facebook.
As many readers may already know, Hive was initially developed at Facebook for dealing with explosive growth in our multi-petabyte data warehouse. Since its release as an Apache project, it has been put into use at a number of other companies for solving big data problems. Hive storage is based on Hadoop‘s underlying append-only filesystem architecture, meaning that it is ideal for capturing and analyzing streams of events (e.g. web logs). However, a data warehouse also has to relate these event streams to application objects; in Facebook’s case, these include familiar items such as fan pages, user profiles, photo albums, or status messages.
Hive can store this information easily, even for hundreds of millions of users, but keeping the warehouse up to date with the latest information published by users can be a challenge, as the append-only constraint makes it impossible to directly apply individual updates to warehouse tables. Up until now, the only practical option has been to periodically pull snapshots of all of the information from live MySQL databases and dump them to new Hive partitions. This is a costly operation, meaning it can be done at most daily (leading to stale data in the warehouse), and does not scale well as data volumes continue to shoot through the roof.
Around the globe, more and more companies are turning to Hadoop to tackle data processing problems that don’t lend themselves well to traditional systems. Users in the community consistently ask us to offer training in more places and expand our course offerings, and those who have obtained certification have reported great success connecting with companies investing in Hadoop. All of this keeps us pretty excited about the long term prospects for Hadoop.
We recently announced our first international developer training sessions in Tokyo (sold out, waitlist available) and Taiwan, and we’re happy to follow up with sessions in the EU. We’ll be visiting London the first week of June, and Berlin the next. If you’ll be in Berlin that week, be sure to check out the Berlin Buzzwords conference – a two day event focused on Hadoop, Lucene, and NoSQL.
In September 2009, we announced the first release of CDH2, our current testing repository. Packages in our testing repository are recommended for people who want more features and are willing to upgrade as bugs are worked out. Our testing packages pass unit and functional tests but will not have the same “soak time” as our stable packages. A testing release represents a work in progress that will eventually be promoted to stable. It’s a long road of feedback, bug fixes, QA and testing to move from testing to stable. As someone who tracks the maturity of a testing build throughout its life cycle, I’m pleased to say we’ve put a lot of polish into this release.
At the beginning of September, we announced the first release of CDH2, our current testing repository. Packages in our testing repository are recommended for people who want more features and are willing to upgrade as bugs are worked out. Our testing packages pass unit and functional tests but will not have the same “soak time” as our stable packages. A testing release represents a work in progress that will eventually be promoted to stable.
We plan on pushing new packages into the testing repository every 3 to 6 weeks. And it just so happens it is just about 3 weeks after we announced the first testing release. So it must be time for a new one. Here are some of the highlights: