Cloudera Engineering Blog · HBase Posts
The ecosystem is evolving at a rapid pace – so rapidly that important developments often pass through the public attention zone too quickly to register. Thus, we think it might be helpful to bring you a digest (by no means complete!) of our favorite highlights on a regular basis. (This effort, by the way, has different goals than the fine Hadoop Weekly newsletter, which has a more expansive view – and which you should subscribe to immediately, as far as we’re concerned.)
Find the first installment below. Although the time period reflected here is obviously more than a month long, we have some catching up to do before we can move to a truly monthly cadence.
For those people new to Apache HBase (version 0.90 and later), the configuration of network ports used by the system can be a little overwhelming.
In this blog post, you will learn about all the TCP ports used by the different HBase processes, and how and why they are used (all in one place), to help administrators troubleshoot and set up firewall settings and to help new developers learn how to debug.
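As a small aid for the firewall troubleshooting mentioned above, here is a minimal Python sketch that probes whether a TCP port accepts connections. The port numbers in the table are the defaults commonly cited for HBase 0.90-era releases and are an assumption here: any of them can be overridden in hbase-site.xml, so check your own configuration. The helper names are illustrative, not part of HBase.

```python
import socket

# Assumed defaults for HBase 0.90-era releases; verify against your own
# hbase-site.xml, since every one of these can be overridden.
ASSUMED_DEFAULT_PORTS = {
    "master RPC": 60000,
    "master web UI": 60010,
    "region server RPC": 60020,
    "region server web UI": 60030,
    "ZooKeeper client": 2181,
}


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts.
        return False


def probe(host, ports=ASSUMED_DEFAULT_PORTS):
    """Map each named service to whether its port accepted a connection."""
    return {name: port_is_open(host, port) for name, port in ports.items()}
```

Calling `probe("hbase-master.example.com")` (a placeholder host name) from a client machine gives a quick view of which ports a firewall lets through. A closed port can also simply mean that the process is not running on that host.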
This how-to is the third in a series that explores the use of the Apache HBase REST interface. Part 1 covered HBase REST fundamentals, some Python caveats, and table administration. Part 2 showed you how to insert multiple rows simultaneously using XML and JSON. Part 3 below will show how to get multiple rows using XML and JSON.
Getting Rows with XML
With the GET verb, you can retrieve a single row or a group of rows based on their row keys. (You can read more about the multiple value URL format here.) Here we are going to use the simple wildcard character, the asterisk (*), to get all rows that start with a specific string. In this example, we can load every line of Shakespeare’s comedies with “shakespeare-comedies-*”. This requires that our row keys be laid out as “AUTHOR-WORK-LINENUMBER”.
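A wildcard GET of this kind can be sketched in Python with only the standard library. This sketch requests JSON rather than XML because it is shorter to decode. It assumes the REST gateway's JSON layout, in which row keys, column names, and values arrive base64-encoded and each value sits under the "$" key. The base URL, table name, and function names are placeholders for illustration and do not come from the original post.

```python
import base64
import json
import urllib.request


def wildcard_url(base_url, table, row_prefix):
    """Build the URL for a GET of every row whose key starts with row_prefix."""
    return "{}/{}/{}*".format(base_url.rstrip("/"), table, row_prefix)


def decode_rows(payload):
    """Turn a JSON response body into {row_key: {column: value}}.

    Assumes the gateway's JSON layout, in which keys, column names, and
    values arrive base64-encoded and the value sits under "$".
    """
    rows = {}
    for row in json.loads(payload).get("Row", []):
        key = base64.b64decode(row["key"]).decode("utf-8")
        rows[key] = {
            base64.b64decode(cell["column"]).decode("utf-8"):
                base64.b64decode(cell["$"]).decode("utf-8")
            for cell in row.get("Cell", [])
        }
    return rows


def get_rows(base_url, table, row_prefix):
    """Fetch and decode all rows matching the prefix (needs a live gateway)."""
    request = urllib.request.Request(
        wildcard_url(base_url, table, row_prefix),
        headers={"Accept": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return decode_rows(response.read().decode("utf-8"))
```

With a gateway running, `get_rows("http://localhost:8080", "tablename", "shakespeare-comedies-")` would return the matching rows as a dictionary keyed by row key. The host, port, and table name are hypothetical.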
Thanks to Steven Noels, SVP of Products for NGDATA, for the guest post below.
NGDATA builds and sells Lily, the next-generation Customer Intelligence Platform that helps enterprise marketing teams collect and store customer interaction data in order to profile, segment, and present better offers. We designed Lily from the ground up to run on Apache HBase and Apache Solr. Combining these technologies with our deep marketing segmentation expertise and unique machine learning techniques we’re able to deliver interactive data management, real-time statistical calculations, faceted search views of customers, offers, interactions and the permutations they each inspire.
In Part 1 of this series about Apache HBase snapshots, you learned how to use the new Snapshots feature and a bit of theory behind the implementation. Now, it’s time to dive into the technical details a bit more deeply.
What is a Table?
An HBase table comprises a set of metadata information and a set of key/value pairs:
For those of you who missed the show, session video and presentation slides (as well as photos) will be available via hbasecon.com in a few weeks. (To be notified, follow @cloudera or @ClouderaEng.) Although it’s not quite as good as being there with the rest of the community, you’ll still be able to benefit from the real-world experiences of Apache HBase users like Facebook, Box, Yahoo!, Salesforce.com, Pinterest, Twitter, Groupon, and more.
This is the week of Apache HBase, with HBaseCon 2013 taking place Thursday, followed by WibiData’s KijiCon on Friday. In the many conversations I’ve had with Cloudera customers over the past 18 months, I’ve noticed a trend: Those that run HBase stand out. They tend to represent a group of very sophisticated Hadoop users that are accomplishing impressive things with Big Data. They deploy HBase because they require random, real-time read/write access to the data in Hadoop. Hadoop is a core component of their data management infrastructures, and these users rely on the latest and greatest components of the Hadoop stack to satisfy their mission-critical data needs.
Today I’d like to shine a spotlight on one innovative company that is putting top engineering talent (and HBase) to work, helping to save the planet — literally.
HBaseCon 2013 is this Thursday (June 13 in San Francisco), and we can hardly wait!
Michael Stack is the chair of the Apache HBase PMC and has been a committer and project “caretaker” since 2007. Stack is a Software Engineer at Cloudera.
Apache Hadoop and HBase have quickly become industry standards for storage and analysis of Big Data in the enterprise, yet as adoption spreads, new challenges and opportunities have emerged. Today, there is a large gap — a chasm, a gorge — between the nice application model your Big Data Application builder designed and the raw, byte-based APIs provided by HBase and Hadoop. Many Big Data players have invested a lot of time and energy in bridging this gap. Cloudera, where I work, is developing the Cloudera Development Kit (CDK). Kiji, an open source framework for building Big Data Applications, is another such thriving option. A lot of thought has gone into its design. More importantly, long experience building Big Data Applications on top of Hadoop and HBase has been baked into how it all works.
Kiji provides a model and set of libraries that help you get up and running quickly.
As you may know, Apache HBase has a vibrant community and gets a lot of contributions from developers worldwide. The collaborative development effort is so active, in fact, that a new point-release comes out about every six weeks (with the current stable branch being 0.94).
At Cloudera, we’re committed to ensuring that CDH, our open source distribution of Apache Hadoop and related projects (including HBase), ships with the results of this steady progress. Thus, CDH 4.2 was rebased on 0.94.2, as compared to its predecessor CDH 4.1, which was based on 0.92.1. CDH 4.3 has moved one step further and is rebased on 0.94.6.1.
Unbelievably, HBaseCon 2013 is only one week away (June 13 in San Francisco)!
As we march toward HBaseCon 2013 (June 13 in San Francisco), it’s time to bring you a preview of the Internals track (see the Operations track preview here) — the track guaranteed to be of most interest to Apache HBase developers and other people tracking the progress of the code base.
As you have probably learned by now, HBaseCon 2013 sessions are organized into four tracks: Operations, Internals, Ecosystem, and Case Studies. In combination, they offer a 360-degree view of Apache HBase that is invaluable for experts and aspiring experts alike. In the next few posts leading up to the conference (June 13 in San Francisco – register now while there’s still room), we’ll offer sneak previews of what each track has to offer.
The schedule/agenda grid for HBaseCon 2013 (rapidly approaching: June 13 in San Francisco) is a thing of beauty.
HBaseCon 2013 is approaching fast – June 13 in San Francisco. If you’re on the fence about attending – or perhaps your manager is on the fence about approving your participation – here are a few things that you/they need to know (in no particular order):
- HBaseCon is the annual rallying point for the HBase community. If you’ve ever had a desire to learn how to get involved in the community as a contributor, or just want to ask a committer or PMC member why things are done (or not done) a certain way, this is your opportunity – because this is where those people are. Participating in a mailing list thread is never quite the same once you’ve met the people behind it.
- HBaseCon is a one-stop shop for learning about the HBase roadmap, as well as other projects across the ecosystem. Current HBase users should be particularly interested in learning about which JIRAs will have the most impact on the user experience – and once again, most of the committers working on those JIRAs will either be leading sessions or otherwise present. Plus, you can learn about how new complementary projects like Impala, Kiji, Phoenix, and Honeycomb are transforming the use cases for HBase and helping to expand its footprint across the enterprise.
- HBaseCon is a feast of real-world experiences and use cases. Sure, maybe you’ve read about the HBase-backed applications used by companies like Facebook, Salesforce.com, eBay, Pinterest, and Yahoo!. But wouldn’t it be helpful to hear technical details and best practices directly from the people who built and run them? I’ll bet it would. And you really can’t do that anywhere else — in the whole world. (Plus, you can take advantage of formal training right before the conference, at a discount.)
- HBaseCon is a pageant of engineer rock-stars. If your company is an HBase user and hungry for talent, there’s no better place to find it: HBaseCon is literally the world’s biggest gathering of HBase experts under one roof.
- HBaseCon is a heck of a blast. Come for the deep-dives and advice, stay for the after-event party. The libations will be extensive!
The post below was originally published at blogs.apache.org/hbase. We re-publish it here for your convenience.
Apache HBase is a distributed big data store modeled after Google’s Bigtable paper. As with all distributed systems, knowing what’s happening at a given time can help spot problems before they arise, debug on-going issues, evaluate new usage patterns, and provide insight into capacity planning.
This post was originally published via blogs.apache.org; we republish it here in a slightly modified form for your convenience:
At first glance, the Apache HBase architecture appears to follow a master/slave model where the master receives all the requests but the real work is done by the slaves. This is not actually the case, and in this article I will describe what tasks are in fact handled by the master and the slaves.
Regions and Region Servers
This how-to is the second in a series that explores the use of the Apache HBase REST interface. Part 1 covered HBase REST fundamentals, some Python caveats, and table administration. Part 2 below will show you how to insert multiple rows at once using XML and JSON. The full code samples can be found on GitHub.
Adding Rows With XML
The REST interface would be useless without the ability to add and update row values. The interface gives us this ability with the POST verb. By posting new rows, we can add new rows or update existing rows using the same row key.
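As a rough illustration of posting several rows at once, the Python sketch below builds a JSON body and sends it with the standard library. It assumes the REST gateway's JSON layout (base64-encoded keys, column names, and values, with each value under "$") and assumes that for a multi-row POST the row key in the URL is only a placeholder. The base URL, table name, and function names are hypothetical.

```python
import base64
import json
import urllib.request


def _b64(text):
    """Base64-encode a string, as the gateway's JSON format expects."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def build_rows_payload(rows):
    """Serialize {row_key: {column: value}} into the JSON body for a POST."""
    return json.dumps({
        "Row": [
            {
                "key": _b64(key),
                "Cell": [
                    {"column": _b64(column), "$": _b64(value)}
                    for column, value in cells.items()
                ],
            }
            for key, cells in rows.items()
        ]
    })


def post_rows(base_url, table, rows):
    """POST several rows in one request (needs a live gateway)."""
    request = urllib.request.Request(
        # The row key in the URL is a placeholder; the keys that count
        # are the ones inside the payload (an assumption of this sketch).
        "{}/{}/fakerow".format(base_url.rstrip("/"), table),
        data=build_rows_payload(rows).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Accept": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

Posting a row key that already exists updates that row instead of creating a second one, so the same helper covers both inserts and updates.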
It’s time for me to give you a quarterly update (here’s the one for Q1) about where to find tech talks by Cloudera employees in 2013. Committers, contributors, and other engineers will travel to meetups and conferences near and far to do their part in the community to make Apache Hadoop a household word!
(Remember, we’re always ready to assist your meetup by providing speakers, sponsorships, and schwag.)
With HBaseCon 2013 (Early Bird registration now open!) preparations in full swing, you may be interested in learning a bit about the personalities behind the Program Committee, who are tasked with formulating a compelling, community-focused agenda.
Recently I had a chance to ask committee members Gary Helmling (Twitter), Lars Hofhansl (Salesforce.com), Jon Hsieh (Cloudera), Doug Meil (Explorys), Andrew Purtell (Intel), Enis Söztutar (Hortonworks), Michael Stack (Cloudera), and Liyin Tang (Facebook) a few questions:
The following FAQ is provided by James Taylor of Salesforce, which recently open-sourced its Phoenix client-embedded JDBC driver for low-latency queries over HBase. Thanks, James!
What is this new Phoenix thing I’ve been hearing about?
Phoenix is an open source SQL skin for HBase. You use the standard JDBC APIs instead of the regular HBase client APIs to create tables, insert data, and query your HBase data.
Hadoop Summit Europe is coming up in Amsterdam next week, so this is an appropriate time to make you aware of the Cloudera speaker program there (all three talks on Thursday, March 21):
The following guest post is provided by Aaron Kimball, CTO of WibiData.
The Kiji ecosystem has grown with the addition of a new module, KijiMR. The Kiji framework is a collection of components that offer developers a handle on building Big Data Applications. In addition to the first release, KijiSchema, we are now proud to announce the availability of a second component: KijiMR. KijiMR allows KijiSchema users to use MapReduce techniques including machine-learning algorithms and complex analytics to develop many kinds of applications using data in KijiSchema. Read on to learn more about the major features included in KijiMR and how you can use them.
There are various ways to access and interact with Apache HBase. The Java API provides the most functionality, but many people want to use HBase without Java.
There are two main approaches for doing that: One is the Thrift interface, which is the faster and more lightweight of the two options. The other way to access HBase is using the REST interface, which uses HTTP verbs to perform an action, giving developers a wide choice of languages and programs to use.
The current (4.2) release of CDH — Cloudera’s 100% open-source distribution of Apache Hadoop and related projects (including Apache HBase) — introduced a new HBase feature, recently landed in trunk, that allows an admin to take a snapshot of a specified table.
Prior to CDH 4.2, the only way to back-up or clone a table was to use Copy/Export Table, or after disabling the table, copy all the hfiles in HDFS. Copy/Export Table is a set of tools that uses MapReduce to scan and copy the table but with a direct impact on Region Server performance. Disabling the table stops all reads and writes, which will almost always be unacceptable.
(Added Feb. 25 2013: Early Bird registration is now open – closes April 23, 2013!)
Cloudera University is the world leader in Apache Hadoop training and certification. Our full suite of live courses and online materials is the best resource to get started with your Hadoop cluster in development or advance it towards production. We offer deep industry insight into the skills and expertise required to establish yourself as a leading Developer or Administrator managing and processing Big Data in this fast-growing field.
But did you know Cloudera training can also help you plan for the advanced stages and progress of your Hadoop cluster? In addition to core training for Developers and Administrators, we also offer the best (and, in some cases, only) opportunity to get up to speed on lifecycle projects within the Hadoop ecosystem in a classroom setting. Cloudera University’s course offerings go beyond the basics to include Training for Apache HBase, Training for Apache Hive and Pig, and Introduction to Data Science: Building Recommender Systems. Depending on your Big Data agenda, Cloudera training can help you increase the accessibility and queryability of your data, push your data performance towards real-time, conduct business-critical analyses using familiar scripting languages, build new applications and customer-facing products, and conduct data experiments to improve your overall productivity and profitability.
The following post was originally published via blog.apache.org; we republish it here for your convenience.
NOTE: This blog post describes how Apache HBase does concurrency control. This assumes knowledge of the HBase write path, which you can read more about in this other blog post.
For several good reasons, 2013 is a Happy New Year for Apache Hadoop enthusiasts.
In 2012, we saw continued progress on developing the next generation of the MapReduce processing framework (MRv2), work that will bear fruit this year. HDFS experienced major progress toward becoming a lights-out, fully enterprise-ready distributed filesystem with the addition of high availability features and increased performance. And a hint of the future of the Hadoop platform was provided with the Beta release of Cloudera Impala, a real-time query engine for analytics across HDFS and Apache HBase data.
Our hearty congratulations to the Cloudera engineers who have been accepted as ApacheCon NA 2013 (Feb. 26-28 in Portland, OR) speakers for these talks:
At Cloudera, we put great pride into drinking our own champagne. That pride extends to our support team, in particular.
Cloudera Manager, our end-to-end management platform for CDH (Cloudera’s open-source, enterprise-ready distribution of Apache Hadoop and related projects), has a feature that allows subscription customers to send a snapshot of their cluster to us. When these cluster snapshots come to us from customers, they end up in a CDH cluster at Cloudera where various forms of data processing and aggregation can be performed.
AssignmentManager is a module in the Apache HBase Master that manages the assignment of regions to RegionServers. (See HBase architecture for more information.) It ensures that all regions are assigned and that each region is assigned to just one RegionServer.
Although the AssignmentManager generally does a good job, the existing implementation does not handle assignments as well as it could. For example, a region could be assigned to two or more RegionServers, some regions could get stuck in transition and never be assigned, or unknown region exceptions could be thrown while moving a region from one RegionServer to another.
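The invariant described above (every region assigned, and to exactly one RegionServer) can be expressed as a small check. The Python sketch below is purely illustrative: it is not HBase code, and the function and argument names are hypothetical. It takes a snapshot of which server claims which regions and reports the two failure modes mentioned, unassigned regions and regions claimed by more than one server.

```python
from collections import defaultdict


def audit_assignments(all_regions, assignments):
    """Check the invariant that every region lives on exactly one server.

    all_regions: iterable of region names that should be online.
    assignments: {server_name: iterable of region names it hosts}.
    Returns (unassigned, multiply_assigned), where multiply_assigned maps
    each offending region to the sorted list of servers claiming it.
    """
    hosts = defaultdict(set)
    for server, regions in assignments.items():
        for region in regions:
            hosts[region].add(server)
    unassigned = sorted(r for r in all_regions if r not in hosts)
    multiply_assigned = {
        region: sorted(servers)
        for region, servers in hosts.items()
        if len(servers) > 1
    }
    return unassigned, multiply_assigned
```

A healthy cluster snapshot yields an empty list and an empty dictionary. Anything else points at a region that is stuck or doubly assigned.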
The following post was originally published via blog.apache.org; we are re-publishing it here.
Apache Flume was conceived as a fault-tolerant ingest system for the Apache Hadoop ecosystem. Flume comes packaged with an HDFS Sink which can be used to write events into HDFS, and two different implementations of HBase sinks to write events into Apache HBase. You can read about the basic architecture of Apache Flume 1.x in this blog post. You can also read about how Flume’s File Channel persists events and still provides extremely high performance in an earlier blog post. In this article, we will explore how to configure Flume to write events into HBase, and write custom serializers to write events into HBase in a format of the user’s choice.
Announcing the Kiji Project: An Open Source Framework for Building Big Data Applications with Apache HBase
The following is a guest post from Aaron Kimball, who was Cloudera’s first engineer and the creator of the Apache Sqoop project. He is the Founder and CTO at WibiData, a San Francisco-based company building big data applications.
Our team at WibiData has been developing applications on Hadoop since 2010 and we’ve helped many organizations transform how they use data by deploying Hadoop. HBase in particular has allowed companies of all types to drive their business using scalable, high performance storage. Organizations have started to leverage these capabilities for various big data applications, including targeted content, personalized recommendations, enhanced customer experience and social network analysis.
After a long period of intense engineering effort and user feedback, we are very pleased, and proud, to announce the Cloudera Impala project. This technology is a revolutionary one for Hadoop users, and we do not take that claim lightly.
When Google published its Dremel paper in 2010, we were as inspired as the rest of the community by the technical vision to bring real-time, ad hoc query capability to Apache Hadoop, complementing traditional MapReduce batch processing. Today, we are announcing a fully functional, open-sourced codebase that delivers on that vision – and, we believe, a bit more – which we call Cloudera Impala. An Impala binary is now available in public beta form, but if you would prefer to test-drive Impala via a pre-baked VM, we have one of those for you, too. (Links to all downloads and documentation are here.) You can also review the source code and testing harness at Github right now.
Apache HBase will have a notable profile at ApacheCon Europe next month. Clouderan and HBase committer Lars George has two sessions on the schedule:
In this installment of “Meet the Engineers”, meet Todd Lipcon (@tlipcon), PMC member/committer for the Hadoop, HBase, and Thrift projects.
StumbleUpon (SU) and Cloudera have signed a technology collaboration agreement. Cloudera will support the SU clusters, and in exchange, Cloudera will have access to a variety of production deploys on which to study and try out beta software.
As part of the agreement, the StumbleUpon Apache HBase+Apache Hadoop team — Jean-Daniel Cryans, Elliott Clark and I — have joined Cloudera. From our new perch up in the Cloudera San Francisco office — 10 blocks north and 11 floors up — we will continue as first-level support for SU clusters, tending and optimizing them as we have always done. The rest of our time will be spent helping develop Apache HBase as the newest additions to Cloudera’s HBase team.
Update time! As a reminder, Cloudera releases major versions of CDH, our 100% open source distribution of Apache Hadoop and related projects, annually and then updates to CDH every three months. Updates primarily comprise bug fixes but we will also add enhancements. We only include fixes or enhancements in updates that maintain compatibility, improve system stability and still allow customers and users to skip updates as they see fit.
We’re pleased to announce the availability of CDH4.1. We’ve seen excellent adoption of CDH4.0 since it went GA at the end of June and a number of exciting use cases have moved to production. CDH4.1 is an update that has a number of fixes but also a number of useful enhancements. Among them:
With the default Apache HBase configuration, everyone is allowed to read from and write to all tables available in the system. For many enterprise setups, this kind of policy is unacceptable.
Administrators can set up firewalls that decide which machines are allowed to communicate with HBase. However, machines that can pass the firewall are still allowed to read from and write to all tables. This kind of mechanism is effective but insufficient because HBase still cannot differentiate between multiple users that use the same client machines, and there is still no granularity with regard to HBase table, column family, or column qualifier access.
We’re getting really close to Strata Conference + Hadoop World 2012 (just over a month away), schedule planning-wise. So you may want to consider adding the tutorials, sessions, and keynotes below to your calendar! (Start times are always subject to change of course.)
The ones listed below are led or co-led by Clouderans, but there is certainly a wide range of attractive choices beyond what you see here. We just want to ensure that you put these particular ones high on your consideration list.
Strata Conference + Hadoop World (Oct. 23-25 in New York City) is a bonanza for Hadoop and big data enthusiasts – but not only because of the technical sessions and tutorials. It’s also an important gathering place for the developer community, most of whom are eager to share info from their experiences in the “trenches”.
Just to make that process easier, Cloudera is teaming up with local meetups during that week to organize a series of meetings on a variety of topics. (If for no other reason, stop into one of these meetups for a chance to grab a coveted Cloudera t-shirt.)
In this installment of “Meet the Engineer”, we meet with Eric Sammer (invariably known as just plain “Sammer”), Apache committer and author of the upcoming O’Reilly book, Hadoop Operations.
What do you do at Cloudera, and in which Apache project are you involved?
Organizations in diverse industries have adopted Apache Hadoop-based systems for large-scale data processing. As a leading force in Hadoop development with customers in half of the Fortune 50 companies, Cloudera is in a unique position to characterize and compare real-life Hadoop workloads. Such insights are essential as developers, data scientists, and decision makers reflect on current use cases to anticipate technology trends.
Recently we collaborated with researchers at UC Berkeley to collect and analyze a set of Hadoop traces. These traces come from Cloudera customers in e-commerce, telecommunications, media, and retail (Table 1). Here I will explain a subset of the observations, and the thoughts they triggered about challenges and opportunities in the Hadoop ecosystem, both present and in the future.
Apache HBase junkies, this one’s for you: I had an opportunity recently for a quick chat with the authors of HBase in Action (Manning Publications – download sample chapter PDF), by Nick Dimiduk and Cloudera’s Amandeep Khurana.
In June 2012, Eli Collins (@elicollins), from Cloudera’s Platforms team, led a session at QCon New York 2012 on the subject “Introducing Apache Hadoop: The Modern Data Operating System.” During the conference, the QCon team had an opportunity to interview Eli about several topics, including important things to know about CDH4, main differences between MapReduce 1.0 and 2.0, Hadoop use cases, and more. It’s a great primer for people who are relatively new to Hadoop.
You can catch the full interview (video and transcript versions) here.
This is the second blogpost about Apache HBase replication. The previous blogpost, HBase Replication Overview, discussed use cases, architecture and different modes supported in HBase replication. This blogpost is from an operational perspective and will touch upon HBase replication configuration, and key concepts for using it — such as bootstrapping, schema change, and fault tolerance.
As mentioned in HBase Replication Overview, the master cluster ships WALEdits to one or more slave clusters. This section describes the steps needed to configure replication in master-slave mode.
- All tables/column families that are to be replicated must exist on both the clusters.
- Add the following property in $HBASE_HOME/conf/hbase-site.xml on all nodes on both clusters; set it to true.
We are happy to announce the general availability of CDH3 update 5. This update is a maintenance release of the CDH3 platform and provides a considerable number of bug fixes and stability enhancements. Alongside these fixes, we have also included a few new features, the most notable of which are the following: