Cloudera Blog · HBase Posts
In this installment of “Meet the Engineer”, we meet with Eric Sammer (invariably known as just plain “Sammer”), Apache committer and author of the upcoming O’Reilly book, Hadoop Operations.
What do you do at Cloudera, and in which Apache project are you involved?
I've been lucky enough to be part of a few different teams at Cloudera since I joined. Almost three years ago, I joined Cloudera as a Solution Architect, a member of the professional services team. Most of my time was spent working with customers to build out Apache Hadoop and Apache HBase clusters, and designing data integration and processing pipelines. I also occasionally had the opportunity to fill in with the training team, teaching Cloudera's Hadoop Developer and Administration courses to both public and private groups. There's nothing more exciting than getting to hang out with a group of smart people and talk about Hadoop all day. I have since moved into a Principal Solution Architect role, spending more time in the office, working on architectural patterns and problems that recur across customers, and working with internal teams on ways to improve CDH and Cloudera Manager.
Organizations in diverse industries have adopted Apache Hadoop-based systems for large-scale data processing. As a leading force in Hadoop development, with customers including half of the Fortune 50 companies, Cloudera is in a unique position to characterize and compare real-life Hadoop workloads. Such insights are essential as developers, data scientists, and decision makers reflect on current use cases to anticipate technology trends.
Recently we collaborated with researchers at UC Berkeley to collect and analyze a set of Hadoop traces. These traces come from Cloudera customers in e-commerce, telecommunications, media, and retail (Table 1). Here I will explain a subset of the observations and the thoughts they triggered about challenges and opportunities in the Hadoop ecosystem, both present and future.
Table 1. Summary of Hadoop workloads analyzed
Apache HBase junkies, this one's for you: I recently had the opportunity for a quick chat with the authors of HBase in Action (Manning Publications; a sample chapter PDF is available for download), Nick Dimiduk and Cloudera's Amandeep Khurana.
Why did you write HBase in Action?
In June 2012, Eli Collins (@elicollins), from Cloudera’s Platforms team, led a session at QCon New York 2012 on the subject “Introducing Apache Hadoop: The Modern Data Operating System.” During the conference, the QCon team had an opportunity to interview Eli about several topics, including important things to know about CDH4, main differences between MapReduce 1.0 and 2.0, Hadoop use cases, and more. It’s a great primer for people who are relatively new to Hadoop.
You can catch the full interview (video and transcript versions) here.
This is the second blog post about Apache HBase replication. The previous post, HBase Replication Overview, discussed use cases, architecture, and the different modes supported in HBase replication. This post takes an operational perspective and touches upon HBase replication configuration and key concepts for using it, such as bootstrapping, schema changes, and fault tolerance.
As mentioned in HBase Replication Overview, the master cluster ships WALEdits to one or more slave clusters. This section describes the steps needed to configure replication in master-slave mode.
- All tables/column families that are to be replicated must exist on both clusters.
- Add the hbase.replication property to $HBASE_HOME/conf/hbase-site.xml on all nodes of both clusters, and set it to true (see the sketch after this list).
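To make that flag concrete, here is a minimal, hypothetical Java sketch of the programmatic equivalent, assuming the CDH4-era (HBase 0.92) client API. In practice you would set hbase.replication in hbase-site.xml and add the peer from the HBase shell (add_peer '1', "slave-zk:2181:/hbase"); the peer id and ZooKeeper cluster key below are placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.replication.ReplicationAdmin;

// Sketch only: enable replication and register a slave cluster as a peer.
public class EnableReplicationSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // The same flag that goes into hbase-site.xml on every node.
    conf.setBoolean("hbase.replication", true);

    ReplicationAdmin admin = new ReplicationAdmin(conf);
    // Peer id plus the slave cluster's ZooKeeper key
    // (quorum:clientPort:znodeParent); values are placeholders.
    admin.addPeer("1", "slave-zk-host:2181:/hbase");
  }
}
```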
We are happy to announce the general availability of CDH3 update 5. This update is a maintenance release of the CDH3 platform and provides a considerable number of bug fixes and stability enhancements. Alongside these fixes, we have included a few new features, most notable of which are the following:
Apache HBase replication is a way of copying data from one HBase cluster to a different, possibly distant, HBase cluster. It works on the principle that transactions from the originating cluster are pushed to the other cluster. In HBase jargon, the cluster doing the pushing is called the master, and the one receiving the transactions is called the slave. Transactions are pushed asynchronously and batched up to a configurable size (the default is 64MB). Asynchronous mode incurs minimal overhead on the master, and shipping edits in batches increases overall throughput.
This blog post discusses the possible use cases, underlying architecture, and modes of HBase replication as supported in CDH4 (which is based on HBase 0.92). Replication configuration, bootstrapping, and fault tolerance will be covered in a follow-up blog post.
HBase replication supports replicating data across datacenters, which is useful in disaster recovery scenarios: the slave cluster can serve real-time traffic if the master site goes down. Since HBase replication is not intended for automatic failover, switching from the master to the slave cluster to start serving traffic must be done by the user. Afterwards, once the master cluster is up again, one can run a CopyTable job to copy the deltas back to the master cluster (by providing the start/stop timestamps), as described in the CopyTable blog post.
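As an illustration of that catch-up step, here is a minimal sketch of launching CopyTable programmatically, assuming the CDH4-era org.apache.hadoop.hbase.mapreduce.CopyTable API. The timestamps, ZooKeeper address, and table name are placeholders; the same job is more commonly launched from the command line (hbase org.apache.hadoop.hbase.mapreduce.CopyTable --starttime=... --endtime=... --peer.adr=... tableName), run on the slave cluster with the peer address pointing at the restored master.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.mapreduce.CopyTable;
import org.apache.hadoop.mapreduce.Job;

// Sketch only: copy the edits written during the failover window
// from the slave cluster back to the master cluster.
public class ShipDeltasSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create(); // slave cluster config
    Job job = CopyTable.createSubmittableJob(conf, new String[] {
        "--starttime=1350000000000",         // start of failover window (ms)
        "--endtime=1350100000000",           // end of failover window (ms)
        "--peer.adr=master-zk:2181:/hbase",  // cluster to copy INTO
        "usertable"                          // table to copy (placeholder)
    });
    job.waitForCompletion(true);
  }
}
```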
In the recent blog post about the Apache HBase Write Path, we talked about the write-ahead log (WAL), which plays an important role in preventing data loss should an HBase region server fail. This blog post describes how HBase prevents data loss after a region server crashes, using an especially critical process for recovering lost updates called log splitting.
As we mentioned in the write path blog post, HBase data updates are stored in an in-memory store called the memstore for fast writes. In the event of a region server failure, the contents of the memstore would be lost because they have not been saved to disk yet. To prevent data loss in such a scenario, each update is persisted to a WAL file before it is stored in the memstore. If a region server fails, the lost contents of the memstore can be regenerated by replaying the updates (also called edits) from the WAL file.
A region server serves many regions, and all of the regions in a region server share the same active WAL file. Each edit in the WAL file records which region it belongs to. When a region is opened, the edits in the WAL file that belong to that region need to be replayed. Therefore, edits in the WAL file must be grouped by region so that the right set can be replayed to regenerate the data in a particular region. This process of grouping the WAL edits by region is called log splitting, and it is critical for recovering data if a region server fails.
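To make the grouping step concrete, here is a toy Java sketch of the core idea: the WAL is one interleaved sequence of edits from many regions, and splitting groups them by region so each region's set can be replayed independently. All names here are hypothetical; the real implementation also handles reading the log files, failure handling, and writing per-region recovered-edits files.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of log splitting, not the actual HBase code.
public class LogSplittingSketch {

  // A WAL edit tagged with the region it belongs to (hypothetical type).
  static class WalEdit {
    final String regionName;
    final byte[] mutation; // the serialized update
    WalEdit(String regionName, byte[] mutation) {
      this.regionName = regionName;
      this.mutation = mutation;
    }
  }

  // Group the edits of one WAL file by region, preserving their order.
  static Map<String, List<WalEdit>> splitLog(List<WalEdit> walFile) {
    Map<String, List<WalEdit>> editsByRegion = new HashMap<>();
    for (WalEdit edit : walFile) {
      editsByRegion
          .computeIfAbsent(edit.regionName, r -> new ArrayList<>())
          .add(edit);
    }
    // Each per-region list can now be replayed to rebuild that
    // region's lost memstore contents.
    return editsByRegion;
  }
}
```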
Apache Flume is a scalable, reliable, fault-tolerant, distributed system designed to collect, transfer, and store massive amounts of event data into HDFS. Apache Flume recently graduated from the Apache Incubator to become a Top Level Project at Apache. Flume is designed to send data over multiple hops from the initial source(s) to the final destination(s). Click here for details of the basic architecture of Flume. In this article, we will discuss in detail some new components in Flume 1.x (also known as Flume NG), which is currently on the trunk branch: techniques and components that can be used to route the data, configuration validation, and finally support for serializing events.
In the past several months, contributors have been busy adding several new sources, sinks, and channels to Flume. Flume now supports syslog as a source: sources have been added for syslog over both TCP and UDP.
Flume now has a high-performance persistent channel, the File Channel. This means that if the agent fails for any reason, events that have been committed to the channel by a source but not yet taken and committed by a sink are reloaded from disk when the agent starts up again, and can be taken from there. Events are removed from the channel only when the transaction is committed by the sink. The File Channel uses a write-ahead log to save events.
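To illustrate that contract, here is a minimal sketch of the sink side of Flume's channel transaction API (Channel, Transaction, and Event are real Flume 1.x interfaces; the deliver step is a hypothetical destination write). An event taken inside a transaction still lives in the channel, backed by the write-ahead log in the File Channel's case, until commit, so a crash or rollback leaves it available to be taken again.

```java
import org.apache.flume.Channel;
import org.apache.flume.Event;
import org.apache.flume.Transaction;

// Sink-side sketch: take one event from a channel transactionally.
public class SinkDrainSketch {

  // 'channel' is assumed to be an already-configured File Channel.
  void drainOne(Channel channel) {
    Transaction tx = channel.getTransaction();
    tx.begin();
    try {
      Event event = channel.take(); // still persisted by the channel
      if (event != null) {
        deliver(event);             // e.g., write to HDFS
      }
      tx.commit();                  // only now is the event removed
    } catch (Exception e) {
      tx.rollback();                // event remains in the channel on disk
      throw new RuntimeException("delivery failed; event not lost", e);
    } finally {
      tx.close();
    }
  }

  // Hypothetical destination write; a real sink would batch many
  // events per transaction for throughput.
  void deliver(Event event) {
    System.out.println(new String(event.getBody()));
  }
}
```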
Apache HBase is Hadoop's open-source, distributed, versioned storage manager, well suited for random, real-time read/write access.
Wait, wait: random, real-time read/write access?
How is that possible? Isn't Hadoop just a sequential read/write, batch-processing system?
Yes, we're talking about the same thing, and in the next few paragraphs I'm going to explain how HBase achieves its random I/O, how it stores data, and how HBase's HFile format has evolved.