The number of powerful data query tools in the Apache Hadoop ecosystem can be confusing, but understanding a few simple things about your needs usually makes the choice easy.
Ah, the good old days. I recall vividly that in 2007, I was faced with storing 1 billion XML documents and making them accessible as well as searchable. On a shoestring budget, I had few choices: build something on my own (it was all the rage back then, and still is), use an existing open source database like PostgreSQL or MySQL, or try this thing that Google had built successfully and that was now implemented in open source under the Apache umbrella: Hadoop.
So I bet on Hadoop, and on Apache HBase in due course, since storing that many small files directly in HDFS, and combining and maintaining them, proved impractical. I probably had the first-ever HBase production cluster online in 2008, as the other users were still in development or non-commercial use.
In 2014, the list of tools at your disposal is nearly endless. Seemingly every other month, a new framework that solves the mother of all problems is announced, but luckily, the pace at which they join the Hadoop ecosystem is rather measured. We see evolution rather than revolution; those new projects have to prove themselves before being deemed candidates for inclusion. But even within the Hadoop stack, we now have enough choices that in my role as one of Cloudera’s chief architects, I have been asked many times how to implement specific use cases, with requirements such as:
- Random reads and writes only
- Random access, but also high throughput sequential scans
- Analytical queries that are mostly scans
- Interactive vs. batch-oriented use of data
- Slowly changing dimensions (SCD) in an OLAP-like setup
In the past, either MapReduce or HBase would cover each of these use cases. The former was the workhorse for anything batch oriented that needed high sequential throughput, as fast as disks could spin. The latter was the answer for everything else, because it was impractical to rewrite the typically very large files in HDFS for just a few small updates.
Sure, if those updates were rather rare, one could (and did) build purely MapReduce-based solutions, using ETL-style workflows that merged changes as part of the overall data pipeline. But for truly random access to data, and the ability to serve it, there was only HBase, the Hadoop database. Then one day during Strata+Hadoop World 2012, everything changed.
MPP Query Engines
Cloudera announced Impala at that conference in October 2012 and shipped it in early 2013. Now, you have an SQL-based query engine that can query data stored natively in HDFS (and also in HBase, but that is a different topic I will address soon). With Impala, you can query data much as you would with a commercial MPP database: all the servers in a cluster work together to receive the user query, distribute the work among themselves, read data locally at raw disk speeds, and stream the results back to the user, without ever materializing intermediate data or spinning up new OS processes as MapReduce does. It pushes MapReduce into a batch-oriented corner, and lets standard BI tools connect directly with Hadoop data.
However, Impala also raises a slew of new questions about where HBase fits. Could Impala replace HBase? Probably not, as it still deals with immutable files that were staged by ETL workflows higher up the data ingest and processing pipeline (also see this earlier post). In practice, I often end up in situations where the customer is really trying to figure out where one starts and the other ends. I call this process the “Trough of Indecision”:
The primary driver here is what you need to achieve: high throughput, sequential scans, or random access to your data that you need to keep current along the way?
Those are the most obvious choices. But what if you need random access as well as sequential scans? Or scans that are as fast as possible, but over data you can also update? That’s where the decisions get harder.
The SCD Problem
In the relational world, and especially the analytical OLAP one, there is a modeling technique referred to as “slowly changing dimensions” (SCD). (You can get much more info on this from one of Cloudera’s webinars, held by the one-and-only Ralph Kimball.) Suffice it to say that you lay out data in a relational database in a way that allows you to update dimension tables over time. Those are then JOINed (the SQL operation) with the fact tables when a report needs to be generated. If you move this data over into Hadoop, and especially HDFS, you have many choices for engineering a suitable solution. (Again, please see this post for more details.) Typically, you would land the data and transform it into the Apache Parquet file format. Often you pre-materialize the final results so that reading them does not involve any heavy lifting, owing to the cost of the necessary I/O at scale.
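To make the technique concrete, here is a minimal Python sketch of a Type 2 SCD update, in which a changed dimension row is closed out and a new version appended, rather than updated in place. The field names, keys, and dates are illustrative assumptions, not part of any specific schema:

```python
from datetime import date

def apply_scd2(dim_rows, key, changes, as_of):
    """Close the active version of `key` and append a new one (Type 2 SCD)."""
    for row in dim_rows:
        if row["key"] == key and row["valid_to"] is None:
            row["valid_to"] = as_of  # close the old version
            new_row = {**row, **changes,
                       "valid_from": as_of, "valid_to": None}
            dim_rows.append(new_row)  # append the new version
            return dim_rows
    raise KeyError(key)

# Illustrative dimension table with one active row.
dims = [{"key": 1, "city": "Berlin",
         "valid_from": date(2013, 1, 1), "valid_to": None}]

apply_scd2(dims, 1, {"city": "Munich"}, date(2014, 6, 1))

# Two versions now exist; only the newest one is still "active".
active = [r for r in dims if r["valid_to"] is None]
assert len(dims) == 2 and active[0]["city"] == "Munich"
```

Reports then join the fact table against whichever dimension version was valid at the fact’s timestamp, which is exactly what the update-in-place workflows below have to preserve.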
On the HBase side, you can store the data differently as well, since you have the power to embed or nest dependent entities into the main record. This is different from the star schema that you may retain in the HDFS version of the data. You can also create a very flexible schema that allows you to cover many aspects of usually normalized data structures.
How could you handle the SCD problem in either HDFS with Parquet format, or with HBase as a row-based, random access store? Both give you the ability to update the data—either rewrite the immutable data files, or update columns directly—so why use either?
Amdahl’s Law, or the Cost of Resurrection
With HBase, there is an inherent cost to converting data into its binary representation and back again. I have run some tests recently, and the bottom line is that schema design in HBase is ever so important. It is basically Amdahl’s Law at play, which says that the sequential part of an algorithm limits the overall performance of an operation. For HBase (and this also applies to any other data store that handles small data points), that sequential part is the deserialization of the cells, aka the key-value pairs. It represents a fixed cost, while the variable cost depends on how large the cell is (for loading and copying it in memory).
For a very small cell, the fixed cost dominates. Once you are dealing with large cells, the fixed cost is small in comparison. In other words, small cells are faster on a cells-per-second basis, but slower on a MB-per-second basis. In HBase, it is not “rows per second” but “cells per second” that matters. Note that keeping fewer versions around should also increase scan performance roughly linearly. The diagram shows the difference between row- and cell-based access, and how versions matter.
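To illustrate the fixed-versus-variable cost argument, here is a toy Python model of per-cell scan cost. The cost constants are invented for illustration, not measured HBase numbers:

```python
# Toy cost model: each cell pays a fixed deserialization cost plus a
# variable cost proportional to its size. Constants are assumptions.
FIXED_COST_US = 2.0            # assumed fixed cost per cell (microseconds)
VARIABLE_COST_US_PER_KB = 0.5  # assumed cost per KB to load/copy the value

def scan_throughput(cell_size_kb):
    """Return (cells per second, MB per second) for a given cell size."""
    cost_us = FIXED_COST_US + VARIABLE_COST_US_PER_KB * cell_size_kb
    cells_per_sec = 1_000_000 / cost_us
    mb_per_sec = cells_per_sec * cell_size_kb / 1024
    return cells_per_sec, mb_per_sec

small = scan_throughput(0.1)  # 100-byte cells
large = scan_throughput(100)  # 100 KB cells

# Small cells win on cells/sec but lose badly on MB/sec, because the
# fixed cost dominates their total cost.
assert small[0] > large[0] and small[1] < large[1]
```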
You have a choice to store every data point separately or combine them into larger entities, for example an Apache Avro record, to improve scan performance. Of course, this puts the onus of decoding the larger record on the application.
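As a sketch of that tradeoff, the following snippet uses JSON as a stand-in for Avro: packing several data points into one record means a scan pays the per-cell fixed cost once per record, while the application takes on decoding the record itself. The field names are illustrative:

```python
import json

# Illustrative data points for one logical row.
points = {"temp": 21.5, "humidity": 0.43, "pressure": 1013}

# Option 1: one cell per data point -- three cells, three
# per-cell deserializations during a scan.
cells = {k: str(v).encode() for k, v in points.items()}

# Option 2: one packed record per row (JSON standing in for Avro) --
# a single cell, one deserialization; the application decodes the rest.
packed = json.dumps(points).encode()
decoded = json.loads(packed)

assert decoded == points
assert len(cells) == 3
```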
For HDFS, you do not have the freedom to access data by specific keys, as HBase offers. Rather, you have to scan some number of files to find the information you are looking for. To reduce the overall I/O, you typically partition the data so that specific directories contain data for a given selector. This is usually a column in the data with decent cardinality: not too many values, not too few. You want each partition to end up a few GB in size. Or, more generically, the partition size should not get you into the small-files problem mentioned previously. Ideally, each directory holds one file whose size is N x the HDFS block size, and that is splittable for parallelism.
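The partition-sizing rule of thumb can be sanity-checked with some back-of-the-envelope Python; the candidate columns, cardinalities, and bounds below are purely illustrative:

```python
# Estimate average partition size from a column's cardinality and flag
# columns that would yield partitions that are too small (small-files
# problem) or too large (too little partition pruning).

def avg_partition_gb(total_gb, cardinality):
    return total_gb / cardinality

def reasonable(total_gb, cardinality, lo_gb=1, hi_gb=64):
    size = avg_partition_gb(total_gb, cardinality)
    return lo_gb <= size <= hi_gb

total = 10_000  # assume 10 TB of data overall

assert not reasonable(total, 5_000_000)  # e.g. a user ID: tiny partitions
assert not reasonable(total, 10)         # e.g. a region code: huge partitions
assert reasonable(total, 3_650)          # e.g. day over 10 years: ~2.7 GB each
```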
Conversely, it is much easier to store as much data as you want in HDFS than it is in HBase. The latter has to maintain the aforementioned index to the cell keys, and that is a cost factor: HBase needs Java heap space for it, just as the small-files problem does. In the end, you have a scarce resource that you can spend one way or another, a tradeoff that needs to be handled carefully. Data in HDFS can be read as fast as the I/O permits, while HBase has to do the additional work of reconstituting the cell structures (even if only at the lowest level).
Write Amplification Woes
Both HDFS and HBase now face another issue: write amplification. If you update data in HDFS, you need to write a new delta file and eventually merge it with the original, creating a new file with the updated data. The delta and original files are then deleted.
That is pretty much exactly what HBase compactions do: rewrite files asynchronously to keep them fresh. The delta files here are the so-called flush files, and the original files are the older HFiles from previous flush or compaction runs. For Hive, this is implemented under HIVE-5317, though that is mostly for slow-changing data, while HBase also suits fast-changing data.
You can tune the I/O in HBase by changing the compaction settings: compact more aggressively, causing more background I/O, or tone down the compactions, leaving more HFiles in HDFS but incurring less “hidden” cost. The drawback of the latter is that you have to read data from more files, and if your use case does not allow you to do some sort of grouping of data, then reads will be slower, and more memory is used to hold the block data. (In other words, you will see churn on the HBase block cache.)
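For reference, these are some of the standard compaction knobs one might adjust in hbase-site.xml; the values shown are illustrative, not recommendations, and defaults vary by HBase version:

```xml
<!-- Illustrative compaction tuning; check the defaults for your version. -->
<property>
  <name>hbase.hstore.compactionThreshold</name>
  <value>5</value> <!-- allow more HFiles before a minor compaction triggers -->
</property>
<property>
  <name>hbase.hstore.blockingStoreFiles</name>
  <value>15</value> <!-- tolerate more store files before blocking writes -->
</property>
<property>
  <name>hbase.hregion.majorcompaction</name>
  <value>0</value> <!-- disable time-based major compactions; run them manually -->
</property>
```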
With HDFS, you can do the same, but only manually. You need to create a temporary table, then UNION ALL the delta with the original file, and finally swap out the file. This needs to be automated using Oozie and run at the appropriate times. The overarching question is, when is this approach “good enough”?
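The merge itself can be modeled in a few lines of Python: in Hive or Impala, the same logic would be a UNION ALL of the delta and the original table with the newest row winning per key, followed by swapping the files. The keys and values here are made up:

```python
# Model the original table and the delta as dicts keyed by primary key.
base  = {1: ("alice", "v1"), 2: ("bob", "v1"), 3: ("carol", "v1")}
delta = {2: ("bob", "v2"), 4: ("dave", "v1")}

# "UNION ALL + keep the latest row per key": delta rows override base rows.
merged = {**base, **delta}

# The merged result becomes the new table; base and delta files are deleted.
assert merged[2] == ("bob", "v2")   # updated row
assert merged[4] == ("dave", "v1")  # newly inserted row
assert len(merged) == 4
```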
Appropriate Use Cases for Impala vs. HBase
So, having seen that storage is in the end just physics, since you have to convert and ship data during reads and writes, how does that translate to Impala and HBase? The following diagram tries to gauge the various use cases:
- Complete random access
- Random access with scans
- Hybrid of the above
- High-throughput analytical scans
You can see that with Impala you get the best throughput; it simply has fewer moving parts, and its lower granularity (see the notes on Amdahl’s Law above) makes for the best I/O performance. HBase, in contrast, gives you the most freedom to access data randomly. In the middle, the aforementioned trough, you have a choice. In practice, I start with the left-hand side (Impala first). Only when I have to guarantee random updates in addition to low latency do I tilt toward the HBase side.
Single Source of Truth
So how on earth do you select across these choices? What advantages and disadvantages does each have?
One way to judge is by total cost of ownership (TCO): How often do you have to copy the data to make each tool work? A few examples:
- Spark and Impala read data natively off HDFS, requiring no extra copy. (Well, they might require one, depending on mixed use cases that need different storage formats to work most efficiently.)
- Search and the NoSQL faction need a copy of some or all data. For Search you have to build separate indexes, which usually does not mean duplicating existing ones but rather creating them in the first place.
- All the tools need memory to optimize recurring access to data. HDFS can cache data, HBase has the block cache, Search stores index structures, and Spark can pin intermediate data in memory for fast iterations. The more you spend, the more you get.
- Some NoSQL solutions trade latency for efficiency; for example, Apache Cassandra dilutes the total cluster memory by the read factor (the R in “R + W > N”, the condition for consistency), as it has to read blocks of data on that many servers.
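The quorum rule behind that read factor is simple arithmetic: with N replicas, reads touching R of them and writes touching W of them are guaranteed to overlap in at least one replica only when R + W exceeds N. A quick Python check, with illustrative replica counts:

```python
def strongly_consistent(n, r, w):
    """Reads of r and writes of w replicas always overlap iff r + w > n."""
    return r + w > n

assert strongly_consistent(3, 2, 2)      # classic quorum reads and writes
assert not strongly_consistent(3, 1, 1)  # eventual consistency only
assert strongly_consistent(3, 3, 1)      # read everything, write anywhere
```

Raising R buys consistency at the cost of touching (and caching on) more servers per read, which is exactly the memory dilution described above.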
So cost, or really TCO, is an often-underestimated factor. Many decisions in organizations around the world have been ultimately made based on exactly that: it is not that one solution is better, much faster, or easier to manage than the other. It is simply the bottom line at work, and at a certain scale, the cost can be prohibitive.
Copying data multiple times is an issue when you need that very data in various forms. This varies across use cases within the same framework (say, Parquet or Avro file formats with Impala) or APIs and access patterns (random access in HBase versus large scans in Impala/Spark, or full-text search in Solr). Disks are much more affordable than memory (by a factor of 400 and above), so in practice, we often opt for duplicating data on disk, but not in memory.
Recent efforts in HDFS also point to this conclusion: you can now pin hot datasets in memory, and in the future you will be able to share this read-only data between OS processes without any further copying. Impala and Spark, in particular, will benefit from that. The use cases for HBase and Search are still quite distinct from the other two, but the ability to read HBase snapshots directly from HDFS shows that there is a connection, even if it is still in the form of a rather wobbly suspension bridge.
So where does NoSQL sit in the Hadoop world? It certainly is going strong. But Impala, Spark, and Search are cutting into the data cake with a vengeance. They close the gap toward the stretch area where NoSQL is out of its comfort zone and batch processing is simply too slow. We now span the bridge from batch, to near real-time, to interactive data access. Doing the math on how “expensive” (in comparison), flexible, and fast each option is will guide you toward the proper selection.
Lars George is Cloudera’s EMEA Chief Architect, an HBase committer and PMC member, and the author of O’Reilly’s HBase: The Definitive Guide.