Big Data’s New Use Cases: Transformation, Active Archive, and Exploration
Now that Apache Hadoop is seven years old, use-case patterns for Big Data have emerged. In this post, I’m going to describe the three main ones (reflected in the post’s title) that we see across Cloudera’s growing customer base.
Transformations (T, for short) are a fundamental part of BI systems: They are the process through which data is converted from a source format (which can be relational or otherwise) into a relational data model that can be queried via BI tools.
In the late 1980s, the first BI data stacks started to materialize, and they typically looked like Figure 1.
Figure 1 – Traditional ETL
In this setup the ETL tool is responsible for pulling the data from the source systems (Extract), then it did the transformation into target data (Transform), then it finally loads the data into the target data mart (Load). Traditional ETL tools have four main components:
- Connectors to read/write from/to many different types of sources/targets
- Business logic and metadata for what transformations need to be done
- Transformation libraries for how to perform different types of transformations
- A transformation engine that does that actual data processing
In the mid-to-late 1990s, the data sizes started to grow beyond the ability of ETL tools to finish the transformation work within allocated time (the ETL window) — that is, they couldn’t execute the T fast enough, which forced the industry to move from ETL to ELT (more about ELT later).
Matters were further complicated by human error. For example, if it is discovered that part of the ETL logic is incorrect for the last three months, the transformations have to be redone. This problem often takes weeks to be corrected, as the ETL tool has to redo all that past work while continuing to do the work for new daily data as it arrives. In summary, ETL transformation engines became a significant bottleneck.
ELT is similar to ETL but with one core difference: The heavy lifting of executing the transformation logic is pushed to the databases. Some of the work is pushed to the source databases (downstream), and some to the target databases (upstream). The transformation of unstructured data (i.e. non-relational) is handled by creating a BLOB (Binary Large Object) column in the database, then relying on UDFs (User Defined Functions) and stored procedures that implement the transformation libraries that the ETL also requires.
In this model, the ETL tool is primarily a maintainer of the business logic and an orchestrator of the transformation work. Basically, the ETL tool is no longer in the critical path of data flow, and is simply there to manage the execution as opposed to perform it.
In ELT, the BI data stack starts to look more like Figure 2.
Figure 2: ELT: Source DB -> partial T in Source DB -> Target DB -> partial T in target DB -> BI Reporting (ETL tool orchestrates the execution)
The move to ELT worked because years ago, several RDBM systems started to evolve into MPP architectures (Massively Parallel Processing) supporting execution across a number of servers. That parallelism made the RDBMS a much better place to do the heavy lifting of the transformation versus the ETL tool. It is worth noting that some of the ETL tool vendors tried to parallelize their systems but few succeeded. Furthermore, enterprises had already invested significant money in creating the MPP RDBMS clusters, and they didn’t want to maintain two separate large clusters – one for the ETL grid, and another for the RDBMS.
However, the RDBMSs were created to do queries (Q for short), and not batch transformations. So, as data sizes started to grow again, not only did this approach lead to missing ETL SLA windows (Service Level Agreements), it started missing the Q performance SLAs, too. So, this approach eventually led to a double whammy: the Ts and the Qs slowed down.
As the industry struggled with database Q performance getting bogged down by T, and by missing T completion SLAs, a new solution evolved: Apache Hadoop, a distributed data storage and processing system built from the ground up to be massively scalable (thousands of servers) using industry-standard hardware. Furthermore, Hadoop was built to be very flexible in terms of accepting data of any type, regardless of structure.
Now that Hadoop has been commercially available in the market for years, hundreds of organizations are moving the T function from their databases to Hadoop. This movement achieves a number of key benefits:
- Hadoop can perform T much more effectively than RDBMSs. Besides the performance benefits it is also very fault tolerant and elastic. If you have a nine-hour transformation job running on 20 servers, and at the eighth hour four of these servers go down, the job will still finish — you will not need to rerun it from scratch. If you discover a human error in your ETL logic and need to rerun T for the last three months, you can temporarily add a few nodes to the cluster to get extra processing speed, and then decommission those nodes after the ETL catch-up is done.
- When the T moves to Hadoop the RDBMSs are free to focus on doing the Q really well. It is surprising how much faster these systems get once voluminous unstructured data is moved out of them.
The new data architecture with Hadoop doing the T looks like Figure 3.
Figure 3: Source Systems -> Hadoop -> do T in Hadoop -> Parallel copy to DB -> do Q in DB
A number of ETL tools already support orchestrating execution inside of Hadoop (e.g. Informatica and Pentaho). This allows you to take your existing business transformation logic and define it through the ETL tool as usual, then push the T execution to happen inside of Hadoop.
We define Return on Byte (ROB) as the ratio of the value of a byte divided by the cost of storing that byte. Analytical RDBMSs are built to be very low-latency OLAP systems for running repetitive queries on extremely high-value bytes (hence they cost a significant premium on a per-TB basis). However, the original unstructured data before going through the transformation phase is too voluminous by nature leading to a low value per byte. This is also true for historical data, since as data gets older, the value on a per-byte basis gets lower.
Figure 4: The Return-on-Byte Iceberg
While the unstructured and historical bytes in aggregate have a very high value, the individual bytes have very little value. When the ROB drops below 1 — the byte value is less than the cost — what most organizations do is archive that data (to tape or otherwise). Archival systems are very cheap for data storage (especially when compared to RDBMSs) but they are astronomically expensive for data retrieval (since you can’t run processing/queries in archival systems, you have to recover the data back into an RDBMS). Typically once data is archived you never see it again, it literally takes an Act of God (disaster recovery) or Government (auditing/subpoenas) to bring that data back. So in many ways, archival systems are where data goes to die, despite still having a ton of value.
In addition to being a great system for doing T, Hadoop is also very good at doing Q for high-granularity and historical data (i.e. data with low value density). In October 2012, Cloudera announced a new open-source system called Cloudera Impala. Impala provides low-latency OLAP capabilities (in SQL) on top of HDFS and Apache HBase. This new system complements the MapReduce batch-transformation system so that you can get the best of both worlds: fast T and economical Q. Impala turns Hadoop into an “active archive”; now you can keep your historical data accessible, but more importantly, you can extract the latent value from it (as opposed to keeping it dormant in a passive archival system).
In layman terms, not all of your data deserves to fly in “first class”. Obviously, the critically imperative data, the data with the highest value density (lots of dollars per byte), deserves to bypass all the long lines to arrive to the decision makers as soon as possible. The high cost to store and query that first-class data is justified since its value is congruently high on a per byte basis. However, with Big Data, you are collecting and keeping all of the raw most granular events over many years of history. Though the value of that data in aggregate is very high, the value on a per-byte basis is very small and doesn’t justify a first-class ticket. Instead of grounding that data, it is much better to give it an “economy-class” ticket, which enables it to eventually arrive albeit a bit behind the first-class data. This economy-class option is much better than not arriving at all (passive tape archive), and hence suffering the opportunity cost of not letting that data “fly”.
Figure 5 shows what such a hybrid BI environment looks like.
Figure 5: Active Archive: Hadoop contains a lot of granular historical data, the DB has aggregate recent data, and BI tools on top can query either system using ODBC/JDBC and SQL.
Flexibility is one of the three key differences between traditional RDBMSs and Big Data platforms like Hadoop and Impala (the other two differences being massive scalability and storage economics). At a high level, the flexibility benefit is about allowing the schemas to be dynamic/descriptive as opposed to static/prescriptive.
For RDBMSs a schema has to be defined first, then the input data converted from its original native format to the proprietary storage format of the database. If any new data shows up that wasn’t pre-defined in that static schema, this new data can’t flow in until: (1) the schema is amended, (2) the ETL logic is changed to parse/load that new data, and (3) the BI data model is amended to recognize it.
This adds a lot of rigidity, especially when dealing with quickly evolving data. It does have benefits though — by parsing the data and laying it out efficiently at write-time, the RDBMSs are able to do many optimizations to enable the OLAP queries to finish extremely fast. Furthermore, by having a well-governed static schema across the enterprise, different groups can collaborate and know what the different columns mean. So we want the OLAP optimization and governance of traditional RDBMSs, but we need to augment that with the ability to be agile.
The traditional RDBMS schema-on-write model is good at answering the “unknown known” questions — those that we could model the schema for ahead of time. But there is a very large class of exploratory questions that fall under the category of “unknown unknowns” — questions that we didn’t know/expect ahead of time. Hadoop with Impala allows us to augment RDBM system with the capability to do such explorations.
For example, in your day-to-day BI job, you might get a new question from your business analyst that isn’t modeled appropriately in your data model. You are now faced with a dilemma: How can I be sure this question is important enough for me to change all the business logic necessary to expose the new underlying data for that question? It’s a chicken-and-egg problem because you can’t know the true value of the question without having the capability to ask the question at first place!
Impala with Hadoop gives you the capability to ask these types of questions. Then, once you find the true value, and decide that this is a question you want to ask over and over again, then you will have the justification necessary to go through adding this new attribute or dimension to your pristine data model. It takes the guesswork out of the process, it allows us to iterate quickly, fail-fast, and finally pick the winning insights in a much more agile way.
Impala, and Hadoop at large, get their flexibility from employing a schema-on-read model versus schema-on-write. In other words, the schema is described at the latest stage possible (aka “late binding”) when the question is being asked as opposed to the earliest stage possible when the data is being born. The underlying storage system in Hadoop accepts files in their original native format just like any other filesystem; the files are just copied as-is (e.g. tab-delimited text files, CSV, XML, JSON, syslog files, images, …). When you are running a query inside Impala it then parses the file and extracts the relevant schema at run time. (Think of it as ETL on the fly.) You can see how this approach of parsing data during query runtime can lead to slower speed for the interactive queries compared to an RDBMS that paid the cost once at write time, but the benefit is extreme agility. New data can start flowing into the system in any shape or form, and months later you can change your schema parser to immediately expose the new data elements without having to go through an extensive database reload or column recreation.
You learned about three core use cases for Hadoop here:
- Transformations are bogging down the query performance of RDBM systems. RDBMSs were built to do Queries very well. T is much better suited for a batch processing system like Hadoop which offers the agility of absorbing data of any type, scalably processing that data, and at an economical cost matched to the value of such raw data.
- Moving data with low value density to passive archival systems, aka data graveyards, leads to a significant loss of value since while that data is of low value on a per byte basis, it is extremely valuable in aggregate. Impala allows you to continue to extract the latent value from your data by powering an active archive that is economically suited for the value of the data.
- The descriptive data-model agility of Hadoop and Impala makes them a very powerful tool for exploring all of your data and extracting new value that you couldn’t have been able to find before due to the rigid schemas of RDBMSs. You truly gain the ability to explore the unknown unknowns, and not just the known unknowns.
Remember, this isn’t an either-or proposition; RDBM systems have their place when it comes to high-value known questions and rigid governance constraints.
Amr Awadallah is the CTO of Cloudera.