I recently presented a How-tos for Gurus series session on data modeling for big data systems. During the presentation, a number of attendees asked some very interesting questions. As many of you know, big data systems are known to have less formality around the need for structure, yet for data warehouses to continue to serve traditional use cases, modeling is still a very important function. I’d like to share some of the more common questions I received during the session, and when visiting organizations, and respond to them.
In big data environments, are there any modeling techniques I can use to improve query performance?
To improve query performance, it depends on what tool you are using. The following guidelines can help:
- Ensure you select the best storage for your end user’s queries. For example, if you’re running a lot of short queries, you should consider HBase. For long-running analytic queries, You might find that Kudu is better. Ideally, examine the queries to be run, and decide on the appropriate file format for those use cases.
- Use the right query engine for the workload. For example, long running queries traditionally seen in enterprise data warehouse scenarios, feeding dashboards or standard reports, Hive on LLAP is great. Impala, on the other hand, is fantastic for ad-hoc querying into 100TB+ of data. When configuring your query engine, you should also ensure the partitioning is set, collect statistics, make sure your joins are being appropriately designed, look at the query performance reports and optimize accordingly.
- Make sure that the tools you select to retrieve data are appropriate for each use case. Look at tools like Phoenix or HBase with APIs for running queries, and then use Impala or Hive on LLAP for querying the data.
Our data scientists like denormalized tables, or “feature files”. Can we still keep this concept while we model our big data systems?
Absolutely. This is a core feature of the modern data warehouse, and it is called an Analytical Base Table (ABT). Imagine we are a major telecom company, with tables for service usage, incoming calls, network elements, etc. To build a customer churn model across all these tables, we create an ABT for customer data, and build our data science model on top of that. We can slice by customer, by cell tower, by revenue model, etc. ABTs are like data marts created on top of the data warehouse, whether it is star schema or not, so that tools like SAS, or R, or others that need a flat structure, may operate without having to restructure the data, and without forgoing the need for other use cases to have the more traditional fact and dimension type data model.
Are there any Industry data model for Internet of Things and Big Data warehouses?
Pre-formed, predefined industry specific data models used to be quite important, and many major data warehouse vendors would supply them as part of the data warehouse solution. And while we do still see a number of these in use today, the world has changed quite a bit from the 1990s and 2000s, where we would create such a model. The changing nature of the data we use today forces us to question structured norms. A great example of this can be found in this blog on the data model consequences of a United States Supreme Court decision on marriage (https://qntm.org/support) where changing a data model built decades ago using a heteronormative construct to one that accommodates not just same-sex marriages, but the aded dimensions of divorce, re-marrriage and even gender change after marriage for one or both partners, can become a challenge for traditional structures. So the answer to industry standard modeling in a big data world is that we do not model the entire industry – instead we model for the end user requirements, and as such multiple models, which change at any point in time, flow easily from the data and allow for multiple structures, on the same data, to accommodate each use case, rather stick to a one size fits all approach.
For example, in a telecom company, call data is stored in three or four different formats. The first is for monitoring agencies to see who is calling who, which may be stored as a graph. The second is an HBase or Kudu store may be queried according to mobile number to retrieve the last 10-30 calls – a very discrete query. HDFS might also be used for long term analysis, like total calls per day for a given city or region. At the end of the day, this is all the same data, stored three ways, for three use cases, to ensure optimal results. Industry data models are not obsolete, per-se, but they need to be complemented by more flexible methods of data modeling at the use case level. Remember that in big data, we have the luxury of defining structure after data ingest, and defining structure on demand, so we can take advantage of these more modern approaches for our benefit.
When modeling relational structures, we often rely on indexing to make searches faster. In big data modeling, do we still need to worry about indexing?
Yes, and no. It all depends on the file format and the data. For example, when using Hadoop HDFS, the storage technology makes searches faster by massive parallelism, and therefore you do not have, or need, traditional indexing. ORC does have a concept of indexing, however it also uses bloom filters. For example, in a telecom data model, we have a primary key defined as a subscriber’s mobile number, with columns like customer type, customer city, customer address, etc. in ORC we may create a bloom filter on all of these columns, and when you select rows from this table, the filter will kick in and will only read the ORC files where there exists some data for the search criteria (like city is Los Angeles). Remember in big data systems, we distribute data in files across hundreds, or thousands, of partitions, and when you have large tables with large partitions, you can skip lots of data files from being included when using the Bloom filter, essentially replacing the concept of a traditional index, especially at a massive scale.
What kind of partitioning or bucketing is needed for joining fact and dimension tables for reporting?
Partitioning can be very helpful, depending on the storage being used. In a big data context, partitioning can be very helpful in reducing the number of files needed to be examined to return search results (see the response above about Bloom Filters for more on that). For example we regularly will partition a fact table by date, or for very large data sets, even by the hour. For dimensions, we can partition based on use cases, for example, if our users are regularly looking for results within their area, we can partition by geography. However, you are not limited to only one partitioning method, as you can partition logically, as well, and that is very helpful as the same data will be used by different users with different motivations, and as a result we can have multiple partitions each serving a different business need.
When modeling for big data, do surrogate keys help in better join performance compared to natural keys?
Yes, surrogate keys can definitely help. Often we find that surrogate keys are basically faster to join, especially when the natural key is a string column. Integers are easier to compare for join performance. However, there are other advantages. Surrogate keys make sure you are isolated from the source system changes. For example, if you move from an on-premises sales force management tool to one based in the cloud, you do not have to map old natural key to new natural key, the surrogates can remain the same and help ensure consistency of data feeds into the warehouse without having to change end reporting.
Can we join a very large fact table with nearly one billion records with multiple dimension tables, some of them with over a million records each?
Yes, this is where a modern data warehouse really shines, especially with the latest versions of Cloudera’s solution, these types of joins can even be done very fast. Overall performance is data and configuration dependent, so we recommend using tools like Cloudera Workload XM to help, or consult a specialist in designing your data warehouse for such large workloads.
Data models change over time. I know how we manage schema versioning in production systems in relational databases. Is versioning different when dealing with modeling big data?
Data modeling version control is no different from versioning in a traditional environment. For example, in Parquet and ORC, it is very easy to simply add a new column, but it is not easy to dropone. Changing data types may require a function to convert stored data (like string to integer). Often, if you want to make significant changes, you might have to re-create the dimension or fact table. However, just like with relational systems, there are techniques you can use to make it easier: like instead of changing a column data type, simply add a new column with the new datatype. Remember, that in a big data world, adding a column is just adding a column definition to the metadata, and only when a row has values set do we add any data to be stored. This essentially means that there is virtually no storage burden for doubling up columns such as this.
Is a big data based warehouse essentially the same as the Data Vault 2.0 concept?
Data Vault 2.0 is not a big data based data warehouse, and it is not a replacement for normalized and dimensional modeling. Data Vault 2.0 is a new way to define the staging area, but you still have the need to do a traditional model for the data warehouse itself. This is because you cannot report on a data vault with your favorite SQL-based BI and analytics tools – you need a data model to be able to make sense of the data.
Is traditional data warehousing dying?
Traditional data warehouses are not dying, what is happening is that data warehousing as a discipline is effectively evolving. It is adapting. If you recall, the age old world of building data warehouses op down used to lead to a rate of failure that was once reported at statistics of 70-80%. Imagine having 2 or 3 years invested in developing a traditional data warehouse, with all that R&D, and then see it fail. This means we need to develop data warehouses in a much more agile way, keeping abreast of business user’s changing needs, becoming more fluid and fast, and ready to adapt. With a bottom up, develop quickly, deploy, rinse and repeat according to the requirements of the project, we make data warehousing agile and adaptable and ready in days, or weeks, what used to take months and years. This means all business uses can be accomodated, quickly, without rigid infrastructure, practices and methods in the way.
Watch How-Tos For Gurus: Data Modeling for Hadoop to hear more about the best practices for data modeling in Hadoop.