Our thanks to Rakesh Rao of Quaero, for allowing us to re-publish the post below about Quaero’s experiences using partitioning in Apache Hive.
In this post, we will talk about how we can use the partitioning features available in Hive to improve performance of Hive queries.
Hive is a good tool for performing queries on large datasets, especially datasets that require full table scans. But quite often there are instances where users need to filter the data on specific column values. Generally, Hive users know about the domain of the data that they deal with. With this knowledge they can identify common columns that are frequently queried in order to identify columns with low cardinality which can be used to organize data using the partitioning feature of Hive. In non-partitioned tables, Hive would have to read all the files in a table’s data directory and subsequently apply filters on it. This is slow and expensive—especially in cases of large tables.
The concept of partitioning is not new for folks who are familiar with relational databases. Partitions are essentially horizontal slices of data which allow larger sets of data to be separated into more manageable chunks. In Hive, partitioning is supported for both managed and external tables in the table definition as seen below.
CREATE TABLE REGISTRATION DATA ( userid BIGINT, First_Name STRING, Last_Name STRING, address1 STRING, address2 STRING, city STRING, zip_code STRING, state STRING ) PARTITION BY ( REGION STRING, COUNTRY STRING )
As you can see, multi-column partition is supported (REGION/COUNTRY). You do not need to include the partition columns in the table definition and you can still use them in your query projections. The partition statement lets Hive alter the way it manages the underlying structures of the table’s data directory. If you browse the location of the data directory for a non-partitioned table, it will look like this:
.db/. All the data files are directly written to this directory. In case of partitioned tables, subdirectories are created under the table’s data directory for each unique value of a partition column. In case the table is partitioned on multiple columns, then Hive creates nested subdirectories based on the order of partition columns in the table definition. For instance, from the above example of the registration data table the subdirectories will look like the example below.
/quaero.db/registration-data/region=South America/country=BR /quaero.db/registration-data/region=South America/country=ME /quaero.db/registration-data/region=North America/country=US /quaero.db/registration-data/region=North America/country=CA
When a partitioned table is queried with one or both partition columns in criteria or in the
WHERE clause, what Hive effectively does is partition elimination by scanning only those data directories that are needed. If no partitioned columns are used, then all the directories are scanned (full table scan) and partitioning will not have any effect.
A few things to keep in mind when using partitioning:
- It’s important to consider the cardinality of the column that will be partitioned on. Selecting a column with high cardinality will result in fragmentation of data and put strain on the name node to manage all the underlying structures in HDFS.
- Do not over-partition the data. With too many small partitions, the task of recursively scanning the directories becomes more expensive than a full table scan of the table.
- Partitioning columns should be selected such that it results in roughly similar size partitions in order to prevent a single long running thread from holding up things.
hive.exec.dynamic.partition.modeis set to
strict, then you need to do at least one static partition. In non-strict mode, all partitions are allowed to be dynamic.
- If your partitioned table is very large, you could block any full table scan queries by putting Hive into strict mode using the
set hive.mapred.mode=strictcommand. In this mode, when users submit a query that would result in a full table scan (i.e. queries without any partitioned columns) an error is issued.
Quaero’s data management platform (QDMP) uses partitioning extensively and we have greatly benefited from it. QDMP uses an identifier called “dataset instance id” to identify a chunk of data that flows through the system. The system also uses this column to collect stats about the data as it passes through various phases. Partitioning the tables of QDMP on this column, along with using Apache Parquet (incubating) as the storage format, helped us cut down the run times of our workflows by almost half.
This improvement was especially evident in the case of tables that were holding large historical data — prior to partitioning, a full table scan of these tables was done in order to collect the stats. Partitioning also enabled us to selectively expire portions of data without having to rebuild the table. In addition, we also partitioned our embedded analytics tables that are frequently queried upon by analytics team members. In this case, we selected the candidate columns for partitioning after analyzing the data query patterns.
In conclusion, in our experience, using Hive partitioning in the right context and on appropriate columns will help a data management platform be much more efficient.
Rakesh Rao is a Lead/Advisory Software Engineer at Quaero.