Thanks to Guy Harrison of Dell Inc. for the guest post below about time-tested performance optimizations for connecting Oracle Database with Apache Hadoop that are now available in Apache Sqoop 1.4.5 and later.
Back in 2009, I attended a presentation by a Cloudera employee named Aaron Kimball at the MySQL User Conference in which he unveiled a new tool for moving data from relational databases into Hadoop. This tool was to become, of course, the now very widely known and beloved Sqoop!
As someone who had worked with relational databases—particularly Oracle Database—for more than 20 years and who was becoming very interested in Hadoop, Sqoop immediately piqued my interest. A general-purpose tool for moving data between RDBMS and Hadoop obviously seemed important, but experience had already taught me that generic database access methods are rarely optimal for a specific database because each RDBMS implements its own fast-path access methods to gain competitive advantage. It instantly occurred to me that using these fast-path methods was clearly the key to making Sqoop not just a good general-purpose tool but also a high-performance solution.
For MySQL, Aaron had already integrated a fast-path export based on the
mysqldump utility. I wondered if my team at Quest Software (now part of Dell Software) could create a fast-path access method for Oracle.
When we looked at Sqoop, we found a few significant opportunities for optimization. The most significant involved how Sqoop distributes work across mapper tasks. When parallelizing across a large table, Sqoop uses primary-key ranges to break up the work. Each mapper runs a query to grab a range of primary key values. This approach can lead to a few undesirable outcomes in Oracle Database, for example:
- The Oracle optimizer uses cost-based algorithms to decide between primary key index lookups and full table scans. Sometimes, Oracle will execute each primary key range lookup by a full table scan—resulting in multiple scans hitting the Oracle database simultaneously.
- Oracle Database may even decide to use its own parallel query capability to parallelize individual mapper tasks—increasing the load on the database and disturbing relative load created by each mapper.
- In Oracle Database, physical location of data is not normally dictated by primary key—so a physical block on disk may contain data required by multiple mappers. Consequently, individual blocks on disk would each be accessed multiple times creating excessive IO.
- When the Oracle Database optimizer decides to use index scans, the RDBMS will load the resulting data into its buffer cache—driving out blocks required for normal database operation with blocks that are only needed by Sqoop.
- Primary keys are often not uniformly distributed: as older rows are deleted, the older primary key ranges become sparse. Sqoop’s algorithms could not take this into account and often one mapper would have many more rows to process than another.
Thus for our initial implementation of a Sqoop connector (informally called “OraOop”), we had the following design goals:
- Partition data to the mappers based on physical storage characteristics so that each mapper is allocated a completely distinct set of physical blocks and no physical block is read more than once.
- Make sure each mapper receives an equal amount of work.
- Bypass Oracle Database parallelism and the Oracle buffer cache.
- Neither require nor use indexes.
- Use Oracle “direct path” IO.
The first release of OraOop in 2010 achieved these goals, and as Sqoop added bi-directional transfer capabilities, we performed similar optimizations for moving data from Hadoop to Oracle, including exploiting the Oracle direct-path insert capability, utilizing Oracle partitions to maximize parallelism, and adding a
MERGE capability that simultaneously updates and inserts into the target Oracle tables.
Despite the free availability of our Apache-licensed connector, many Sqoop users understandably continued to use default Sqoop, sometimes with disappointing outcomes. So in early 2014 Dell, Cloudera, and others in the Apache Sqoop community collaborated to bring the OraOop code directly into core Sqoop. We are all very happy to see this functionality now fully integrated into Sqoop, starting with release 1.4.5 (packaged in CDH 5.1 and later).
Assuming you use the
direct=true clause on your Sqoop command line, all the optimizations outlined above will be employed. (Use of this clause does require that the Oracle account have the privileges required to read the Oracle extent information; it is not supported for views or index-organized tables.) There is no longer any need to install an additional Oracle connector.
The chart below illustrates a typical reduction in Oracle Database overhead when using the new Sqoop 1.4.5
-direct=true setting (import of a 310GB, 1-billion row table). The test platform had the following characteristics:
- 9-node cluster on AWS EC2 (1 x NameNode/JobTracker, 8 x DataNode/TaskTracker), each with 8 CPUs and 15GB RAM, running CDH 5.1
- 1 32-CPU database server with 60GB RAM, running Oracle Database 184.108.40.206.0
- 10GB Ethernet on the database server, 1GB Ethernet on the Hadoop nodes
As you can see, overall elapsed time reduced by 83%, database time (total time consumed by all database operations) reduced by 90%, and IO requests reduced by 99%! The next chart shows the scalability profile for the
direct=true option compared to
direct=false (the only available option for Oracle Database transfers to Sqoop prior to 1.4.5).
We can draw these conclusions:
- When data is un-clustered (arranged on disk in effectively random order), the
direct=falsemode scales very poorly (blue line) because the index-range scans do not result in contiguous IO requests on disk—resulting in a lot of duplicate IO by each mapper.
- As the number of mappers increases, this IO load clobbers the database and elapsed time degrades. When data is highly clustered (green line), performance scales better because each mapper’s index-range scan accesses distinct blocks of data on disk.
- The new Sqoop direct mode (red line) is unaffected by the clustering of primary keys because it always reads blocks directly from disk in physically contiguous ranges. The absolute improvement in elapsed time varies, but is generally from 5-20 times greater than the non-direct mode.
As you can see, the performance benefits and reduction in load on the Oracle side of these optimizations is extremely significant. If you are performing data transfers between Oracle Database and Hadoop, we encourage you to try out the new Scoop 1.4.5 direct mode.
Guy Harrison (@guyharrison) is an Executive Director of Research and Development at the Dell Software Group. Guy is the author of five books and many articles on database and data management and writes a monthly column for Database Trends and Applications.