Cloudera Connector for Teradata 1.0.0

Apache Sqoop (incubating) provides an efficient approach for transferring big data between Hadoop-related systems (such as HDFS, Hive, and HBase) and structured data stores (such as relational databases, data warehouses, and NoSQL systems). The extensible architecture used by Sqoop allows support for a data store to be added as a so-called connector. By default, Sqoop comes with connectors for a variety of databases such as MySQL, PostgreSQL, Oracle, SQL Server, and DB2. In addition, third-party connectors are available separately from various vendors for several other data stores, such as Couchbase, VoltDB, and Netezza. This post takes a brief look at the newly introduced Cloudera Connector for Teradata 1.0.0.

Features

A key feature of the connector is that it uses temporary tables to provide atomicity for data transfer. This ensures that either all or none of the data are transferred during import and export operations. Moreover, the connector opens JDBC connections against Teradata for fetching and inserting data, and it automatically injects the appropriate parameters underneath to use the FastExport/FastLoad features of Teradata for fast performance.

Installation

The first thing you need to do is install Sqoop; the CDH3 documentation serves as a good reference on how to do this. You also need the Teradata JDBC JAR files (terajdbc4.jar and tdgssconfig.jar), which should be placed under the lib directory of your Sqoop installation so that Sqoop can pick them up at run time. The last step is to enable Sqoop to process the Teradata JDBC URL syntax with the specialized Teradata manager factory. To do this, add the following inside a sqoop-site.xml file within the configuration directory of your Sqoop installation:

<configuration>
  <property>
    <name>sqoop.connection.factories</name>
    <value>com.cloudera.sqoop.manager.TeradataManagerFactory</value>
  </property>
</configuration>
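
For reference, placing the driver JARs mentioned above might look like the following. The paths are only assumptions for illustration; adjust them to wherever the Teradata JDBC driver was unpacked and wherever Sqoop is installed on your system.

# Copy the Teradata JDBC driver JARs into Sqoop's lib directory
# (example paths only -- adjust to your environment)
cp /path/to/teradata-jdbc/terajdbc4.jar /usr/lib/sqoop/lib/
cp /path/to/teradata-jdbc/tdgssconfig.jar /usr/lib/sqoop/lib/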
 


Once you have finished the above steps, you are ready to download the connector and give it a try.

Data Import

Using this connector is similar to using other connectors, with a difference in the JDBC URL syntax, which looks like jdbc:teradata://<hostname>/DATABASE=<database name>. For example, to import all data from a table named MY_DATA on a database named MY_BASE into HDFS:

sqoop import \
  --connect jdbc:teradata://hostname/DATABASE=MY_BASE \
  --username sqooptest --password xxxxx \
  --table MY_DATA --num-mappers 3
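
After the import completes, the data lands as files under a directory named after the table in your HDFS home directory (by default). A quick sanity check, assuming that default location, could be:

hadoop fs -ls MY_DATA
hadoop fs -cat MY_DATA/part-m-00000 | head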
 


You can import the data into Hive as well by modifying the above example slightly:

sqoop import \
  --connect jdbc:teradata://hostname/DATABASE=MY_BASE \
  --username sqooptest --password xxxxx \
  --table MY_DATA --num-mappers 3 \
  --hive-import
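
With --hive-import, Sqoop also creates a matching Hive table (named MY_DATA by default) and loads the imported files into it. A simple check of the result, assuming that default table name, could be:

hive -e "SELECT COUNT(*) FROM MY_DATA"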
 


Data Export

Similarly, to export all data under an HDFS directory named MY_DATA to a table with the same name in a Teradata database named MY_BASE:

sqoop export \
  --connect jdbc:teradata://localhost/DATABASE=MY_BASE \
  --username sqooptest --password xxxxx \
  --table MY_DATA --num-mappers 3 \
  --export-dir /user/johndoe/MY_DATA
 


Temporary Tables

By default, temporary tables are used to ensure atomicity. During an import operation, a temporary table is created with a copy of the original data divided into a number of partitions. The number of partitions in the temporary table is the same as the number of mappers in Hadoop (as specified by the user). Each mapper then pulls data from a single partition, with all mappers running in parallel. After the data are successfully imported, the temporary table is deleted.

In the opposite direction, during an export operation, each individual mapper creates a temporary table to load data into. As with import, multiple mappers run in parallel. At the end, data from all temporary tables are merged into the target table, and the temporary tables are then deleted. If for any reason the export operation cannot be completed successfully, the target table remains intact.

Disabling the temporary tables

While temporary tables provide the benefit of atomicity, they also take up additional time and space. Hence, there is a switch available to turn off temporary table usage as needed, depending on the use case. To disable temporary tables during an import operation, for example:

sqoop import \
  -D sqoop.teradata.import.use.temporary.table=false \
  --connect jdbc:teradata://hostname/DATABASE=MY_BASE \
  --username sqooptest --password xxxxx \
  --table MY_TABLE --num-mappers 3
 


In this case, all mappers pull data from the source table MY_TABLE directly at the same time.

Note that temporary tables cannot be disabled during export operation due to the technical constraint of using the FastLoad feature.

Choosing a different location

The temporary table is created under the database that is currently connected.

sqoop import \
  --connect jdbc:teradata://hostname/DATABASE=MY_BASE \
  --username sqooptest --password xxxxx \
  --table MY_DATA --num-mappers 3
 


In the above example, the temporary table is created under MY_BASE, which happens to be the same database where the source table MY_DATA is located. This can be changed by providing a qualified table name under a database different from the one that is currently connected. For example,

sqoop import \
  --connect jdbc:teradata://hostname/DATABASE=TEMP_BASE \
  --username sqooptest --password xxxxx \
  --table MY_BASE.MY_DATA --num-mappers 3
 


In this case, the temporary table is created under TEMP_BASE, while the source table MY_DATA is located under MY_BASE. As a result, you can have a dedicated space for all temporary data if needed.

Performance Consideration

From the Hadoop perspective, you can generally increase the number of mappers for better performance. The number of mappers running in parallel may be up to twice the number of task trackers (by default). However, you also need to take your Teradata settings into consideration: parallelism may be limited by the maximum number of FastExport/FastLoad processes allowed in Teradata at the same time (see the MaxLoadTasks and MaxLoadAWT settings in Teradata for further details).
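
For example, raising the mapper count is just a matter of changing --num-mappers. The value below is only an illustration; the effective parallelism will still be capped by the Teradata settings mentioned above.

sqoop import \
  --connect jdbc:teradata://hostname/DATABASE=MY_BASE \
  --username sqooptest --password xxxxx \
  --table MY_DATA --num-mappers 8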

This connector takes advantage of the FastExport/FastLoad features of the underlying JDBC connection for fast performance. These features are recommended in Teradata for dealing with large amounts of data, so the connector is best used in such situations.

Where to Get More

This post only provides an overview of the Cloudera Connector for Teradata. If you are interested in learning more, a full user guide is available here.

1 Response
  • Sanjeev / December 03, 2012 / 6:59 AM

    Thanks, very informative post on teradata connector for sqoop.

    We are doing a proof of concept with this connector to import and export very large datasets (upwards of 500 million rows). In our testing, we observe that regardless of the number of mappers we specify in the import command, the bulk of the dataset rows are handled by one mapper, which in turn defeats the purpose of parallelism. We specify a unique primary key column, so that should not be an issue. Can you point us in the right direction?
