Skool: An Open Source Data Integration Tool for Apache Hadoop from BT Group

Categories: Data Ingestion, Guest, Hadoop

In this guest post, Skool’s architects at BT Group explain its origins, design, and functionality.

With increased adoption of big data comes the challenge of integrating existing data that sits in various relational and file-based systems with Apache Hadoop infrastructure. Although open source connectors (such as Apache Sqoop) and utilities (such as HttpFS or curl on Linux) make it easy to exchange data, data engineering teams often spend an inordinate amount of time writing code for this purpose. That time investment is usually driven by two factors: the many different data structures involved, and the need to support enterprise security frameworks such as Kerberos.

For example, at BT, one common use for Hadoop involves analyzing the performance of broadband lines. This process involves importing data from multiple relational database systems containing columns for product, location, faults, orders, configuration, and so on, spanning hundreds or thousands of tables. Furthermore, many different file feeds containing various network performance parameters have to be ingested, in real time, into HDFS for analysis. This effort also includes creating Apache Oozie workflow jobs for milestone and incremental pulls with different latencies, and subsequently creating the requisite Apache Hive tables.

To facilitate this process, BT’s data engineering team wanted a reusable framework that would generate deployment-ready scripts, support automated regression testing, and offer the flexibility to add desired customizations. They evaluated several commercial and open source tools to fill that need; however, all of them were ruled out for one reason or another. For example:

  • Gobblin is more focused on data flow scheduling than on ingestion or extraction.
  • Apache NiFi does not cover the requisite end-to-end flow, and its Kerberos integration requires keytab files for each accessing user, which did not fit BT’s approach.
  • Oracle Data Integrator has limited support for big data sources.

In the end, BT data engineers decided to create their own framework that would replace expensive custom code with a few clicks; it evolved into a tool called Skool. Skool, which BT has now open sourced under the MIT License, supports the following functions:

  • Seamless data transfer between HDFS and relational databases (currently Oracle Database, Microsoft SQL Server, MySQL, or any JDBC-compliant database) or flat files
  • Automatic creation of Hive tables
  • Automatic generation and deployment of file-creation scripts and jobs from Hadoop or Hive tables
  • Automatic regression testing

In summary, Skool automates Hadoop data engineering by taking a property/configuration file as input (a hypothetical sketch of such a file follows the list below) and using it to do the following:

  1. Validate all user-provided details
  2. Connect to the source database
  3. Do an import (for one record) as a check
  4. Generate all required files and push them to HDFS
  5. Generate a workflow to be scheduled in Oozie
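
For illustration only, such an input file might look like the sketch below; every key name here is hypothetical, and the real keys are documented in configuration.properties.template (see the installation section):

  # Hypothetical Skool input file -- key names are invented for illustration;
  # see configuration.properties.template for the actual keys and comments
  source.db.type=oracle
  source.db.jdbc.url=jdbc:oracle:thin:@//dbhost.example.com:1521/ORCL
  source.db.schema=BROADBAND
  source.db.tables=FAULTS,ORDERS,LINE_CONFIG
  target.hdfs.path=/data/raw/broadband
  target.hive.database=broadband_raw
  load.mode=incremental
  load.frequency=daily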

[Figure: skool-f1]

Currently, Skool implements Sqoop/HttpFS for data transfer and supports HDFS for Hadoop storage; for query execution, it relies on Hive and Apache Impala (incubating). In future releases, Skool will offer Apache Kudu as a storage option and Apache Spark as a query-execution option.

[Figure: skool-f2]

Skool Architecture

The diagram below illustrates the loosely coupled nature of Skool’s components. At a high level, everything is governed by the Skool configuration file, which is used across all independent modules. Running Skool automatically generates all scripts and required files, as well as the Oozie coordinator and workflow XML that in turn perform milestone/incremental pulls as Apache Avro data files.

[Figure: skool-f3 (Skool architecture)]

Other Skool features include:

  • Automatic or scheduled execution of delta and milestone replication jobs with a defined data-refresh frequency
  • Configurable selection of the tables/columns/files to be transferred into or out of HDFS
  • Automatic performance optimization based on table size, database partitions, file formats, and compression
  • Support for audit tables/lineage
  • Out-of-the-box support for use with Kerberized clusters (CDH 5.5 and later)

Installing Skool

Installing Skool is easy. First, create a directory (for example, Skool_tool):
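A minimal sketch of this step (the directory name and location are your choice):

  # Create a working directory for the Skool tool and move into it
  mkdir Skool_tool
  cd Skool_tool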

Then, do:
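The exact commands are not reproduced here; as a sketch, assuming the sources are fetched from the Skool GitHub repository and built with Maven (the repository URL below is a placeholder, and the JDK/Maven versions should match the release notes):

  # Fetch the Skool sources (substitute the actual repository URL)
  git clone <skool-repository-url>
  cd <skool-repository-directory>

  # Build with Maven, using the JDK and Maven versions from the release notes
  mvn clean package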

Edit the configuration.properties.template, password.properties.template, and log4j.properties.template files and rename them to configuration.properties, password.properties, and log4j.properties, respectively. To adapt configuration.properties to your cluster specifications, follow the comments in the template file (configuration.properties.template). Then, do:
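The launch command below is likewise a sketch: the entry-point class and the di_tool_runnable directory are the ones named elsewhere in this post, while the classpath layout and argument are assumptions for illustration:

  # Launch Skool with the edited property files on the classpath
  # (classpath layout and the configuration argument are illustrative)
  java -cp "di_tool_runnable/*:." \
       com.bt.dataintegration.property.config.DIConfigService \
       configuration.properties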

(Note: The ojdbc6-11.2.0.3.jar file is required in the <Skool/di_tool_runnable/> directory.)

Conclusion

Today, Skool is running in production at BT as its “strategic” data loading tool. By open sourcing Skool, we hope to involve new contributors in building out widely useful new functionality such as:

  • Support for integration with other RDBMS sources, such as IBM Netezza/DB2, and with NoSQL sources such as Apache Cassandra, MongoDB, and so on
  • Ability to process flat-file formats such as binary and XML (currently, Skool supports only delimited files)
  • Support for Hadoop storage file formats such as Apache Parquet, RC, and Apache ORC (currently, Skool supports only Avro)

If you want to report a bug, request a feature, or work on something else, please get involved via GitHub or write to us at SKOOL.SUPPORT@bt.com.

Nitin Goyal is Platform-Director, Revenue Assurance & Big Data COE, at BT. 

Anup Goel is Engineering & In-life Lead, Revenue Assurance & Big Data Centre Of Excellence (CoE), at BT.

Sangamesh Gugwad is Lead Consultant, Big Data COE at BT.

Manish Bajaj is a Data Engineer at BT.

Abhinav Meghmala is a Big Data Consultant/Developer at BT.

Prabhu Om is a Big Data Analyst at BT.


8 responses on “Skool: An Open Source Data Integration Tool for Apache Hadoop from BT Group”

  1. Alex

    I am very excited to hear this news, and thank you so much for open sourcing it. I would like to see for myself how it works and need your help setting it up for the first time.
    I am using a QuickStart VM from Cloudera running CDH 5.5.2. I have a MySQL db running in the VM. Now, how can I go about setting up Skool in my VM and connecting it with my MySQL db? A brief explanation would definitely help the many folks who are interested in this technology and would like to set it up in their VM.

  2. Joe Witt

    A quick search for ‘NiFi kerberos’ quickly shows the statement about Apache NiFi and Kerberos to be incorrect.

    Apache NiFi does indeed integrate with Kerberos both for authentication and for interaction with a variety of Kerberos enabled protocols.

  3. Himanshu

    Hello,
    I tried to install Skool in my dev environment and am not able to start it.
    I got the error below.

    Error: Could not find or load main class com.bt.dataintegration.property.config.DIConfigService

    Please suggest how to resolve this dependency.

    1. Nitin Goyal

      @Himanshu – As discussed, please check the versions of JDK and Maven to ensure they match the release notes.

      @Joe – Yes, Apache NiFi can indeed integrate with Kerberos; however, the integration requires keytab files, etc., for each accessing user, so we took a different approach. We will get the blog post edited to correct the statement about NiFi.

      @Alex – At this time the MySQL connector is a work in progress – Oracle does work. Maybe you can give MySQL connector a try and contribute to Skool? :-)

  4. Michał Woś

    But what about Kafka->HDFS ingestion? Can Skool do that? Gobblin seems pretty straightforward for that use case.

  5. nikhil

    Can we use Skool to move data between two clusters in different data centers, like we do with distcp?