In this guest post, Skool’s architects at BT Group explain its origins, design, and functionality.
With increased adoption of big data comes the challenge of integrating existing data sitting in various relational and file-based systems with Apache Hadoop infrastructure. Although open source connectors (such as Apache Sqoop) and utilities (such as HttpFS/curl on Linux) make it easy to exchange data, data engineering teams often spend an inordinate amount of time writing code for this purpose. This time investment is usually needed for two main reasons: the many different data structures involved, and the need to support enterprise security frameworks like Kerberos.
For example, at BT, one common use for Hadoop involves analyzing the performance of broadband lines. This process involves importing data from many different relational database systems containing columns for product, location, faults, orders, configuration, and so on, spanning hundreds and thousands of tables. Furthermore, many different file feeds containing various network performance parameters have to be ingested, in real time, into HDFS for analysis. This effort also includes creating Apache Oozie workflow jobs for milestone and incremental pulls with different latencies, and then creating the requisite Apache Hive tables.
To facilitate this process, BT’s data engineering team wanted a re-usable framework that would generate all the scripts that are ready to be deployed, support automated regression testing, and offer the flexibility to add desired customizations. They evaluated several commercial and open source tools to fill that need; however, all of them were ruled out for one reason or another. For example:
- Gobblin is more focused on data flow scheduling than on ingestion or extraction.
- Apache NiFi does not cover the requisite end-to-end flow, nor does it integrate with Kerberos.
- Oracle Data Integrator has limited support for big data sources.
In the end, BT data engineers decided to create their own framework that would replace expensive custom code with a few clicks, which evolved into a tool called Skool. Skool, which BT has now open sourced under the MIT License, supports the following functions:
- Seamless data transfer to/from a relational database (currently Oracle Database, Microsoft SQL Server, MySQL, or any JDBC-compliant database) or flat files and HDFS
- Automatic creation of Hive tables
- Automatic generation and deployment of file-creation scripts and jobs from Hadoop or Hive tables
- Automatic regression testing
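To make the "automatic creation of Hive tables" item concrete, here is a minimal sketch of how a tool might generate the DDL for an external Hive table over Avro files already landed in HDFS. It is not taken from Skool's code: the function name, table name, and paths are hypothetical, and the statement assumes Hive 0.14 or later, where STORED AS AVRO can derive the columns from the schema file.

```python
def hive_avro_ddl(table: str, location: str, schema_url: str) -> str:
    """Build a CREATE EXTERNAL TABLE statement for Avro files in HDFS.

    Hypothetical helper for illustration; Skool's real generator may differ.
    """
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table}\n"
        f"STORED AS AVRO\n"
        f"LOCATION '{location}'\n"
        # Columns are read from the Avro schema file rather than listed here.
        f"TBLPROPERTIES ('avro.schema.url'='{schema_url}');"
    )


if __name__ == "__main__":
    # Illustrative table name and paths only.
    print(hive_avro_ddl("faults", "/data/raw/faults", "hdfs:///schemas/faults.avsc"))
```

Because the table is external, dropping it in Hive leaves the imported files in place, which suits data that other jobs also read.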
In summary, Skool automates Hadoop data engineering by taking a property/configuration file as input to do the following:
- Validate all user-provided details
- Connect to the source database
- Do an import (for one record) as a check
- Generate all required files and push them to HDFS
- Generate a workflow to be scheduled in Oozie
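The first step in the list above, validating user-provided details, can be sketched roughly as follows. This is an illustration only: the property keys (source.jdbc.url, source.table, target.hdfs.dir) are invented for the example and are not the keys used in Skool's configuration.properties.

```python
# Hypothetical property names, chosen only for this example.
REQUIRED_KEYS = ("source.jdbc.url", "source.table", "target.hdfs.dir")


def parse_properties(text: str) -> dict:
    """Parse simple key=value lines, skipping blanks and comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:  # ignore lines that have no '=' at all
            props[key.strip()] = value.strip()
    return props


def validate(props: dict) -> list:
    """Return a list of human-readable problems; an empty list means OK."""
    errors = [f"missing property: {k}" for k in REQUIRED_KEYS if not props.get(k)]
    url = props.get("source.jdbc.url", "")
    if url and not url.startswith("jdbc:"):
        errors.append("source.jdbc.url must start with 'jdbc:'")
    return errors


if __name__ == "__main__":
    sample = "# demo\nsource.jdbc.url=jdbc:mysql://db/x\nsource.table = faults\n"
    print(validate(parse_properties(sample)))
```

Collecting every problem in one pass, rather than failing on the first, lets a user fix the whole configuration file before any connection to the source database is attempted.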
Currently, Skool uses Sqoop/HttpFS for data transfer and supports HDFS for Hadoop storage; for query execution, it relies on Hive and Apache Impala (incubating). In future releases, Skool will offer Apache Kudu as a storage option and Apache Spark as a query-execution option.
The diagram below illustrates the loosely coupled nature of Skool components. At a high level, everything is governed by the Skool configuration file, which is used across all independent modules. Running Skool automatically generates all scripts and required files, as well as the Oozie coordinator and workflow XML that in turn perform milestone/incremental pulls as Apache Avro data files.
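For readers unfamiliar with Oozie, the sketch below shows what generating a coordinator definition might look like. It is an assumption-laden illustration rather than Skool's generator: the function name, application name, dates, and workflow path are made up, and only the basic elements of the Oozie coordinator schema (version 0.4) are used.

```python
import xml.etree.ElementTree as ET


def coordinator_xml(name, workflow_path, start, end, frequency="${coord:days(1)}"):
    """Return a minimal Oozie coordinator-app document as a string.

    Hypothetical helper; a real coordinator usually also carries
    datasets, controls, and workflow configuration properties.
    """
    app = ET.Element("coordinator-app", {
        "name": name,
        "frequency": frequency,  # Oozie EL expression, evaluated by Oozie
        "start": start,
        "end": end,
        "timezone": "UTC",
        "xmlns": "uri:oozie:coordinator:0.4",
    })
    action = ET.SubElement(app, "action")
    workflow = ET.SubElement(action, "workflow")
    # Path in HDFS of the workflow application that the coordinator triggers.
    ET.SubElement(workflow, "app-path").text = workflow_path
    return ET.tostring(app, encoding="unicode")


if __name__ == "__main__":
    print(coordinator_xml("faults-daily", "${nameNode}/apps/faults",
                          "2016-01-01T00:00Z", "2017-01-01T00:00Z"))
```

Changing the frequency expression (for example, to an hourly one) is what would distinguish pulls with different latencies in a setup like this.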
Other Skool features include:
- Automatic or scheduled execution of delta and milestone replication jobs with defined frequency of data refresh
- Ability to configure for selection of tables/columns/files that are to be transferred in/out of HDFS
- Automatic performance optimization based on table size, database partitions, file formats, and compression
- Support for audit tables/lineage
- Out-of-the-box support for use with Kerberized clusters (CDH 5.5 and later)
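The difference between a milestone pull and a delta pull can be illustrated with the Sqoop command line each might produce. This sketch assumes a milestone pull is a full-table import and a delta pull is a Sqoop append-mode incremental import; the function and its parameters are hypothetical, credentials are left out for brevity, and the commands Skool actually generates may differ.

```python
def sqoop_import_args(jdbc_url, table, target_dir, check_column=None, last_value=None):
    """Build a Sqoop import command as an argument list.

    With no check_column the whole table is imported (treated here as a
    milestone pull); with one, only rows whose check_column exceeds
    last_value are imported (a delta pull).
    """
    args = [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--table", table,
        "--target-dir", target_dir,
        "--as-avrodatafile",  # write Avro data files, as the post describes
    ]
    if check_column is not None:
        args += [
            "--incremental", "append",
            "--check-column", check_column,
            "--last-value", str(last_value),
        ]
    return args


if __name__ == "__main__":
    # Illustrative connection string, table, and directory.
    print(" ".join(sqoop_import_args("jdbc:mysql://db/x", "faults", "/data/faults")))
    print(" ".join(sqoop_import_args("jdbc:mysql://db/x", "faults", "/data/faults", "id", 100)))
```

Returning a list rather than a single string makes it safe to hand the command to a process launcher without worrying about shell quoting.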
Installing Skool is easy. First, create a directory and, inside it, do:
git clone https://github.com/BTPlc/Skool.git
Copy the configuration.properties.template and log4j.properties.template files and rename them configuration.properties and log4j.properties, respectively. To edit configuration.properties per your cluster specifications, follow the comments in the template file (configuration.properties.template). Then, do:
mvn install -Dmaven.test.skip=true (or mvn install, to run the tests as well)
cp target/libs/* ../../libs
cp target/dataintegration-0.0.1-SNAPSHOT.jar ../../
cp configuration/configuration.properties ../../configuration/
cp Skool/di_tool_runnable/* configuration/
mv configuration/run.sh .
ojdbc6-184.108.40.206.jar file is required in the
Today, Skool is running in production at BT as its “strategic” data loading tool. By open sourcing Skool, we hope to involve new contributors toward the goal of building out new widely useful functionality such as:
- Support for integration with other RDBMS sources, such as IBM Netezza/DB2, and with NoSQL sources such as Apache Cassandra, MongoDB, and so on
- Ability to process flat-file formats such as binary and XML (currently, Skool supports only delimited files)
- Support for Hadoop storage file formats such as Apache Parquet, RC, and Apache ORC (currently, Skool supports only Avro)
If you want to report a bug, see or request a feature, or work on something else, please get involved via GitHub or write to us at SKOOL.SUPPORT@bt.com.
Nitin Goyal is Platform-Director, Revenue Assurance & Big Data COE, at BT.
Anup Goel is Engineering & In-life Lead, Revenue Assurance & Big Data Centre Of Excellence (CoE), at BT.
Sangamesh Gugwad is Lead Consultant, Big Data COE at BT.
Manish Bajaj is a Data Engineer at BT.
Abhinav Meghmala is a Big Data Consultant/Developer at BT.
Prabhu Om is a Big Data Analyst at BT.