How-to: Ingest Data Quickly Using the Kite CLI

Categories: Guest, How-to, Kite SDK

Thanks to Ben Harden of CapTech for allowing us to re-publish the post below.

Getting delimited flat file data ingested into Apache Hadoop and ready for use is a tedious task, especially when you want to take advantage of file compression, partitioning, and the performance gains that come from using the Avro and Parquet file formats.

In general, you have to go through the following steps to move data from a local file system to HDFS.

  1. Move data into HDFS. If you have a raw file, you can use the command line; if you’re pulling from a relational source, I recommend a tool like Apache Sqoop to easily land the data and automatically create a schema and Hive table. (See the sketch after this list.)
  2. Describe and document your schema as an Avro-compatible JSON schema. If you ingested the data using Sqoop, you’re in luck, because the schema is already available to you in the Hive Metastore. If not, you need to create the schema definition by hand.
  3. Define the partitioning strategy.
  4. Write a program to convert your data to Avro or Parquet.
  5. Using the schema created in step 2 and the converted file created in step 4, you can now create a table in Hive and use HQL to view the data.
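For example, step 1 alone might look like the following; the host, paths, and file names here are illustrative rather than taken from the post.

    # Raw file via the command line: stage a local CSV in HDFS.
    hdfs dfs -mkdir -p /landing/baseball
    hdfs dfs -put Batting.csv /landing/baseball/

    # Relational source via Sqoop: land the data and create the Hive table in one step.
    sqoop import \
      --connect jdbc:mysql://dbhost/stats --username stats_user -P \
      --table batting \
      --hive-import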

Going through those steps to ingest a large amount of new data can get time-consuming and very tedious. Fortunately, the Kite SDK and its associated command-line interface (CLI) make the process much easier.

I’m not a Java developer, so I opted to use the CLI to bulk load my data into HDFS and expose it via Hive. In this example, I used a comma-delimited set of 25 baseball statistics data files, with data dating back to 1893.

Here are the steps I went through to quickly ingest this data into HDFS using Kite.

  1. Install the Kite CLI. (A rough sketch of the commands for these steps appears after this list.)
  2. Create a folder called ingest, then download and unzip the baseball statistics data into it.
  3. Create a shell script named ingestHive.sh to ingest the files.
  4. Make the script executable and run it.
  5. All of the data is now ingested into HDFS in compressed Avro format, and the tables are created in Hive.
  6. Confirm that the tables exist in Hive.
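In outline, the commands look something like the following; the download URL, version number, and dataset names are illustrative, and the loop is only a sketch of what ingestHive.sh might contain, not the exact script from the post.

    # Step 1: install the Kite CLI. It ships as a self-executing jar that you
    # download and mark executable (the URL and version here are illustrative).
    curl -L http://central.maven.org/maven2/org/kitesdk/kite-tools/0.17.1/kite-tools-0.17.1-binary.jar -o kite-dataset
    chmod +x kite-dataset

    # Step 3: ingestHive.sh contains a loop along these lines. For each CSV in the
    # ingest folder, infer an Avro schema, create a Hive-backed dataset named after
    # the file, and import the rows (assumes kite-dataset is in the current directory).
    for f in *.csv; do
      name=$(basename "$f" .csv | tr '[:upper:]' '[:lower:]')
      ./kite-dataset csv-schema "$f" --class "$name" -o "$name.avsc"
      ./kite-dataset create "dataset:hive:default/$name" --schema "$name.avsc"
      ./kite-dataset csv-import "$f" "dataset:hive:default/$name"
    done

    # Step 4: make the script executable and run it.
    chmod +x ingestHive.sh
    ./ingestHive.sh

    # Step 6: confirm the tables exist in Hive.
    hive -e "SHOW TABLES;"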

I can use the same technique to just define a schema and load to HDFS directly, without creating a Hive table. This is useful if the processing I want to do will not require the Hive Metastore. As an example, I modified the above script to create a set of Parquet files in HDFS; a sketch of the change follows.
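The HDFS path and dataset names here are illustrative; the key changes are the dataset:hdfs: URI and the --format parquet option, which bypass the Hive Metastore entirely:

    # Write each dataset directly to an HDFS path as Parquet, with no Hive table.
    for f in *.csv; do
      name=$(basename "$f" .csv | tr '[:upper:]' '[:lower:]')
      ./kite-dataset csv-schema "$f" --class "$name" -o "$name.avsc"
      ./kite-dataset create "dataset:hdfs:/data/baseball/$name" --schema "$name.avsc" --format parquet
      ./kite-dataset csv-import "$f" "dataset:hdfs:/data/baseball/$name"
    done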

As you can see, using Kite makes the process of ingesting, converting, and publishing to Hive easy. A fairly simple ingest engine could be built using the above techniques to monitor files landing on an edge node and, as they are received, automatically ingest, convert, partition, and publish the data to the Hive Metastore and HDFS.
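For example, a minimal watcher along these lines (assuming inotify-tools is available and reusing the illustrative commands from above) could drive that flow:

    # Watch a landing directory and ingest each CSV as it finishes writing.
    # Paths are illustrative; requires inotify-tools and the kite-dataset binary.
    LANDING=/landing/baseball
    inotifywait -m -e close_write --format '%f' "$LANDING" | while read -r f; do
      case "$f" in
        *.csv)
          name=$(basename "$f" .csv | tr '[:upper:]' '[:lower:]')
          ./kite-dataset csv-schema "$LANDING/$f" --class "$name" -o "$name.avsc"
          ./kite-dataset create "dataset:hive:default/$name" --schema "$name.avsc"
          ./kite-dataset csv-import "$LANDING/$f" "dataset:hive:default/$name"
          ;;
      esac
    done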

Ben Harden leads the Big Data Practice at CapTech and has over 17 years of enterprise software development experience including project management, requirements gathering, functional design, technical design, development, training, testing and system implementation.


2 responses on “How-to: Ingest Data Quickly Using the Kite CLI”

  1. Orrin Edenfield

    I thought that if you have a delimited flat file and know the schema of the data, then it’s pretty simple to create a table in the Hive Metastore using three HiveQL statements. 1. Create an external table statement with the schema of the delimited flat file and the HDFS storage location. 2. Create the empty Parquet storage table using CREATE ParquetTable LIKE TextFileTable STORED AS PARQUET;. 3. Load into the Parquet table with an INSERT OVERWRITE TABLE ParquetTable SELECT * FROM TextFileTable; command.

    Avro may be different, and you’d still need to define your partitioning strategy within those create table statements, but I’ve used the three steps above with moderate success.
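    Spelled out with made-up table and column names, those three statements might look roughly like this:

        hive -e "
        CREATE EXTERNAL TABLE batting_text (playerid STRING, yearid INT, hr INT)
        ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
        STORED AS TEXTFILE
        LOCATION '/landing/baseball/batting';

        CREATE TABLE batting_parquet LIKE batting_text STORED AS PARQUET;

        INSERT OVERWRITE TABLE batting_parquet SELECT * FROM batting_text;
        "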

  2. Mark Kidwell

    There are a few good reasons to use Kite even for simpler cases:
    1. Kite can process CSVs that can’t be parsed directly by Hive, for example, RFC 4180-compatible files that contain embedded newlines and would cause a split with text input formats. HIVE-8630 concerns one of the issues with Hive’s input format and splitting; there may be others needing work in that area.
    2. Until Hive 0.14, it wasn’t possible to create an Avro-backed table without also specifying a schema file. Kite can infer a schema from a CSV file’s header row and input data and generate an Avro schema file automatically for use with the ingestion operation.
    3. Schema inference isn’t perfect, but it can still be helpful as a standalone feature when creating a new object: generate a schema, review and update it, then use the updated version for further ingestion.