How-to: Use Kite SDK to Easily Store and Configure Data in Apache Hadoop

Organizing your data inside Hadoop doesn’t have to be hard — Kite SDK helps you try out new data configurations quickly in either HDFS or HBase.

Kite SDK is a Cloudera-sponsored open source project that makes it easier for you to build applications on top of Apache Hadoop. Its premise is that you shouldn’t need to know how Hadoop works to build your application on it, even though that’s an unfortunately common requirement today (because the Hadoop APIs are low-level; all you get is a filesystem and whatever else you can dream up — well, code up).

The goal of Kite’s Data module (which is used by Spring XD’s HDFS Dataset sink to store payload data in Apache Avro or Apache Parquet [incubating] format) is to simplify this process by adding a higher-level compatibility layer. The result is much more similar to how you would work with a traditional database: you describe the records you need to store, provide some basic information about how queries will look, and start streaming records into a data set.

Kite’s API lets you work in terms of data sets, views of data, and records rather than low-level storage. And interacting with a dataset is the same whether it’s stored as data files in HDFS or as records in a collection of Apache HBase cells.

In this post, you’ll learn how to use Kite to store a CSV data set in HDFS as well as Apache HBase (using the same commands), without knowing low-level details like file splits, formats, or serialization!

Using the Kite CLI

Until recently, Kite’s data support could be used only as an API. But version 0.13.0 introduced a command-line interface (CLI) that handles mundane tasks so that you can spend your time solving more interesting business problems.

To get started, download the command-line interface jar. All the dependencies you’ll need when you’re running on a cluster are included in the executable jar. You’ll run the commands on the Cloudera QuickStart VM, which you can download and use to follow along if you don’t already have a cluster. If you’re following along, download the jar and make it executable.
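
With the 0.13.0 CLI, the steps look roughly like this (the jar URL is a placeholder for the actual download link):

    # Fetch the Kite CLI jar; substitute the real download URL for the release you want
    curl -L <kite-cli-jar-url> -o dataset
    # The jar is built to run directly, so just add the execute bit
    chmod +x dataset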

 

First, run the dataset command without any arguments. You’ll see a usage message that outlines the available commands and what they do.
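
For example:

    # Print usage information listing the available subcommands
    ./dataset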

 

Next, you are going to use a few of these commands to create both datasets. The example data is a set of movie ratings from GroupLens. You’ll be using the small set, whose movie file has about 1,700 records, but the process is the same for a dataset of any size.

First, download and unzip the archive, and then rename the file u.item to movies.psv to make it easier to keep track of the relevant content. This is a pipe-separated file with general information on a collection of movies.
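
Something along these lines does the job (the archive URL is a placeholder for the GroupLens download link, and u.item may sit inside the directory the archive extracts to):

    # Download and unpack the GroupLens archive; substitute the real URL
    curl -L <grouplens-archive-url> -o ml-data.zip
    unzip ml-data.zip
    # Rename the movie metadata file so its purpose is obvious
    mv u.item movies.psv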

 

Show the first line of the file to see what this data looks like:
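
    # Print just the first record of the pipe-separated file
    head -n 1 movies.psv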

 

To load this data, you need to tell Kite what it looks like using a schema file. A schema is like a table description and tells Kite what fields are in the data and the type of each field. You can have Kite generate that schema by inspecting the data, but you need to tell it what the fields are because the movies data has no header.

Open the movies.psv file with your favorite editor and add a header line with names for the first few columns; columns without header names will be ignored. With a header along the lines of the one shown here, your data file should look like this:
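
    id|title|release_date|video_release_date|imdb_url
    1|Toy Story (1995)|01-Jan-1995||http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)|0|0|0|1|1|...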

 

Now you can use Kite’s csv-schema command to generate a schema. This requires a couple of extra arguments besides the sample data file: --class passes a name for the record schema being created, and --delimiter sets the field separator character, which here is a pipe. (Note in the generated schema that Kite has inspected the sample data and identified the id field as a long. It also assumes that any field may be missing, so each field’s type is a union that allows null.)
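
The command looks like this, and the generated schema will be along these lines:

    # Infer an Avro schema from the sample data, using the header line for field names
    ./dataset csv-schema movies.psv --class Movie --delimiter '|'

    {
      "type": "record",
      "name": "Movie",
      "fields": [
        {"name": "id", "type": ["null", "long"]},
        {"name": "title", "type": ["null", "string"]},
        {"name": "release_date", "type": ["null", "string"]},
        {"name": "video_release_date", "type": ["null", "string"]},
        {"name": "imdb_url", "type": ["null", "string"]}
      ]
    }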

 

Save this schema to a file so you can pass it to Kite when you create the movies dataset. The --output option will save it directly to HDFS.
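
For example, writing the schema to the same HDFS path that the create step below uses:

    # Generate the schema again, this time saving it to HDFS instead of printing it
    ./dataset csv-schema movies.psv --class Movie --delimiter '|' \
        --output hdfs:/user/cloudera/schemas/movie.avsc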

 

Creating the Dataset

Now that you have a description of the movie data, you’re ready to create a dataset. You will use the create command, passing a name for the dataset, movies, and the path to the newly created schema.
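
For example:

    # Create a dataset named movies from the schema stored in HDFS
    ./dataset create movies --schema hdfs:/user/cloudera/schemas/movie.avsc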

 

To confirm that it worked, view the schema of the dataset you just created using the schema command.
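
That looks like this:

    # Print the schema Kite stored for the movies dataset
    ./dataset schema movies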

 

You should see the same schema that you just created. Now you’re ready to load some data using the csv-import command. Because you’ve already configured the schema, Kite knows how to store the data; you just need to tell it that the file is pipe-delimited, just as you did when inspecting it to build the schema.
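
The import command looks roughly like this, with the same --delimiter flag as before:

    # Import the pipe-delimited records from movies.psv into the movies dataset
    ./dataset csv-import movies.psv movies --delimiter '|'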

 

Now you have a data set of 1,682 movies! The data is available to Apache Hive and Impala, and you can use the show command to list the first 10:
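
    # Print the first 10 records of the movies dataset
    ./dataset show movies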

 

Moving to HBase

So far, you’ve created a dataset stored in HDFS, but that’s not very new. Lots of tools can do that, but Kite is different because tools built with Kite can work equally well with other storage engines. Next, you will build the same dataset in HBase, with just a little extra configuration.

Creating the movies dataset in HBase uses the same commands, but you need two extra bits of configuration: a column mapping and a partition strategy. A column mapping tells Kite where to store record fields in HBase, which stores values in cells that are identified by column family and qualifier. Here is an example column mapping that stores the “title” field for each movie as a column with family “m” and qualifier “title” (the JSON follows Kite’s column-mapping syntax; the exact property names may differ slightly between versions):
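
    {"source": "title", "type": "column", "family": "m", "qualifier": "title"}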

 

Each column needs a mapping in the schema’s “mapping” section. Kite doesn’t yet help generate this configuration, so you can download a finished version of the schema, movie-hbase.avsc.
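
The downloadable file isn’t reproduced here, but an approximate sketch of what movie-hbase.avsc looks like, with each field mapped into the “m” column family and the id field mapped into the key (the key mapping and exact layout are assumptions, not the exact file you download), is:

    {
      "type": "record",
      "name": "Movie",
      "fields": [
        {"name": "id", "type": "long"},
        {"name": "title", "type": ["null", "string"]},
        {"name": "release_date", "type": ["null", "string"]},
        {"name": "video_release_date", "type": ["null", "string"]},
        {"name": "imdb_url", "type": ["null", "string"]}
      ],
      "mapping": [
        {"source": "id", "type": "key"},
        {"source": "title", "type": "column", "family": "m", "qualifier": "title"},
        {"source": "release_date", "type": "column", "family": "m", "qualifier": "release_date"},
        {"source": "video_release_date", "type": "column", "family": "m", "qualifier": "video_release_date"},
        {"source": "imdb_url", "type": "column", "family": "m", "qualifier": "imdb_url"}
      ]
    }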

 

Next, you need a partition strategy: a configuration file that tells Kite how to store records. Kite uses the partition strategy to build a storage key for each record and then organizes the dataset by those keys. For HDFS, keys identify the directory, or partition, where the data is stored, just as in Hive. In HBase, a key uniquely identifies a record, and HBase groups records into files as needed.

Kite can help create partition strategies, too, using the partition-config command, which turns a series of source:type pairs into a formatted strategy. For this example, though, the storage key uses just one field: the record id, copied into the key.
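
With the CLI, that looks something like this (the id:copy pair and the --schema flag are an approximation of the pair syntax; the intent is that the id field is copied into the key):

    # Build a partition strategy from a source:type pair and print it
    ./dataset partition-config id:copy --schema movie-hbase.avsc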

 

Because the source field for a partition is required, one other minor change in the movie-hbase.avsc file is that the id field no longer allows null.

Use --output again to save this to the local filesystem as id.json.
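
For example, repeating the command above with --output:

    # Save the partition strategy to the local filesystem as id.json
    ./dataset partition-config id:copy --schema movie-hbase.avsc --output id.json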

 

Loading the Data into HBase

Now that you have a schema with column mappings and a partition strategy file, you are ready to use the create command and load data. The only difference is that you add --partition-by to pass the partition config and --use-hbase to store data in HBase.
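
Assuming the movie-hbase.avsc and id.json files from the previous steps are in your working directory, the commands look roughly like this (whether csv-import also needs the --use-hbase flag is an assumption here):

    # Create the movies dataset in HBase instead of HDFS
    ./dataset create movies --schema movie-hbase.avsc --partition-by id.json --use-hbase
    # Load the same pipe-delimited file into the HBase-backed dataset
    ./dataset csv-import movies.psv movies --delimiter '|' --use-hbase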

 

Conclusion

One of the hardest problems when moving data to Hadoop is figuring out how you want to organize it. What I’ve just demonstrated — loading data into both Hive and HBase with variations of the same commands — helps solve that problem by enabling you to try out new configurations quickly, even between very different storage systems. That’s more time to spend running performance tests, rather than debugging proof-of-concept code.

Ryan Blue is a Software Engineer at Cloudera, currently working on the Kite SDK.

7 Responses
  • Thanigai / June 06, 2014 / 6:17 PM

    This looks promising. Is there any documentation on how to use the Kite Dataset sink to be used with Flume?

  • Marco Shaw / June 18, 2014 / 8:23 AM

    Odd this dataset jar is not part of the Kite SDK? From what I remember, Kite is now included with the CDH 5.x Express VMs.

  • Ryan Blue / June 18, 2014 / 10:01 AM

    Marco, this is part of the Kite SDK, just not in the version included in CDH5. Kite’s data module is evolving quickly in the upstream community, and we won’t necessarily be able to backport the new features to CDH5 because of its strong compatibility guarantees. So we recommend people use the upstream version, which is easy because it’s a user library. And anyone that needs the compatibility guarantees can use the version bundled in CDH (the on-disk formats and layouts have not changed).

  • arbi / September 08, 2014 / 2:53 AM

    When calling ./dataset create movies --schema hdfs:/user/cloudera/schemas/movie.avsc I’m having following exception on a hadoop-2.5.0 cluster:
    Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/hive/metastore/api/UnknownDBException
    at org.kitesdk.data.hcatalog.HCatalogMetadataProvider.getHcat(HCatalogMetadataProvider.java:50)
    at org.kitesdk.data.hcatalog.HCatalogMetadataProvider.exists(HCatalogMetadataProvider.java:86)
    at org.kitesdk.data.hcatalog.HCatalogManagedMetadataProvider.create(HCatalogManagedMetadataProvider.java:59)
    at org.kitesdk.data.hcatalog.HCatalogDatasetRepository.create(HCatalogDatasetRepository.java:64)
    at org.kitesdk.cli.commands.CreateDatasetCommand.run(CreateDatasetCommand.java:98)
    at org.kitesdk.cli.Main.run(Main.java:131)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.kitesdk.cli.Main.main(Main.java:183)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
    Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hive.metastore.api.UnknownDBException
    at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
    … 13 more

  • Asmath / October 02, 2014 / 7:49 PM

    Hi,

    I am trying to run the examples under qickstartvm 5.1 version but the logging examples throws error stating that host etc are missing. Is it mandatory for the examples to run on the version of 4.4? cant the new versions of quickstart vm take advantage of running kite cdk examples?

    Thanks,
    Asmath
