What’s New in Kite SDK 0.15.0?

Kite SDK’s new release contains improvements that make working with data easier.

Kite SDK, the open source toolset that helps developers build systems on the Apache Hadoop ecosystem, recently reached version 0.15.0. In this post, you’ll get an overview of several new features and bug fixes.

Working with Datasets by URI

The new Datasets class lets you work with datasets based on individual dataset URIs. Previously, you had to open a dataset repository even if you only needed to perform a single action, such as loading a dataset. In 0.15.0, the right repository is selected automatically based on the dataset URI, so most applications no longer need to work with repositories directly.

The old way:

 

The new way, 0.15.0 and later:

 

An added benefit of using a dataset URI rather than a repository URI is fewer configuration options. The Kite command-line tool now accepts a dataset URI in place of dataset names and repository options like --use-hdfs and --directory. The repository still defaults to Apache Hive if only names are used.

The old way:

 

The new way, 0.15.0 and later:

 

Dataset URIs are defined by the dataset implementations, but most are formed by appending a dataset name to the repository URI and changing the prefix to “dataset”. Here are the basic URI patterns:

  • Local FS – dataset:file:/<path>/<dataset-name>
  • HDFS – dataset:hdfs:/<path>/<dataset-name>
  • Hive (external) – dataset:hive:/<path>/<dataset-name>
  • Hive (managed) – dataset:hive?dataset=<dataset-name>
  • HBase – dataset:hbase:<zk-hosts>/<dataset-name>
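
The patterns above can be sketched as a small helper. This is a minimal, self-contained illustration that uses no Kite classes: the class name DatasetUriSketch and its method names are hypothetical, and the "repo:" repository prefix is an assumption made for the example. Only the URI shapes come from the list above.

```java
// Hypothetical helper for illustration only; it is not part of the Kite API.
final class DatasetUriSketch {

    // Local FS ("file"), HDFS ("hdfs"), and external Hive ("hive") share the
    // shape dataset:<scheme>:/<path>/<dataset-name>.
    public static String pathBased(String scheme, String path, String name) {
        // Drop leading and trailing slashes so separators are not doubled.
        String trimmed = path.replaceAll("^/+|/+$", "");
        String prefix = "dataset:" + scheme + ":/";
        return trimmed.isEmpty() ? prefix + name : prefix + trimmed + "/" + name;
    }

    // Managed Hive tables carry the name as a query parameter instead of a path.
    public static String managedHive(String name) {
        return "dataset:hive?dataset=" + name;
    }

    // HBase datasets are addressed by ZooKeeper hosts, then the dataset name.
    public static String hbase(String zkHosts, String name) {
        return "dataset:hbase:" + zkHosts + "/" + name;
    }

    // Assumption: repository URIs start with "repo:". This covers only the
    // path-based case (append the name, swap the prefix), not managed Hive.
    public static String fromRepositoryUri(String repoUri, String name) {
        if (!repoUri.startsWith("repo:")) {
            throw new IllegalArgumentException("Not a repository URI: " + repoUri);
        }
        String base = "dataset:" + repoUri.substring("repo:".length());
        return base.endsWith("/") ? base + name : base + "/" + name;
    }

    public static void main(String[] args) {
        System.out.println(pathBased("file", "/tmp/data", "users"));
        System.out.println(pathBased("hdfs", "/data", "users"));
        System.out.println(managedHive("users"));
        System.out.println(hbase("zk1,zk2", "users"));
        System.out.println(fromRepositoryUri("repo:hdfs:/data", "users"));
    }
}
```

The host names, paths, and dataset name in main are placeholders; substitute the values for your own cluster.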

This release also includes experimental support for view URIs, which will be expanded in the next release to support Apache Oozie integration.

Improved Configuration for MR and Apache Crunch Jobs

The MapReduce input and output formats now use dataset (or view) URIs and a configuration builder that is easier to read:

 

Crunch support received a similar update:

 

We’ve added a copy command to the CLI tool that can be used to bulk copy one dataset or view into another. By default, it compacts the dataset into one output file per partition, but it can also run map-only copies. Because it works with dataset URIs, the tool can copy between any two datasets, including datasets stored in HBase.

 

CSV imports will now use the same copy task and can import the data in parallel if the source CSV files are in HDFS.

Parent POM for Kite Applications

Kite 0.15.0 includes a Maven POM file that can be used to reduce the annoyance of managing Hadoop dependencies in Maven projects. You can use it by declaring it as your project’s parent POM. Your project then inherits a consistent set of dependencies for Kite, Hadoop, and other integrated projects from the Kite application POM, and those dependencies are updated whenever you change your Kite version.

 

The application parent POM also configures Kite and Apache Avro / Maven integration plugins that you can turn on by adding a four-line plugin entry to your build.

 

The POMs for the Kite examples now use the Kite parent POM, and are much smaller as a result. The demo application’s POM is a good example of using Kite to manage Hadoop dependencies, and adding just the application-specific dependencies in the app’s POM.

Java Class Hints

We’ve added Java class arguments to the load, create, and update methods in the API that return datasets. The current DatasetRepository behavior hasn’t changed, but in the Datasets API, the class argument is needed any time you use specific or reflected objects.

 

This is needed for Kite to ensure it can produce the requested class or throw a helpful error message. Before this fix, Kite would happily produce a generic record if it couldn’t load your specific class, which caused a confusing ClassCastException somewhere in your code instead of telling you what went wrong.

The new class hint argument also makes it possible to request a generic object even if your specific class is available, and it fixes a class loading bug that caused Avro to incorrectly load generic objects.
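
The fail-fast behavior described above can be illustrated without Kite. The sketch below is a hypothetical, simplified loader: the names ClassHintSketch, put, load, loadUnchecked, and User are invented for this example and are not Kite’s API, and a plain Map stands in for a generic record. It shows how a class argument lets the loader check the stored object up front instead of letting an unchecked cast fail later in the caller.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a dataset loader; not Kite's implementation.
final class ClassHintSketch {

    // Pretend store: dataset URI -> the object the "loader" was able to build.
    private static final Map<String, Object> STORE = new HashMap<>();

    // Stand-in for an application's specific record class.
    public static final class User {
        public final String name;
        public User(String name) { this.name = name; }
    }

    public static void put(String uri, Object record) {
        STORE.put(uri, record);
    }

    // No class hint: the unchecked cast succeeds here, and a mismatch only
    // surfaces later as a ClassCastException at the caller's assignment.
    @SuppressWarnings("unchecked")
    public static <E> E loadUnchecked(String uri) {
        return (E) STORE.get(uri);
    }

    // With a class hint: verify the type immediately and explain any mismatch.
    public static <E> E load(String uri, Class<E> type) {
        Object record = STORE.get(uri);
        if (record == null) {
            throw new IllegalArgumentException("No such dataset: " + uri);
        }
        if (!type.isInstance(record)) {
            throw new IllegalArgumentException("Cannot produce " + type.getName()
                + " from " + uri + ": found " + record.getClass().getName());
        }
        return type.cast(record);
    }

    public static void main(String[] args) {
        String uri = "dataset:hive?dataset=users";
        put(uri, new HashMap<String, Object>()); // a "generic record" was stored

        try {
            User user = ClassHintSketch.<User>loadUnchecked(uri);
            System.out.println(user.name);
        } catch (ClassCastException e) {
            System.out.println("Late failure in caller code: " + e);
        }

        try {
            load(uri, User.class);
        } catch (IllegalArgumentException e) {
            System.out.println("Early, descriptive failure: " + e.getMessage());
        }

        // Asking for the generic form explicitly still works.
        Map<?, ?> generic = load(uri, Map.class);
        System.out.println("Generic record loaded with " + generic.size() + " fields");
    }
}
```

The same class-token pattern is what makes it possible to ask for the generic representation on purpose, as the last call shows.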

More Docs and Tutorials

The last addition in this release is a new user guide on kitesdk.org, where we’re adding new tutorials and background articles. We’ve also updated the examples for the new features, and they are a great place to learn more about Kite.

Also, watch this technical webinar on-demand to learn more about working with datasets in Kite.

Ryan Blue is a Software Engineer at Cloudera.
