- by Aaron Kimball
- March 14, 2013
The following guest post is provided by Aaron Kimball, CTO of WibiData.
The Kiji ecosystem has grown with the addition of a new module, KijiMR. The Kiji framework is a collection of components that help developers build Big Data Applications. Following the first release, KijiSchema, we are proud to announce the availability of a second component: KijiMR. KijiMR lets KijiSchema users apply MapReduce techniques, including machine-learning algorithms and complex analytics, to build many kinds of applications on data stored in KijiSchema. Read on to learn more about the major features included in KijiMR and how you can use them.
KijiMR offers developers a set of new processing primitives designed explicitly for working with complex table-oriented data. The low-level batch interfaces that ship with MapReduce, the basic InputFormat and OutputFormat implementations, are designed for processing key-value pairs stored in flat files in HDFS. Integrating MapReduce with HBase through these raw APIs is tedious and error-prone to redo from scratch for every algorithm. In KijiMR, we have extended the available MapReduce APIs to include:
- Bulk importers, which load data from external sources (like files in HDFS) into tables managed by KijiSchema
- Gatherers, which scan over the columns of a table and emit key-value pairs for processing with a conventional MapReduce pipeline
- Producers, which implement computation functions that update individual rows in a Kiji table
Several features of KijiMR make developing MapReduce pipelines friendlier for developers. Using these specialized processing metaphors alongside conventional map and reduce makes it easier to focus on building applications:
- Command-line tools in Kiji support specific kinds of jobs, making it easier to launch and test MapReduce jobs without writing tedious boilerplate code. With flexible parameters, users can specify complex options like input and output data formats as well as data to include in the distributed cache.
- Builder APIs make it easier to programmatically construct MapReduce jobs.
- Key-value store lookups enable distributed jobs to efficiently perform map-side joins, foreign key lookups, and machine learning model scoring tasks.
- A library of stock implementations: bulk importers for a variety of formats, KeyValueStore implementations for common file formats, standard aggregation reducers, and more.
Producers and Gatherers
The core components of KijiMR are called producers and gatherers. A producer executes a function over a subset of the columns in a table row and writes its output back into a column of that row. Producers can be run in a MapReduce job that operates over a range of rows from a Kiji table. Common tasks for producers include parsing, profiling, recommending, predicting, and classifying. For example, you might run a LocationIPProducer to compute and store the location of each user in a new column, or a PersonalizationProfileProducer to compute a personalization profile.
A producer updates each row with new information in an output column.
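To make the producer pattern concrete, here is a minimal, self-contained Java sketch of the idea: read a column from a row, derive a new value, and write it back to an output column of the same row. The class, method, and column names below are illustrative assumptions, not the real KijiMR API, and the geo-IP "lookup" is faked for brevity.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the producer pattern: derive a value from one column of a row
// and write it back into another column of the same row.
// Names are hypothetical, not the actual KijiMR classes.
public class ProducerSketch {

    /** Minimal stand-in for a table row: a map from column name to value. */
    static Map<String, String> produceLocation(Map<String, String> row) {
        String ip = row.get("info:ip");
        // A real LocationIPProducer would consult a geo-IP database;
        // here we fake the classification for illustration.
        String location = (ip != null && ip.startsWith("192.168.")) ? "LAN" : "WAN";
        row.put("derived:location", location);  // write back into the same row
        return row;
    }

    public static void main(String[] args) {
        Map<String, String> row = new HashMap<>();
        row.put("info:ip", "192.168.0.7");
        produceLocation(row);
        System.out.println(row.get("derived:location"));  // LAN
    }
}
```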
A Kiji Gatherer scans over the rows of a Kiji table using the MapReduce framework and outputs key-value pairs. Gatherers are a flexible job type and can be used to extract or aggregate information into a variety of formats based on the output specification and reducer used. Common tasks for gatherers include calculating sums across an entire table, extracting features to train a model, and pivoting information from one table into another. You should use a gatherer when you need to pull data out of a Kiji table into another format or to feed it into a reducer.
A gatherer scans over a set of columns in each row of a Kiji table, emitting key-value pairs to a reducer.
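The gatherer-plus-reducer flow can be sketched the same way: scan a column across every row, emit key-value pairs, and aggregate them into a sum. For brevity this sketch collapses the emit and reduce steps into one loop; the names are hypothetical, not the real KijiMR API.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the gatherer pattern: scan one column across all rows of a
// table, emit ("clicks", n) pairs, and reduce them to a total.
// Names are hypothetical, not the actual KijiMR classes.
public class GathererSketch {

    static Map<String, Long> gatherAndReduce(List<Map<String, String>> rows) {
        Map<String, Long> totals = new HashMap<>();
        for (Map<String, String> row : rows) {
            long clicks = Long.parseLong(row.getOrDefault("info:clicks", "0"));
            // emit + reduce collapsed into one step for brevity
            totals.merge("clicks", clicks, Long::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        List<Map<String, String>> rows = List.of(
            Map.of("info:clicks", "3"),
            Map.of("info:clicks", "5"));
        System.out.println(gatherAndReduce(rows).get("clicks"));  // 8
    }
}
```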
Model scoring with KeyValueStores
KeyValueStores allow processing pipeline elements like producers and gatherers to load data from external sources, beyond the specific record being processed. Users can expose data sets as key-value stores through the KeyValueStore API; user programs then use a KeyValueStoreReader to look up the values associated with input keys. These input keys are typically defined by the records in the data set the user is processing with MapReduce. In conjunction with a KijiProducer, users can access KeyValueStores to apply the results of a trained machine learning model to their primary data set. The output of a machine learning model is often expressed as (key, value) pairs stored in files in HDFS, or in a secondary Kiji table. For each entry in a table, users can compute a new recommendation by applying the model to the information in the target row: a value in the user's row serves as a key into the key-value store representing the model, and the returned value is the recommendation.
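As a rough illustration of model scoring against a key-value store, the sketch below represents the trained model as an in-memory (key, recommendation) mapping and looks up each row's recommendation by a value in that row. The names and the column layout are assumptions for the example, not the real KijiMR API.

```java
import java.util.Map;

// Sketch of model scoring with a key-value store: the "model" is a
// (key -> recommendation) mapping; a value stored in the row is the key.
// Names are hypothetical, not the actual KijiMR classes.
public class ModelScoringSketch {

    static String score(Map<String, String> row, Map<String, String> model) {
        String segment = row.get("profile:segment");   // key into the model
        return model.getOrDefault(segment, "default-offer");
    }

    public static void main(String[] args) {
        Map<String, String> model = Map.of("gamer", "headset", "reader", "e-book");
        Map<String, String> row = Map.of("profile:segment", "gamer");
        System.out.println(score(row, model));  // headset
    }
}
```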
KeyValueStores also support ordinary map-side joins in a MapReduce program, e.g., for denormalizing data. The smaller data set is loaded into RAM in each map task as a KeyValueStore instance. For each record in the larger data set, the mapper looks up the corresponding record in the store and emits the concatenation of the two to the reducer.
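The map-side join itself can be sketched in a few lines: the small side of the join sits in memory as a map, and each record of the large side is joined against it during the map phase. The record shapes and names here are assumptions made for the example, not the real KijiMR API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch of a map-side join: the smaller data set (users) is held in memory
// while the larger data set (events) streams through the mapper.
// Names are hypothetical, not the actual KijiMR classes.
public class MapSideJoinSketch {

    /** Join each (userId, event) record with the user's name from the small table. */
    static List<String> join(List<String[]> events, Map<String, String> users) {
        List<String> out = new ArrayList<>();
        for (String[] e : events) {
            String name = users.getOrDefault(e[0], "unknown");
            out.add(name + ":" + e[1]);  // concatenated record, emitted to the reducer
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> users = Map.of("u1", "alice");
        List<String[]> events = List.of(new String[]{"u1", "login"});
        System.out.println(join(events, users));  // [alice:login]
    }
}
```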
Ready to Check Out KijiMR?
These features, as well as the ones provided in KijiSchema, are all available for download now. Get started with Kiji today by downloading a Bento Box. The Kiji Bento Box is a complete SDK for building Big Data Applications. All the components in Kiji are Apache 2.0-licensed open source. The Bento Box contains KijiSchema, KijiMR, developer tools, a standalone Hadoop/HBase instance, as well as example applications with source code that show how all the components fit together. The Kiji quick start guide can be completed in 15 minutes or less. For more information, see www.kiji.org.