*Thanks to Victor Bittorf, a visiting graduate computer science student at Stanford University, for the guest post below about how to use the new prebuilt analytic functions for Cloudera Impala.*

Cloudera Impala is an exciting project that unlocks interactive queries and SQL analytics on big data. Over the past few months I have been working with the Impala team to extend Impala’s analytic capabilities. Today I am happy to announce the availability of pre-built mathematical and statistical algorithms for the Impala community under a free open-source license. These pre-built algorithms combine recent theoretical techniques for shared nothing parallelization for analytics and the new user-defined aggregations (UDA) framework in Impala 1.2 in order to achieve big data scalability. This initial release has support for logistic regression, support vector machines (SVMs), and linear regression.

Having recently completed my masters degree while working in the database systems group at University of Madison Wisconsin, I’m excited to work with the Impala team on this project while I continue my research as a visiting student at Stanford. I’m going to go through some details about what we’ve implemented and how to use it.

As interest in data analytics increases, there is growing demand for deploying analytic algorithms in enterprise systems. One approach that has received much attention from researchers, engineers and data scientists is the integration of statistical data analysis into databases. One example of this is MADlib, which leverages the data-processing capabilities of an RDBMS to analyze data.

## Via model averaging, each Impala node can train a model using only the data that already resides there. |

UDAs have emerged as a popular way to implement analytic algorithms in an RDBMS. Many popular approaches to data analysis, such as stochastic gradient descent (SGD), train analytic models in an iterative way, which fits the UDA pattern of initialize, update, and finalize. This style of analytics was used by Bismarck as a way to unify in-RDBMS analytics. These kinds of algorithms incrementally improve a model by processing one tuple at a time while scanning over the database.

Since Impala allows us to push computation to the data, we employ recent academic work to distribute our computation. By using model averaging, each node in Impala can train a model using only the data that already resides on that node — shared-nothing analytics. Once all of the nodes have finished training their models, all of the models are collected and averaged together. These insights easily enable Impala to train and evaluate analytic models over data stored in Hadoop.

This package uses UDAs and UDFs when training and evaluating analytic models. While all of these tasks can be done in pure SQL using the Impala shell, we’ve put together some front-end scripts to streamline the process. The source code for the UDAs, UDFs, and scripts are all on GitHub.

We’ll go through some examples of data analytics using Cloudera’s Impala to demonstrate just how easy it is to use. First, I’ll explain how to interact with arrays and models using the Impala shell. Then, you’ll learn how to use simple command-line scripts to model classification tasks. This is a quick and easy way to get started with analytics in Impala. The wiki on GitHub has detailed documentation for developers.

## Arrays

Arrays of doubles are stored as byte strings in Impala. The UDAs and UDFs for training and evaluating models operate on the binary representation of the arrays. For convenience, we’ve put together some UDFs to convert arrays to and from ASCII printable text.

Constructing arrays is easy:

1 2 3 4 |
> SELECT printarray(toarray(1, 2, 3)); +--------------------------------------------------+ | <1, 2, 3> | +--------------------------------------------------+ |

The UDF toarray will create a byte string that represents the array. The UDF printarray converts this byte string to a human-readable string.

While Impala supports storage of binary strings, it can be useful to ASCII armor these arrays when working with them in a shell:

1 2 3 4 |
> SELECT encodearray(toarray(1, 2, 3)); +---------------------------------------------------+ | aaaaaipdaaaaaaaeaaaaaeae | +---------------------------------------------------+ |

This encoding of an array can be easily converted back to a binary encoding:

1 2 3 4 |
> SELECT printarray(decodearray('aaaaaipdaaaaaaaeaaaaaeae')); +--------------------------------------------------------------+ | <1, 2, 3> | +--------------------------------------------------------------+ |

This function lets you easily manage arrays using the Impala shell.

## Classification with SVM and Logistic Regression

Classification is a machine-learning task with many applications. Suppose a hospital is interested in preventing heart attacks. The hospital has a record of patients including risk factors such as anxiety and high blood pressure and whether they have had a heart attack. Classification techniques like SVMs and logistic regression can be used to predict the chance of a new patient having a heart attack based on these risk factors.

In this case, the training data would be a feature array of the risk factors and a Boolean label indicating if a patient with those risk factors has had a heart attack. After training the SVM, you can use the resulting model to predict if a new patient is at risk.

While a detailed explanation can be helpful (see links below), it is easy to get started.

– en.wikipedia.org/wiki/Support_vector_machine

– en.wikipedia.org/wiki/Logistic_regression

Suppose you have a table:

1 |
CREATE TABLE info (lbl boolean, e0 double, e1 double); |

Populate it with some feature vectors and labels:

1 2 |
INSERT INTO info VALUES (True, -3.33139160117, 2.16414460145), (True, -1.18726873679, 0.295350800223), (False, 3.60399430867, 1.45138655128); |

By executing a front-end Python script, you can use Impala to train an SVM with this data:

1 |
python python/impala_svm.py lbl e0 e1 --db toysvm --table info -y mymodel -e 3 |

Using the Impala shell, you can see that a table called mymodel has been created and populated with an SVM model:

1 2 3 4 5 6 7 8 9 |
> SELECT iter, printarray(decodearray(model)) FROM mymodel; +------+--------------------------------+ | iter | printarray(decodearray(model)) | +------+--------------------------------+ | 3 | <-0.699068, -0.0608898> | | 2 | <-0.591917, -0.0875452> | | 0 | <> | | 1 | <-0.479126, -0.115604> | +------+--------------------------------+ |

By passing ‘-e 3’ to the script, you instructed it to run for three iterations over the table to train the SVM. The results of the iterations are stored in this table. Notice that the first iteration is NULL, meaning you started with an initialized model (which defaults to all zeros).

Using the model from the third iteration (which encodes as ”nbgpcdplpjhgjhnl’), you can see how well our SVM did at learning the training examples:

1 2 3 4 5 6 7 8 |
> SELECT lbl, svmpredict(decodearray('nbgpcdplpjhgjhnl'), toarray(e0, e1)) FROM info; +-------+--------------------------------------------------------------+ | lbl | svmpredict(decodearray('nbgpcdplpjhgjhnl'), toarray(e0, e1)) | +-------+--------------------------------------------------------------+ | true | true | | true | true | | false | false | +-------+--------------------------------------------------------------+ |

The SVM successfully predicts all the training examples.

The Python script used to train the SVM, python/impala_svm.py, has options for setting the step size, step decay, and regularizer. These parameters can vary depending on the dataset. A common approach is to use a grid search to try multiple values in parallel to determine which ones work the best.

Logistic regression is supported and is as easy to use as an SVM. There is a convenient front end called impala_logr.py and there is a UDF called logrpredict, which is analogous to svmpredict.

## Linear Regression

This release also has a port of linear regression from MADlib. Since this is not an iterative algorithm, you can run it easily from the SQL prompt:

1 2 3 4 5 6 |
> SELECT printarray(linr(toarray(x, y), z)) from lrt; +---------------------------------------------------------+ | toysvm.printarray(toysvm.linr(toysvm.toarray(x, y), z)) | +---------------------------------------------------------+ | <0.6, 1.2> | +---------------------------------------------------------+ |

This models the relationship between (x, y) and z using MADlib’s linear regression module.

You can use this linear regression model in prediction tasks:

1 2 3 4 5 6 7 8 |
> select x, y, z, linrpredict(decodearray('kjjjjbpdkjjjjjpd'), toarray(x, y)) from lrt; +---+---+---+-----------------------------------------------------------+ | x | y | z |linrpredict(decodearray('kjjjjbpdkjjjjjpd'),toarray(x, y)) | +---+---+---+-----------------------------------------------------------+ | 1 | 2 | 3 | 3.00000011920929 | | 2 | 4 | 6 | 6.000000238418579 | | 1 | 2 | 3 | 3.00000011920929 | +---+---+---+-----------------------------------------------------------+ |

## Conclusion

The above examples are just a few of the data analytic algorithms that can be implemented in Impala and RDBMSs. The first release of this package includes some of the most popular statistical models but many more can be implemented or ported from existing libraries.

**Further Reading**

- “Towards a Unified Architecture for in-RDBMS Analytics” (SIGMOD 2012), by Xixuan Feng, Arun Kumar, Ben Recht and Christopher Ré

*Victor Bittorf is a third-year graduate student studying computer science. As a member of the Hazy Research Group, he studies scalable data analytics under the direction of Christopher Ré. Victor’s work focuses on understanding the data management challenges that arise in large-scale data analysis problems.*

Interesting read, thanks.

I was wondering about the performance of such analytics.

Can it serve as an ad-hoc exploration real time processing, or is it more of a batch processing?