BigDL on CDH and Cloudera Data Science Workbench

Categories: CDH, How-to, Spark

Introduction

As companies strive to implement modern solutions based on deep learning frameworks, the need to deploy these frameworks on existing hardware infrastructure in a scalable and distributed manner comes to the fore. Recognizing this need, Cloudera’s and Intel’s Big Data Technologies engineering teams jointly detail how to run Intel’s BigDL Apache Spark deep learning library on the latest release of Cloudera’s Data Science Workbench. This collaborative effort allows customers to build new deep learning applications with the BigDL Spark library by leveraging the existing homogeneous compute capacity of Xeon servers running Cloudera Enterprise, without having to invest in expensive GPU farms or bring up parallel frameworks such as TensorFlow or Caffe.

We suggest reading our previous blog post for a quick overview of other deep learning frameworks on Cloudera’s Data Science Workbench.

BigDL – A Distributed Deep Learning Library for Apache Spark*


Github: github.com/intel-analytics/BigDL  

http://software.intel.com/ai

In 2016, Intel open sourced BigDL, a distributed deep learning library for Apache Spark* (BigDL Github). BigDL integrates natively into Spark, supports popular neural network topologies, achieves feature parity with other open-source deep learning frameworks, and provides a Python API.

While providing high-level control variables such as the number of compute nodes, cores, and batch size, a BigDL application leverages the stable CDH Spark infrastructure for node communication and resource management during its execution. BigDL applications can be written in either Python or Scala and achieve high performance through both algorithm optimization and tight integration with Intel’s Math Kernel Library (MKL). Check out Intel’s BigDL portal for more details.
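As an illustration of how these control variables relate, BigDL expects the total batch size to be a multiple of the number of executors times the cores per executor; the short sketch below, with assumed numbers, makes that relationship explicit.

# Illustrative values only: tie the BigDL batch size to the Spark resources requested.
num_executors = 4        # assumed number of executors (nodes) for the job
cores_per_executor = 8   # assumed cores per executor
# BigDL expects the total batch size to be a multiple of executors x cores.
batch_size = num_executors * cores_per_executor * 4   # 128 samples per iteration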

Cloudera’s – Data Science Workbench

The Data Science Workbench is Cloudera’s latest offering in the area of machine learning application development. It is a web-based tool that allows data scientists to quickly develop and deploy machine learning applications at scale. It includes collaborative sharing and monitoring options, enabling team-based collaboration across geos and timezones.

It can operate either on-premises or across public clouds and is a great complement to the CDH platform. Cloudera Data Science Workbench provides a predictable, isolated file system and networking setup via Docker containers for R, Python, and Scala users. Users do not have to worry about which libraries are installed on the host or about port contention with other users’ processes, and admins do not have to worry about users adversely impacting the host or other users’ workloads.

In Cloudera Data Science Workbench, Docker provides admins with a convenient way to package R and Python libraries for their users via the extensible engines feature. This is particularly attractive because it lets us interact with the output, the model, and the code, all within the same cluster. It makes it very easy to analyze the results and tweak the model parameters to observe incremental changes.

Running BigDL programs in Data Science Workbench

It goes without saying that colocating a data processing pipeline with a deep learning framework makes data exploration and algorithm/model evolution much simpler, while at the same time consolidating data governance and lineage tracking.

Running BigDL programs from the workbench is pretty straightforward. We can upload the compiled jar and the BigDL Python API archive to our workbench and initialize the context using spark-defaults.conf. With the forthcoming BigDL release, the Maven artifacts can be used directly within the workbench.
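As a rough sketch, the project-level spark-defaults.conf might point Spark at the uploaded artifacts along the following lines; the file names and paths are placeholders rather than the exact artifacts of any particular BigDL release:

spark.jars            /home/cdsw/lib/bigdl-SPARK_2.1-jar-with-dependencies.jar
spark.submit.pyFiles  /home/cdsw/lib/bigdl-python-api.zip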

For this blog, we used the following:

  • OS: CentOS 7.2
  • JDK: 1.8.0_60
  • Python: 2.7.11 (Anaconda distribution)
  • CDH: 5.11.0
  • Spark: 2.1.0

The workbench admin console gives us a few ways to apply settings system-wide. The launched Docker session has a few directories and variables loaded by default: the parcels directory, the JAVA_HOME environment variable, Kerberos settings, and so on. If we have admin access, we can choose to mount additional user-readable folders; alternatively, we can use the per-project upload feature from the user settings.

We configure the admin settings to specify the mount directories for our Maven install as well as the PYSPARK_PYTHON environment variable. Once the session is launched, these directories and variables become part of the session.
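For illustration only, those settings might look like the following on a cluster that ships the Anaconda parcel; the Maven path is a placeholder:

  • Mount directory: /opt/apache-maven-3.3.9 (wherever Maven is installed on the hosts)
  • Environment variable: PYSPARK_PYTHON=/opt/cloudera/parcels/Anaconda/bin/python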

We initialize the project using the GitHub link where we have hosted the pre-compiled libraries/archives of interest. These were compiled and tested on CDH 5.11 and Cloudera Data Science Workbench 1.0, against JDK 1.8.0_60.

Create a New Project

The workbench gives us a convenient way to set environment variables across all sessions via the admin option, or per project. A screenshot of the admin settings is included below:

Setting Environment Variables

Downloading and building the BigDL jars

As an alternative to downloading a pre-compiled snapshot of the BigDL tools as described above, one can download and build the latest release of BigDL directly from its GitHub repository, as described below:

New Project

If a non-master GitHub branch (or tag) is desired, this can be easily accomplished via a simple series of commands issued from the CDSW command-line interface, for example:

!git clone https://github.com/intel-analytics/BigDL.git BigDL   # clone the repo into a local ‘BigDL’ directory
!cd BigDL && git checkout <branch-or-tag>                       # switch to the desired non-master branch or tag

 

Since we would like to be able to compile the library within our session, we need to increase the Permanent Generation (PermGen) space of the JVM used for the build.

CDSW gives us a convenient way to set environment variables per project session. See below:

Project Settings

Once the session is up and running, we use the terminal to compile and generate the libraries.

Once the compilation is done, our libraries of interest will be in the build output directory, as sketched below.
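A minimal sketch of the build, assuming the Maven-based make-dist.sh workflow used by BigDL releases of that era; the PermGen sizing, the profile name, and the output locations are assumptions and may differ for your BigDL version:

export MAVEN_OPTS="-XX:MaxPermSize=1G -Xmx2g"   # enlarge PermGen for the Maven JVM (assumed sizing)
cd BigDL
bash make-dist.sh -P spark_2.x                  # pick the profile matching your Spark version (assumed name)
# The artifacts of interest typically land under BigDL/dist/lib/, e.g.
#   bigdl-<version>-jar-with-dependencies.jar   (the Spark library)
#   bigdl-<version>-python-api.zip              (the Python API archive)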

Let’s get interactive with a CNN-based TextClassifier in the Data Science Workbench

We use Python 2.7.11 to run a CNN-based text classifier training model, reusing the example from the BigDL repository. For additional details about the BigDL-specific implementation, please refer to the source code or the BigDL Python documentation. There are a few settings that we need to take care of.

To achieve high performance, BigDL uses the Intel Math Kernel Library (MKL) and multithreaded programming. Therefore, we must first set the relevant environment variables.

Alternatively, we can pass these parameters using Spark. We use spark-defaults.conf to initialize the SparkContext with the relevant parameters. We also need to set these environment variables for the project so that they are visible in the environment of the current session.

Engine Settings
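The authoritative list of properties and variables ships with the BigDL build itself, so treat the following spark-defaults.conf additions only as an assumption of what the engine settings tend to look like, using standard OpenMP/MKL threading variables:

spark.executorEnv.OMP_NUM_THREADS     1
spark.executorEnv.KMP_BLOCKTIME       0
spark.executorEnv.OMP_WAIT_POLICY     passive
spark.shuffle.reduceLocality.enabled  false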

Since our Python version needs to be compatible with the Python version of the container, we can either use Spark to distribute the relevant archive or use the admin UI to set the environment variables, as shown earlier.

Since we are running our program in an interactive format, let’s start by building our Spark session. We will launch the session in client mode.
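A minimal sketch of that step, assuming the BigDL jar, Python zip, and YARN settings come from the spark-defaults.conf configured earlier (the application name is arbitrary):

from pyspark.sql import SparkSession

# Client-mode interactive session; cluster settings are assumed to come from spark-defaults.conf.
spark = (SparkSession.builder
         .appName("bigdl-text-classifier")   # hypothetical application name
         .getOrCreate())
sc = spark.sparkContext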

We further need to import the BigDL Python APIs. If all has gone well with our Spark session, these imports should come into the session without any errors.
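A sketch of those imports, assuming the package layout of the BigDL Python API in the releases this post was written against (module names may differ in later versions):

from bigdl.util.common import *      # engine initialization and Sample utilities
from bigdl.nn.layer import *         # layers such as Sequential, Linear, and convolutions
from bigdl.nn.criterion import *     # loss functions such as ClassNLLCriterion
from bigdl.optim.optimizer import *  # Optimizer, triggers, and validation metrics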

BigDL needs some environment variables to be set to perform optimally. The following call takes care of it:

init_engine()   # initializes BigDL's execution engine on the Spark cluster

This example uses a pre-trained GloVe embedding to convert words to vectors, and uses it to train a CNN, LSTM, or GRU text classification model on the 20 Newsgroups dataset with 20 different categories. In this subsection we load the data, prepare the sample set, split it into training and validation sets, and plot a sample of the data using wordcloud in our interactive session. Across different runs, /tmp will contain the pre-trained data; if it is empty, running the example will download the data there.

Word Vector
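For illustration, a rough sketch of this step is shown below; load_news20 and load_glove are hypothetical helpers standing in for the dataset utilities shipped with the BigDL example, while the word cloud plotting uses the standard wordcloud/matplotlib APIs:

# Hypothetical helpers: the real example ships its own 20 Newsgroups / GloVe
# download-and-parse utilities; these names are placeholders only.
texts_and_labels = load_news20("/tmp/news20")    # [(document_text, label), ...]
glove_w2v = load_glove("/tmp/glove", dim=100)    # {word: embedding vector}

# Split the data into training and validation sets as Spark RDDs.
data_rdd = sc.parallelize(texts_and_labels)
train_rdd, val_rdd = data_rdd.randomSplit([0.8, 0.2], seed=42)

# Plot a word cloud of one sample document in the interactive session.
from wordcloud import WordCloud
import matplotlib.pyplot as plt

wc = WordCloud(width=600, height=400).generate(texts_and_labels[0][0])
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()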

We use this subsection to build the model layer by layer. Refer to the layer.py file to understand the details of the different transformations, layers, and modules available via the BigDL Python API.
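As a minimal sketch, a CNN text classifier could be assembled with the BigDL Python API along the following lines; the layer choices and dimensions are assumptions for illustration, and the example's own model definition remains the authoritative reference:

sequence_len = 50     # assumed padded document length
embedding_dim = 100   # assumed GloVe embedding dimension
num_classes = 20      # 20 Newsgroups categories

def build_cnn_model():
    # Treat each embedded document as an embedding_dim x 1 x sequence_len volume.
    model = Sequential()
    model.add(Reshape([embedding_dim, 1, sequence_len]))
    model.add(SpatialConvolution(embedding_dim, 128, 5, 1))      # 128 feature maps, width-5 kernel
    model.add(ReLU())
    model.add(SpatialMaxPooling(sequence_len - 5 + 1, 1, 1, 1))  # pool over the full sequence
    model.add(Reshape([128]))
    model.add(Linear(128, 100))
    model.add(Linear(100, num_classes))
    model.add(LogSoftMax())
    return model

model = build_cnn_model()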

And finally, we configure an optimizer by defining the model, the data, a loss function, and several hyperparameters such as the learning rate and weight decay, and then launch the actual training.
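A sketch of that step, assuming the Optimizer API as it looked in early BigDL releases; the constructor arguments, the optimization method, and the hyperparameter values vary across versions and should be treated as assumptions:

batch_size = 128   # must be a multiple of executors x cores, as noted earlier

optimizer = Optimizer(
    model=model,                     # the CNN assembled above
    training_rdd=train_rdd,          # an RDD of BigDL Sample objects in the real example
    criterion=ClassNLLCriterion(),   # negative log-likelihood loss, paired with LogSoftMax
    optim_method=Adagrad(learningrate=0.01, learningrate_decay=0.0002),  # assumed hyperparameters
    end_trigger=MaxEpoch(2),         # stop after two epochs
    batch_size=batch_size)

# Track validation accuracy every epoch, then launch the distributed training.
optimizer.set_validation(
    batch_size=batch_size,
    val_rdd=val_rdd,
    trigger=EveryEpoch(),
    val_method=[Top1Accuracy()])
trained_model = optimizer.optimize()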

The CNN model can achieve around 96% accuracy after 2 or 3 epochs of training.

  • With 3 epochs, the CNN model's accuracy is around 95.6%.
  • With 3 epochs, the GRU model's accuracy is around 94.48%.
  • With 6 epochs, the LSTM model's accuracy is around 89.76%.

The confusion matrix for the CNN model looks as follows:

Confusion Matrix

To launch this as a self-contained PySpark program from within the workbench, we can use the following script to validate how the program works in true cluster mode.
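As an assumption of what such a launcher might look like, a spark-submit invocation for the example could resemble the following; all paths, file names, and resource numbers are placeholders, and the example script's own command-line options are omitted:

spark-submit \
  --master yarn \
  --num-executors 4 \
  --executor-cores 8 \
  --jars /home/cdsw/lib/bigdl-SPARK_2.1-jar-with-dependencies.jar \
  --py-files /home/cdsw/lib/bigdl-python-api.zip \
  textclassifier.py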

Conclusion

In this blog post, we have demonstrated that BigDL can be easily deployed on existing CDH clusters. Leveraging the BigDL Spark library on CDH, a user can easily write scalable, distributed deep learning applications within the familiar Spark infrastructure without intimate knowledge of the configuration of the underlying compute cluster.

We used a relatively simple Python text classifier example based on a CNN. It is available in the BigDL repo as an example and is a good place to start using the library.

If you have any questions regarding BigDL, you can raise them in the BigDL Google Group.

