Cloudera Data Science Workbench provides data scientists with secure access to enterprise data using Python, R, and Scala. In the previous article, we introduced how to use your favorite Python libraries on an Apache Spark cluster with PySpark. In the Python world, data scientists often want to use libraries, such as XGBoost, that include C/C++ extensions. This post shows how to solve this problem by creating a conda recipe with a C extension. The sample repository is here.
See the official documentation for more detail on conda packages. For C extensions in particular, this tutorial is a useful resource.
(Optional) Prepare Anaconda Docker image
We developed this recipe in an Anaconda Docker image. Using a Docker container keeps the environment isolated and helps ensure library compatibility, preventing trouble from mixing system-installed and Anaconda-installed libraries. macOS users, in particular, need to build inside a Docker container because the C libraries that ship with macOS are not compatible with what’s needed here.
Let’s prepare the Docker image from Anaconda. The following commands are executed on your physical machine.
$ docker pull continuumio/anaconda
$ docker run -i -v $(pwd):/root/mecab -t continuumio/anaconda /bin/bash
Note: Because of a glibc incompatibility, packages built in this Docker image will not work on older Linux distributions such as CentOS 6.
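If you are not sure which glibc your build environment uses, you can check it inside the container; binaries built against a newer glibc than the target system’s (CentOS 6 ships glibc 2.12, for example) will fail to load there:

$ ldd --version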
Prepare Conda Environment
According to the official documentation, GCC/G++ is required. If you don’t have any development tools, install the dependencies first. For this recipe, install the required packages as follows (e.g., on Debian/Ubuntu):
$ sudo apt-get install g++ autoconf make
Before you start development, install the conda build tool.
$ conda install conda-build
$ conda upgrade conda
$ conda upgrade conda-build
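You can confirm the build tool is available before moving on:

$ conda build --version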
Write Your Recipe
After creating the working directory, write `meta.yaml` to set the package name, version, source repository, and dependencies. In this case, we will make a package named `mecab`.
$ cd ~/
$ mkdir mecab
$ cd mecab
Example recipe:
package:
  name: mecab
  version: "0.996"

source:
  git_url: https://github.com/taku910/mecab
  git_rev: 32041d9504d11683ef80a6556173ff43f79d1268

build:
  number: 0

requirements:
  build:
    - g++
  run:
    - libgcc

about:
  home: http://taku910.github.io/mecab
  license: BSD,LGPL,GPL
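conda-build expects `meta.yaml` and the build script (described next) to sit side by side, so the recipe directory will end up looking like this:

mecab/
  meta.yaml
  build.sh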
Write Your Build Script
Write `build.sh` to compile your package.
Example script:
# Build MeCab
cd mecab
./configure --prefix=$PREFIX --with-charset=utf8
make
make install

# Build dictionary for MeCab
cd ../mecab-ipadic
./configure --with-mecab-config=$PREFIX/bin/mecab-config --prefix=$PREFIX --with-charset=utf8 --with-dicdir=$PREFIX/lib/mecab/dic/ipadic
make
make install

# Build MeCab Python wrapper
cd ../mecab/python
swig -python -shadow -c++ ../swig/MeCab.i
python setup.py build
python setup.py install --prefix=$PREFIX
NOTE: Don’t forget to set `$PREFIX` as the prefix for every dependent component. `$PREFIX` is a reserved environment variable; I will describe it later, but if you forget to add `$PREFIX` to the compilation options, the component will not be packaged into your conda package.
Build a Package
Let’s build your package. This command creates a conda package archived as a tar.bz2 file.
$ conda build .
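Once the build succeeds, you can confirm that everything compiled with `$PREFIX` actually landed in the archive by listing its contents (the exact path and build number may differ on your machine):

$ tar tjf /opt/conda/conda-bld/linux-64/mecab-0.996-0.tar.bz2 | head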
You can then install the package from the local build files.
$ conda install mecab --use-local
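As a quick sanity check, you can try the Python wrapper (assuming it was installed as the `MeCab` module, with an arbitrary Japanese sample sentence):

$ python -c 'import MeCab; print(MeCab.Tagger().parse("すもももももももものうち"))'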
Upload and Install Your Package from Anaconda Repository
If you want to distribute your package, you can use the Anaconda repository. This requires the anaconda client, which you can install via conda.
$ conda install anaconda-client
After creating your anaconda.org account, you can log in and upload your package.
$ anaconda login
# Input your anaconda.org username and password…
$ anaconda upload /opt/conda/conda-bld/linux-64/mecab-0.996-1.tar.bz2
Now, we can install the package from the Anaconda repository.
$ conda install -c <your-anaconda-name> mecab
NOTE: Replace <your-anaconda-name> with your anaconda.org username.
How to use the package with PySpark on CDSW
You can create a conda environment with a C extension as follows:
$ conda create --copy -q -y -c chezou -n mecab_env python=2 mecab
After creating the environment, the directory structure is as follows:
.conda/envs/mecab_env/
  bin/
  conda-meta/
  etc/
  include/
  lib/
  libexec/
  share/
  ssl/
If you add `$PREFIX` in your conda build script, conda installs the files into the conda environment appropriately. With the `--copy` option, conda copies dependencies into the environment instead of symlinking them; that’s why you can distribute the C extension.
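You can verify that the `--copy` environment is self-contained by looking for symlinks; any that remain should point inside the environment itself rather than back into the central conda package cache:

$ find ~/.conda/envs/mecab_env -type l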
You can distribute the package together with the conda environment. For more detail, see:
Use your favorite Python library on PySpark cluster with Cloudera Data Science Workbench
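In short, the pattern from that article looks roughly like this (the archive name mecab_env.zip and the MECAB alias are illustrative, not required names): zip the copied environment, ship it to the executors via spark.yarn.dist.archives, and point PYSPARK_PYTHON at the interpreter inside the unpacked archive, for example in spark-defaults.conf.

$ cd ~/.conda/envs && zip -r ~/mecab_env.zip mecab_env

spark.yarn.dist.archives=mecab_env.zip#MECAB
spark.yarn.appMasterEnv.PYSPARK_PYTHON=./MECAB/mecab_env/bin/python
spark.executorEnv.PYSPARK_PYTHON=./MECAB/mecab_env/bin/python

With that in place, executors can import the C extension directly. A minimal, illustrative PySpark snippet (Python 2, matching the environment created above):

# Tokenize Japanese text on the executors with MeCab (illustrative)
def tokenize(text):
    import MeCab  # imported on the executor, from the shipped environment
    tagger = MeCab.Tagger("-Owakati")  # wakati mode: space-separated tokens
    return tagger.parse(text.encode("utf-8")).decode("utf-8").split()

rdd = sc.parallelize([u"すもももももももものうち"])
print(rdd.flatMap(tokenize).collect())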
Summary
Creating a conda recipe enables you to use C-extension-based Python packages on a Spark cluster with Cloudera Data Science Workbench, without installing them on each node. Data scientists can run their favorite packages without modifying the cluster.
To learn more about the Data Science Workbench, visit our website.
Aki Ariga is a Field Data Scientist at Cloudera and a sparklyr contributor.