Cloudera Data Science Workbench provides data scientists with secure access to enterprise data with Python, R, and Scala. In the previous article, we introduced how to use your favorite Python libraries on an Apache Spark cluster with PySpark. In the Python world, data scientists often want to use libraries, such as XGBoost, that include C/C++ extensions. This post shows how to make such a library available by creating a conda recipe for a package with a C extension. The sample repository is here.
(Optional) Prepare Anaconda Docker image
We developed this recipe in an Anaconda image for Docker. Building inside a Docker container keeps the environment isolated, which helps ensure library compatibility and avoids problems caused by mixing system-installed and Anaconda-installed libraries. macOS users, in particular, need to build inside a Docker container because the C libraries that come with macOS are not compatible with what’s needed here.
Let’s prepare the Docker image from Anaconda. Run the following commands on your physical machine.
$ docker pull continuumio/anaconda
$ docker run -i -v $(pwd):/root/mecab -t continuumio/anaconda /bin/bash
Note: Because of a glibc incompatibility, building in this Docker image does not work for packages that target older Linux distributions such as CentOS 6.
Prepare Conda Environment
According to the official documentation, GCC/G++ is required. If you don’t have any development tools installed, install the dependencies first. For this recipe, install the required packages as follows (the example uses Debian/Ubuntu commands):
$ sudo apt-get install g++ autoconf make
Before you start development, install the conda build tool.
$ conda install conda-build
$ conda upgrade conda
$ conda upgrade conda-build
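Before writing the recipe, it can help to confirm that the tools used in this post are actually on your `PATH`. The following is a minimal sketch, not part of the original recipe: `missing_tools` and `REQUIRED_TOOLS` are hypothetical names, and the tool list is an assumption based on the commands that appear in this post.

```python
import shutil

# Assumed tool list, taken from the commands used in this post
# (apt-get packages, the conda CLI, and swig from the build script).
REQUIRED_TOOLS = ["g++", "autoconf", "make", "swig", "conda"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing tools: " + ", ".join(missing))
    else:
        print("All required tools are available.")
```

An empty result means every listed tool was found; anything reported as missing should be installed before running the build.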
Write Your Recipe
Create the working directory, then write `meta.yaml` to set the package name, version, source repository, and dependencies. In this case, we will make a package named `mecab`.
$ cd ~/
$ mkdir mecab
$ cd mecab
package:
  name: mecab
  version: "0.996"

source:
  git_url: https://github.com/taku910/mecab
  git_rev: 32041d9504d11683ef80a6556173ff43f79d1268

build:
  number: 0

requirements:
  build:
    - g++
  run:
    - libgcc

about:
  home: http://taku910.github.io/mecab
  license: BSD,LGPL,GPL
Write Your Build Script
Write build.sh to compile your package.
# Build MeCab
cd mecab
./configure --prefix=$PREFIX --with-charset=utf8
make
make install

# Build dictionary for MeCab
cd ../mecab-ipadic
./configure --with-mecab-config=$PREFIX/bin/mecab-config --prefix=$PREFIX --with-charset=utf8 --with-dicdir=$PREFIX/lib/mecab/dic/ipadic
make
make install

# Build MeCab Python wrapper
cd ../mecab/python
swig -python -shadow -c++ ../swig/MeCab.i
python setup.py build
python setup.py install --prefix=$PREFIX
NOTE: Don’t forget to pass `$PREFIX` as the installation prefix for every component you build. `$PREFIX` is a reserved environment variable that conda build sets to the build prefix. As described later, if you forget to add `$PREFIX` to the compilation options, the component will not be included in your conda package.
Build a Package
Let’s build your package. This command creates a conda package archived as tar.bz2.
$ conda build .
After the build succeeds, you can install the package from the local build.
$ conda install mecab --use-local
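One way to check that nothing was left out of the package, for example because a `--prefix=$PREFIX` option was forgotten, is to list the files in the built archive. The following is a minimal sketch, assuming the package is a `.tar.bz2` archive like the one uploaded later in this post. `list_package_files` and `missing_from_package` are hypothetical helper names, not part of conda.

```python
import tarfile


def list_package_files(package_path):
    """Return the sorted names of regular files in a .tar.bz2 package."""
    with tarfile.open(package_path, "r:bz2") as archive:
        return sorted(m.name for m in archive.getmembers() if m.isfile())


def missing_from_package(package_path, expected):
    """Return the entries of `expected` that are absent from the package."""
    packaged = set(list_package_files(package_path))
    return [name for name in expected if name not in packaged]
```

For example, calling `missing_from_package` with the path printed by `conda build` and a list such as `["bin/mecab", "bin/mecab-config"]` returns an empty list when both files were packaged. The location `bin/mecab` is an assumption about where `make install` places the binary.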
Upload and Install Your Package from Anaconda Repository
If you want to distribute your package, you can use the Anaconda repository (anaconda.org). Uploading requires the Anaconda client, which you can install via conda.
$ conda install anaconda-client
After creating your anaconda.org account, you can log in and upload your package.
$ anaconda login
# Input your anaconda.org username and password…
$ anaconda upload /opt/conda/conda-bld/linux-64/mecab-0.996-1.tar.bz2
Now you can install the package from the Anaconda repository.
$ conda install -c <your-anaconda-name> mecab
NOTE: Replace <your-anaconda-name> with your anaconda.org username.
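After installation, a quick smoke test can confirm that the Python wrapper and the dictionary both work. The following is a minimal sketch written for Python 3, so a Python 2 environment would need adjustments. `parse_mecab_output` is a hypothetical helper that assumes MeCab's default output format: one `surface<TAB>features` line per token, terminated by an `EOS` line.

```python
def parse_mecab_output(raw):
    """Split MeCab's default output into (surface, features) pairs.

    Assumes each token line is "surface<TAB>feature1,feature2,..."
    and that a line containing only "EOS" ends the sentence.
    """
    tokens = []
    for line in raw.splitlines():
        if not line or line == "EOS":
            continue
        surface, _, features = line.partition("\t")
        tokens.append((surface, features.split(",")))
    return tokens


if __name__ == "__main__":
    try:
        import MeCab  # provided by the mecab package built in this post
    except ImportError:
        print("MeCab is not installed in this environment.")
    else:
        tagger = MeCab.Tagger()
        raw = tagger.parse("すもももももももものうち")
        for surface, features in parse_mecab_output(raw):
            print(surface, features[0])
```

If the import fails or the tagger cannot find its dictionary, the most likely cause is a component that was installed outside `$PREFIX` during the build.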
How to use the package with PySpark on CDSW
You can create a conda environment with a C extension as follows:
$ conda create --copy -q -y -c chezou -n mecab_env python=2 mecab
After creating the environment, the directory structure is as follows:
.conda/envs/mecab_env/
  bin/
  conda-meta/
  etc/
  include/
  lib/
  libexec/
  share/
  ssl/
If you use `$PREFIX` in your conda build script, conda installs the files under the conda environment appropriately. With the `--copy` option, conda copies dependencies into the environment instead of symlinking them, which is why you can distribute the C extension.
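Because the environment contains real files rather than symlinks, it can be packed into a single archive. The following is a minimal sketch, assuming the environment is shipped to the cluster as a zip file. `archive_conda_env` is a hypothetical helper, and the directory names passed to it are up to you.

```python
import os
import shutil


def archive_conda_env(envs_dir, env_name, output_dir):
    """Zip `envs_dir/env_name` and return the path of the archive.

    The archive keeps `env_name/` as its top-level directory, so
    `env_name/bin/python` is the interpreter path inside the zip.
    """
    base_name = os.path.join(os.path.abspath(output_dir), env_name)
    return shutil.make_archive(
        base_name, "zip", root_dir=envs_dir, base_dir=env_name
    )
```

For the environment created above, this would be called with the `.conda/envs` directory, the name `mecab_env`, and an output directory of your choice, producing `mecab_env.zip`.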
You can distribute the package to the cluster together with a conda environment; the previous article covers this in more detail.
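On a YARN cluster, shipping the environment usually comes down to a few Spark settings. The following is a minimal sketch under stated assumptions, not the exact configuration from this post: the archive name, the alias `MECAB`, and the helper names `conda_env_settings` and `build_session` are hypothetical. `spark.yarn.dist.archives` and `spark.yarn.appMasterEnv.*` are standard Spark-on-YARN properties.

```python
import os


def conda_env_settings(archive_path, env_name, alias="MECAB"):
    """Return (spark_conf, python_path) for a zipped conda environment.

    `alias` is the directory name under which YARN unpacks the archive
    in each container's working directory.
    """
    python_path = "./{0}/{1}/bin/python".format(alias, env_name)
    conf = {
        "spark.yarn.dist.archives": "{0}#{1}".format(archive_path, alias),
        "spark.yarn.appMasterEnv.PYSPARK_PYTHON": python_path,
    }
    return conf, python_path


def build_session(app_name, conf, python_path):
    """Create a SparkSession whose Python workers use the shipped env."""
    from pyspark.sql import SparkSession  # imported lazily; needs PySpark

    # PySpark reads PYSPARK_PYTHON on the driver to pick the worker Python.
    os.environ["PYSPARK_PYTHON"] = python_path
    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

With `conda_env_settings("mecab_env.zip", "mecab_env")`, executors would run `./MECAB/mecab_env/bin/python`, so `import MeCab` inside a PySpark function resolves to the packaged extension.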
Creating a conda recipe enables you to use Python packages based on C extensions on a Spark cluster from Cloudera Data Science Workbench, without installing them on each node. Data scientists can run their favorite packages without modifying the cluster.
To learn more about Cloudera Data Science Workbench, visit our website.
Aki Ariga is a Field Data Scientist at Cloudera and a sparklyr contributor.