Create conda recipe to use C extended Python library on PySpark cluster with Cloudera Data Science Workbench

Cloudera Data Science Workbench provides data scientists with secure access to enterprise data with Python, R, and Scala. In the previous article, we introduced how to use your favorite Python libraries on an Apache Spark cluster with PySpark. In the Python world, data scientists often want to use libraries, such as XGBoost, that include C/C++ extensions. This post shows how to package such a library by creating a conda recipe with a C extension. The sample repository is here.

For more detail on conda packages, see the official documentation. For C extensions in particular, this tutorial is a useful resource.

(Optional) Prepare Anaconda Docker image

We developed this recipe in an Anaconda image for Docker. Building inside a Docker container isolates the environment, which helps ensure library compatibility and prevents problems caused by mixing system-installed and Anaconda-installed libraries. macOS users, in particular, need to build inside a Docker container because the C libraries that ship with macOS are not compatible with what’s needed here.

Let’s prepare the Docker image from Anaconda. Run the following command on your host machine.
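A minimal sketch of this step, assuming the publicly available `continuumio/anaconda` image on Docker Hub. The post does not name the image, so treat that name as an assumption. The guard only keeps the snippet harmless on a machine without Docker.

```shell
# Assumption: continuumio/anaconda is the Anaconda image meant here.
IMAGE="continuumio/anaconda"

if command -v docker >/dev/null 2>&1; then
  # Download the image, then open an interactive shell in a new container.
  docker pull "$IMAGE"
  docker run -i -t "$IMAGE" /bin/bash
else
  echo "docker not found; install Docker before continuing" >&2
fi
```

The remaining steps are meant to be run from the shell inside that container.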

Note: Because of a glibc incompatibility, you cannot use a Docker image to build a package for older Linux distributions such as CentOS 6.

Prepare Conda Environment

According to the official documentation, GCC and G++ are required. If you don’t have development tools installed, install them first. For this recipe, install the required packages as follows (example commands for Debian/Ubuntu):
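One way to do this on a Debian or Ubuntu base, sketched as a small function so that nothing is installed until you call it. The choice of `build-essential` (which pulls in gcc, g++, and make) is an assumption about what the recipe needs, and the function name is hypothetical.

```shell
# Hypothetical helper: installs the compiler toolchain on Debian/Ubuntu.
# Assumption: build-essential covers the compilers this recipe needs.
install_build_tools() {
  apt-get update && apt-get install -y build-essential
}

# Call it as root inside the container:
#   install_build_tools
```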

Before you start development, install the conda build tool (conda-build).
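For example, assuming conda is already on the PATH (as it would be inside an Anaconda image), the build tool can be installed as an ordinary conda package:

```shell
# conda-build provides the "conda build" subcommand used in later steps.
BUILD_TOOL="conda-build"

if command -v conda >/dev/null 2>&1; then
  conda install -y "$BUILD_TOOL"
else
  echo "conda not found; run this inside the Anaconda environment" >&2
fi
```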

Write Your Recipe

After creating the working directory, write meta.yaml to set the package name, version, source repository, and dependencies. In this case, we will make a package named mecab.

Example recipe:
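A minimal sketch of what such a meta.yaml might contain, written from the shell so it can be pasted into the container. Only the package name comes from this post. The working directory name, the version number, the upstream Git URL, and the dependency list are assumptions for illustration and are not copied from the sample repository.

```shell
# Hypothetical working directory for the recipe.
mkdir -p mecab

# The quoted 'EOF' keeps the YAML literal (no shell expansion).
cat > mecab/meta.yaml <<'EOF'
package:
  name: mecab
  # Assumption: the upstream release being packaged.
  version: "0.996"

source:
  # Assumption: build from the upstream Git repository.
  git_url: https://github.com/taku910/mecab.git

requirements:
  # Assumption: the package ships Python bindings, so Python is needed
  # both when building and when running.
  build:
    - python
  run:
    - python
EOF
```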

Write Your Build Script

Write build.sh to compile your package.

Example script:
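A minimal sketch of such a build script, covering only the C library. The `mecab` subdirectory and the `--with-charset=utf8` option are assumptions about the upstream source tree. A full recipe would also build the Python binding and a dictionary against the same `$PREFIX`.

```shell
# Hypothetical working directory for the recipe.
mkdir -p mecab

# The quoted 'EOF' keeps $PREFIX literal so conda-build expands it later.
cat > mecab/build.sh <<'EOF'
#!/bin/bash
# Stop at the first failing step so a broken compile is not packaged.
set -e

# Assumption: the checkout keeps the C sources in a "mecab" subdirectory.
cd mecab

# Install under $PREFIX so the files end up in the conda package.
./configure --prefix="$PREFIX" --with-charset=utf8
make
make install
EOF

chmod +x mecab/build.sh
```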

NOTE: Don’t forget to set the install prefix of every dependent component to $PREFIX. $PREFIX is an environment variable reserved by conda-build. As described later, if you leave $PREFIX out of the compilation options, the component will not be included in your conda package.

Build a Package

Let’s build your package. This command creates a conda package as a compressed tarball (typically .tar.bz2).
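A sketch of the build step, assuming the recipe files live in a directory named `mecab` (a hypothetical name). The `--output` form only prints the path of the package file that the build produces.

```shell
# Hypothetical recipe directory containing meta.yaml and build.sh.
RECIPE_DIR="mecab"

if command -v conda >/dev/null 2>&1; then
  # Build the package from the recipe directory.
  conda build "$RECIPE_DIR"
  # Print the path of the resulting package file.
  conda build "$RECIPE_DIR" --output
else
  echo "conda not found; run this inside the Anaconda environment" >&2
fi
```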

After the build succeeds, you can install the package from the local build output.
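For instance, assuming the package was built on the same machine, conda can resolve it from the local build channel:

```shell
PACKAGE="mecab"

if command -v conda >/dev/null 2>&1; then
  # --use-local makes conda search the local conda-build output.
  conda install -y --use-local "$PACKAGE"
else
  echo "conda not found; run this inside the Anaconda environment" >&2
fi
```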

Upload and Install Your Package from Anaconda Repository

If you want to distribute your package, you can use the Anaconda repository. Uploading requires the Anaconda client, which you can install via conda.
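The client is published as the `anaconda-client` package, so a sketch of the install looks like this:

```shell
CLIENT_PACKAGE="anaconda-client"

if command -v conda >/dev/null 2>&1; then
  # Provides the "anaconda" command used for login and upload.
  conda install -y "$CLIENT_PACKAGE"
else
  echo "conda not found; run this inside the Anaconda environment" >&2
fi
```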

After creating your account, you can log in and upload your package.
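A sketch of the login and upload, assuming the recipe directory is named `mecab` (hypothetical) and that the package has already been built. The file to upload is located with `conda build --output` rather than a hard-coded path.

```shell
# Hypothetical recipe directory used for the build.
RECIPE_DIR="mecab"

if command -v anaconda >/dev/null 2>&1 && command -v conda >/dev/null 2>&1; then
  # Prompts interactively for your account credentials.
  anaconda login
  # Upload the package file that the build produced for this recipe.
  anaconda upload "$(conda build "$RECIPE_DIR" --output)"
else
  echo "anaconda client or conda not found" >&2
fi
```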

Now we can install the package from the Anaconda repository.
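A sketch of the install, using your account name as the channel. The variable name is hypothetical and its value is a placeholder.

```shell
# Placeholder: the channel is the account the package was uploaded to.
ANACONDA_USER="<your-anaconda-name>"

if command -v conda >/dev/null 2>&1; then
  # -c adds that account's channel to the package search.
  conda install -y -c "$ANACONDA_USER" mecab
else
  echo "conda not found; run this inside the Anaconda environment" >&2
fi
```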

NOTE: Replace <your-anaconda-name> with your username.

How to use the package with PySpark on CDSW

You can create a conda environment with a C extension as follows:
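A sketch of the environment creation. The environment name `mecab_env` is hypothetical, and the channel placeholder should be replaced with your Anaconda username.

```shell
# Placeholder channel and hypothetical environment name.
ANACONDA_USER="<your-anaconda-name>"
ENV_NAME="mecab_env"

if command -v conda >/dev/null 2>&1; then
  # --copy installs real copies instead of links, so the environment
  # directory is self-contained and can be archived and shipped.
  conda create -y -q --copy -n "$ENV_NAME" -c "$ANACONDA_USER" mecab
else
  echo "conda not found; run this inside the Anaconda environment" >&2
fi
```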

After creating the environment, the directory structure is as follows:

If you use $PREFIX in your build script, conda installs the compiled components under the conda environment. With the --copy option, conda copies dependencies into the environment instead of hard- or soft-linking them, which is why the C extension can be distributed.

You can then distribute the package together with the conda environment. For details, see:

Use your favorite Python library on PySpark cluster with Cloudera Data Science Workbench


With Cloudera Data Science Workbench, creating a conda recipe lets you use C-extension-based Python packages on a Spark cluster without installing them on each node. Data scientists can run their favorite packages without modifying the cluster.

To learn more about Cloudera Data Science Workbench, visit our website.

Aki Ariga is a Field Data Scientist at Cloudera and sparklyr contributor


2 responses on “Create conda recipe to use C extended Python library on PySpark cluster with Cloudera Data Science Workbench”

  1. Varadarajan Ganesan

    Does this approach actually distribute the usage of XGBoost across all clusters? Or it allows to run XGBoost on the spark driver alone? Thanks for the clarification.