Use your favorite Python library on PySpark cluster with Cloudera Data Science Workbench

Categories: CDH, Data Science, How-to, Spark

Cloudera Data Science Workbench provides freedom for data scientists. It gives them the flexibility to work with their favorite libraries using isolated environments with a container for each project.

In the JVM world of Java and Scala, using your favorite packages on a Spark cluster is easy: each application bundles its preferred packages into a fat JAR, which gives it an independent environment on the cluster. Many data scientists prefer Python to Scala for data science, but it is not straightforward to use a Python library on a PySpark cluster without modification. To work around this, data scientists have typically been required to use the Anaconda parcel or a shared NFS mount to distribute dependencies. In the Python world, the standard practice is to install packages into isolated environments with virtualenv/venv before running code on your own machine; without virtualenv/venv, packages are installed directly into the system directory, whereas virtualenv/venv keeps each project's packages on their own search path in a dedicated directory. Supporting virtualenv is discussed in this JIRA, but in short, virtualenv is not something Spark will manage for you.

In this post, I will explain how to distribute your favorite Python library to a PySpark cluster from Cloudera Data Science Workbench. Note that this technique is not the standard way in the Python world. The original idea is described in this article.

The Goal

The goal of this article is to run Python code that uses a pure Python library on a distributed PySpark cluster.

Example code:
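
A minimal sketch of that kind of script follows (the repository's nltk_sample.py is the authoritative version); the input file name and output directories below are placeholders, not necessarily what the repository uses:

    from pyspark import SparkConf, SparkContext

    sc = SparkContext(conf=SparkConf().setAppName("spark-nltk"))

    # Placeholder input: any plain-text file reachable by the cluster (e.g. on HDFS).
    data = sc.textFile('1970-Nixon.txt')

    def word_tokenize(x):
        # Import NLTK inside the function so the import is resolved on the executors.
        import nltk
        return nltk.word_tokenize(x)

    def pos_tag(x):
        import nltk
        return nltk.pos_tag([x])

    words = data.flatMap(word_tokenize)
    words.saveAsTextFile('nixon_tokens')        # placeholder output directory

    pos_word = words.map(pos_tag)
    pos_word.saveAsTextFile('nixon_token_pos')  # placeholder output directory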

The whole code is in the GitHub repository.

https://github.com/chezou/NLTK-pyspark

This code uses NLTK, Python's natural language processing library, which is not part of the default conda installation. By putting the import inside your Python UDFs, you can use the library on the executors. So let's distribute NLTK with a conda environment.

With this approach, libraries that require C extensions might not work. For such libraries, I recommend creating a conda recipe; I will cover that in a future article.

Create and pack conda environment

You can pack the Python environment into a conda environment with the conda create command. Open your workbench and run the following in your CDSW terminal:
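
The exact package list is up to your project; for this NLTK sample, something along these lines should work (the package list here is my assumption, and the --copy option is explained below):

    conda create -n nltk_env --copy -y -q python=3 nltk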

If you want to add extra pip packages that are not available via conda, you should copy the packages into the environment manually after running pip install. In Cloudera Data Science Workbench, pip installs packages into ~/.local.
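
For example, you might copy a pip-installed package into the environment like this; the package name and Python version in the paths are purely illustrative, and the environment is assumed to live under ~/.conda/envs as in the previous step:

    pip install some_pure_python_package
    cp -r ~/.local/lib/python3.6/site-packages/some_pure_python_package \
        ~/.conda/envs/nltk_env/lib/python3.6/site-packages/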

Note the --copy option: it copies all dependent packages into the conda environment's own directory instead of linking them, which is needed so the zipped environment is self-contained.

Then zip the conda environment for shipping to the PySpark cluster.
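
A sketch, assuming the environment was created under ~/.conda/envs; this produces nltk_env.zip in your home (project) directory:

    cd ~/.conda/envs
    zip -r ../../nltk_env.zip nltk_env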

(Optional) Prepare additional resources for distribution

If your code requires additional local data, such as taggers, you can distribute that as well by putting the data into HDFS and shipping it as archives. For the sample code, the NLTK tokenizers and taggers are assumed to be on a local disk at a location specified with environment variables.
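
For the NLTK sample, that means downloading the tokenizer and tagger models and archiving them. A sketch, assuming NLTK's default download location of ~/nltk_data and the archive names referenced in the spark-defaults.conf example below:

    # Download the models the sample needs (they go to ~/nltk_data by default).
    python -c "import nltk; nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')"

    # Archive them so they can be shipped alongside the conda environment.
    cd ~/nltk_data/tokenizers
    zip -r ../../tokenizers.zip *
    cd ~/nltk_data/taggers
    zip -r ../../taggers.zip *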

Put spark-defaults.conf in the project root

You can set spark-submit options in spark-defaults.conf. Here is an example:
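
A sketch of what this file can look like for the NLTK sample; the tokenizers.zip and taggers.zip entries are only needed if you prepared the optional NLTK data archives above:

    spark.yarn.appMasterEnv.PYSPARK_PYTHON=./NLTK/nltk_env/bin/python
    spark.yarn.appMasterEnv.NLTK_DATA=./
    spark.executorEnv.NLTK_DATA=./
    spark.yarn.dist.archives=nltk_env.zip#NLTK,tokenizers.zip#tokenizers,taggers.zip#taggers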

By setting spark.yarn.dist.archives=nltk_env.zip#NLTK, PySpark has YARN unpack nltk_env.zip into a directory named NLTK on each node. NLTK_DATA is the environment variable that tells NLTK where its data is located.

Set the environment variable

You should set the environment variable PYSPARK_PYTHON to ./NLTK/nltk_env/bin/python in the project settings. This makes PySpark use the Python interpreter from the distributed conda environment.

Project Settings

After setting environment variables, you should restart your session.

Run your script on CDSW

Now is the time to run your Python script. Open your workbench and run nltk_sample.py.

After running that, you can check the result as follows:
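
For example, if your script wrote its output to the nixon_tokens and nixon_token_pos directories used in the sketch above, you can peek at the results from a terminal (the output paths are whatever your script used):

    hdfs dfs -cat nixon_tokens/* | head -n 20
    hdfs dfs -cat nixon_token_pos/* | head -n 20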

Summary

Creating a conda environment enables you to distribute your favorite Python packages without manual IT intervention, using Cloudera Data Science Workbench. Data scientists can run their favorite packages without modifying the cluster.

To learn more about Cloudera Data Science Workbench, visit our website.

Aki Ariga is a Field Data Scientist at Cloudera and a sparklyr contributor.


2 responses on “Use your favorite Python library on PySpark cluster with Cloudera Data Science Workbench”

  1. Georgi

    Hi, why do you ship a conda environment along with the job, instead of installing this environment across all cluster nodes and pointing the PYSPARK_PYTHON variable to the location of the Python interpreter within the conda environment? I think this is a viable option for an on-premises cluster.

    1. Aki

      Hi Georgi,
      It may be possible to install the environment on all the workers. But that requires permission to install specific Python libraries, and PySpark doesn't handle virtualenv/venv itself. That means it takes other tools to install and isolate Python libraries on the cluster, and it still requires administrator permission. If you don't want to distribute packages each time, you can install them on the cluster before you submit the Spark app.
