Making Python on Apache Hadoop Easier with Anaconda and CDH

Categories: CDH Cloudera Manager Data Science Spark

Enabling Python development on CDH clusters (for PySpark, for example) is now much easier thanks to new integration with Continuum Analytics’ Python platform (Anaconda).

Python has become an increasingly popular tool for data analysis, including data processing, feature engineering, machine learning, and visualization. Data scientists and data engineers enjoy Python’s rich numerical and analytical libraries—such as NumPy, pandas, and scikit-learn—and have long wanted to apply them to large datasets stored in Apache Hadoop clusters.

While Apache Spark, through PySpark, has made data in Hadoop clusters more accessible to Python users, actually using these libraries on a Hadoop cluster remains difficult. In particular, setting up a full-featured and modern Python environment on every node of a cluster can be challenging, error-prone, and time-consuming.

For these reasons, Continuum Analytics and Cloudera have partnered to create an Anaconda parcel for CDH that enables simple distribution and installation of popular Python packages and their dependencies. Anaconda dramatically simplifies the installation and management of those packages, and this new parcel makes it easy for CDH users to deploy Anaconda across a Hadoop cluster for use in PySpark, Hadoop Streaming, and other contexts where Python is available and useful.

The newly available Anaconda parcel:

  • Includes 300+ of the most popular Python packages
  • Simplifies the installation of Anaconda across a CDH cluster
  • Will be updated with each new Anaconda release

In the remainder of this blog post, you’ll learn how to install and configure the Anaconda parcel, as well as explore an example of training a scikit-learn model on a single node and then using the model to make predictions on data in a cluster.

Installing the Anaconda Parcel

  1. From the Cloudera Manager Admin Console, click the “Parcels” indicator in the top navigation bar.

    (Screenshot: “Parcels” indicator in the Cloudera Manager navigation bar)

  2. Click the “Edit Settings” button on the top right of the Parcels page.

    (Screenshot: “Edit Settings” button on the Parcels page)

  3. Click the plus symbol in the “Remote Parcel Repository URLs” section, and add the following repository URL for the Anaconda parcel: https://repo.continuum.io/pkgs/misc/parcels/

    (Screenshot: “Remote Parcel Repository URLs” setting with the Anaconda repository URL added)

  4. Click the “Save Changes” button at the top of the page.

    (Screenshot: “Save Changes” button)

  5. Click the “Parcels” indicator in the top navigation bar to return to the list of available parcels, where you should see the latest version of the Anaconda parcel that is available.
  6. Click the “Download” button to the right of the Anaconda parcel listing.

    (Screenshot: “Download” button for the Anaconda parcel)

  7. After the parcel is downloaded, click the “Distribute” button to distribute the parcel to all of the cluster nodes.

    (Screenshot: “Distribute” button for the Anaconda parcel)

  8. After the parcel is distributed, click the “Activate” button to activate the parcel on all of the cluster nodes, which will prompt you with a confirmation dialog.

    (Screenshots: “Activate” button and confirmation dialog)

  9. Once the parcel is activated, Anaconda is available on all of the cluster nodes.

    (Screenshot: Anaconda parcel shown as activated)

These instructions are current as of the day of publication. Up-to-date instructions will be maintained in Anaconda’s documentation.

To make Spark aware that you want to use the installed parcel as the Python runtime environment on the cluster, you need to set the PYSPARK_PYTHON environment variable. Spark determines which Python interpreter to use by checking the value of the PYSPARK_PYTHON environment variable on the driver node. With the default configuration for Cloudera Manager and parcels, Anaconda will be installed to /opt/cloudera/parcels/Anaconda, but if the parcel directory for Cloudera Manager has been changed, you will need to adjust the paths in the instructions below to ${YOUR_PARCEL_DIR}/Anaconda/bin/python.

To choose which Python to use on a per-application basis, you can set PYSPARK_PYTHON on the same line as your spark-submit command, as shown below.
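A minimal sketch, assuming the default parcel installation path and a hypothetical script named pyspark_script.py:

    PYSPARK_PYTHON=/opt/cloudera/parcels/Anaconda/bin/python spark-submit pyspark_script.py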

You can also use Anaconda by default in Spark applications while still allowing users to override the value if they wish. To do this, you will need to follow the instructions for Advanced Configuration Snippets and add the following lines to Spark’s configuration:
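The following is a sketch for spark-conf/spark-env.sh, assuming the default parcel path; the -z test ensures that a user-supplied PYSPARK_PYTHON is not overridden.

    if [ -z "${PYSPARK_PYTHON}" ]; then
        export PYSPARK_PYTHON=/opt/cloudera/parcels/Anaconda/bin/python
    fi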

Now, with Anaconda on your CDH cluster, there’s no need to manually install, manage, and provision Python packages on individual Hadoop nodes.

Anaconda in Action

A common workflow for a data scientist using Python is to:

  1. Train a scikit-learn model on a single node.
  2. Save the results to disk.
  3. Apply the trained model using PySpark to generate predictions on a larger dataset.

Let’s take a classic machine-learning classification problem as an example of what having complex Python dependencies from Anaconda installed on a CDH cluster allows you to do.

The MNIST dataset is a canonical machine-learning classification problem that involves recognizing handwritten digits, where each row of the dataset represents one handwritten digit from 0 to 9. The training data you will use is the original MNIST dataset (60,000 rows), and prediction will be done on the MNIST8M dataset (8,000,000 rows). Both of these datasets are available from the libsvm datasets website. MNIST is a standard test for many machine-learning algorithms; more information, including benchmarks, can be found on the MNIST Dataset website.

To train the model on a single node, you will use scikit-learn and then save the model to a file with pickle:
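A minimal sketch of this step, assuming the MNIST training data has been downloaded in libsvm format to a local file named mnist.scale and using a linear SGDClassifier as an illustrative model; the file and variable names here are just examples.

    import pickle
    from sklearn.datasets import load_svmlight_file
    from sklearn.linear_model import SGDClassifier

    # Load the 60,000-row MNIST training set from a local libsvm-format file
    X_train, y_train = load_svmlight_file("mnist.scale", n_features=784)

    # Train a simple linear classifier on a single node
    clf = SGDClassifier()
    clf.fit(X_train, y_train)

    # Serialize the trained model to disk with pickle
    with open("mnist_model.pkl", "wb") as f:
        pickle.dump(clf, f)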

With the classifier now trained, you can save it to disk and then copy it to HDFS.
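For example, the pickled model could be copied into HDFS with the hdfs command-line tool (the /tmp target path is just an illustration):

    hdfs dfs -put mnist_model.pkl /tmp/mnist_model.pkl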

Next, configure and create a SparkContext to run in yarn-client mode:
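A minimal sketch; the application name is arbitrary, and yarn-client was the usual master setting for PySpark on CDH clusters of that era.

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setAppName("mnist-sklearn-predict")
            .setMaster("yarn-client"))
    sc = SparkContext(conf=conf)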

To load the MNIST8M data from HDFS into an RDD:
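A one-line sketch, assuming the MNIST8M file was uploaded to HDFS at /tmp/mnist8m.scale (the path is an assumption):

    mnist8m_rdd = sc.textFile("hdfs:///tmp/mnist8m.scale")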

Now let’s do some preprocessing on this dataset to convert the text to a NumPy array, which will serve as input for the scikit-learn classifier. You’ve installed Anaconda on every cluster node, so both NumPy and scikit-learn are available to the Spark worker processes.
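A sketch of that preprocessing, assuming the libsvm text format used on the libsvm datasets site: each line is a label followed by 1-based index:value pairs, with 784 pixel features per digit.

    import numpy as np

    def parse_libsvm_line(line):
        # "label index:value index:value ..." -> dense 784-element feature vector
        parts = line.strip().split()
        features = np.zeros(784)
        for item in parts[1:]:
            index, value = item.split(":")
            features[int(index) - 1] = float(value)
        return features

    features_rdd = mnist8m_rdd.map(parse_libsvm_line)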

Next, load the trained scikit-learn model that you saved to disk earlier:
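A sketch, assuming the pickled file from the training step (mnist_model.pkl) is available on the local disk of the driver node.

    import pickle

    with open("mnist_model.pkl", "rb") as f:
        classifier = pickle.load(f)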

To apply the trained model to data in a large file in HDFS, you need the trained model available in memory on the executors. To move the classifier from the driver node to all of the Spark workers, you can use the SparkContext.broadcast function:
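A one-line sketch; broadcast_clf is a name introduced here for illustration.

    # Ship the trained classifier to every executor once, rather than with every task
    broadcast_clf = sc.broadcast(classifier)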

This broadcast variable is then available on the executors, so you can use it in logic that needs to be executed on the cluster (inside map or flatMap functions, for example). It is then simple to apply the trained model and save the output to a file:
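A sketch reusing the hypothetical names from the snippets above; the HDFS output path is just an illustration.

    # Score each preprocessed row with the broadcast classifier and write the labels to HDFS
    predictions_rdd = features_rdd.map(
        lambda features: int(broadcast_clf.value.predict(features.reshape(1, -1))[0]))
    predictions_rdd.saveAsTextFile("hdfs:///tmp/mnist8m_predictions")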

To submit this code as a script, add the environment variable declaration at the beginning and then the usual spark-submit command:
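Assuming the script above was saved as pyspark_script.py, the submission would look like the per-application example shown earlier:

    PYSPARK_PYTHON=/opt/cloudera/parcels/Anaconda/bin/python spark-submit pyspark_script.py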

Conclusion

Getting started with Anaconda on your CDH cluster is easy with the newly available parcel. Be sure to check out the Anaconda parcel documentation for more details; support is available through Continuum Analytics.

Juliet Hougland is a Data Scientist at Cloudera.

Kristopher Overholt is a Software Engineer at Continuum Analytics.

Daniel Rodriguez is a Data Scientist at Continuum Analytics.


25 responses on “Making Python on Apache Hadoop Easier with Anaconda and CDH”

  1. Diego

    Hi guys
    I have an error: step 5 doesn’t work on my CDH 5.5. I don’t see the Anaconda parcel available for download.

    Thanks,
    Diego.


  2. Slavo

    Hi,
    really interesting feature, which we could use in our project. We used to have a lot of trouble with Python and how to integrate it with CDH.
    I followed the steps, but got stuck at the activation part. One of the nodes didn’t complete the activation. I can’t cancel it and I can’t restart it.
    Is there any way to identify the process and kill it? Thanks in advance.
    Regards,
    Slavo

      1. Slavo

        Hi Justin,
        no, it was probably just a one-time issue. I had to cancel it and restart. We’re using it already and it looks really good. Exactly what we needed. Thanks a lot!
        Regards,
        Slavo

      1. Andres

        Justin,
        I looked at the directory in my installation and I only see python2.7. Anaconda on my laptop does allow me to change my version of Python (say, to python3.4) by creating environments through the conda tool but I don’t see any evidence of conda on my system. Any hints on what I am missing?

        Thanks!

          1. Michele Chambers

            The Anaconda parcel is a CDH-compatible, relocatable version of the open source Anaconda platform that can be easily installed on your CDH cluster. The current version of the Anaconda parcel is based on Python 2.7.

      1. Michele Chambers

        The Anaconda parcel for CDH is free and does not require a commercial agreement with Continuum Analytics. 

        However, the Anaconda parcel is a CDH-compatible, relocatable version of the open source Anaconda platform that allows you to get started with easy installation of the Anaconda distribution on your CDH cluster. The Anaconda parcel includes 300 packages but does not have the ability to add, create, or publish new packages.

        The commercial Anaconda subscriptions include enterprise-ready features such as: 
        • Repository with 700+ packages
        • Enterprise security and governance
        • Push local environments to all cluster nodes
        • Manage multiple packages and environments (including Python and R) alongside a CDH cluster
        • High-performance scaling up to multiple cores and GPUs
        • Enterprise Notebooks to speed up collaboration among data science teams
        • Integration with on-premises repository and enterprise notebooks

  3. mark

    So why use this via a parcel over just getting the packages from the flavor of Linux you are using?
    I mean, now your stuff has to change runtime environments, etc., and there’s the confusion of whether something is coming from the OS package manager or the CDH parcel.

    Should Cloudera just go ahead and make Cloudera Linux? Seems like that is the route Cloudera wants to go without just doing it.

    1. Justin Kestelyn Post author

      Mark,

      The purpose of the parcel is to automate the config needed to run Python on your CDH cluster. It’s not for installing python packages on “any” Linux system.

  4. Petrus

    Dear,
    I’m trying this in the Cloudera QuickStart VM (v5.5.0).
    I installed and activated the parcel, but when I try to use it, HDFS stops and doesn’t work.
    I received this error:
    INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: there are no corrupt file blocks.

    When I deactivate it, everything works.

    Anyone know what the problem is?

  5. ph

    This blog has the same typo a previous Cloudera blog http://blog.cloudera.com/blog/2015/09/how-to-prepare-your-apache-hadoop-cluster-for-pyspark-jobs/ had, where the spark-env.sh script has the wrong string test command (-n instead of -z), meaning it will never use a user-specified PYSPARK_PYTHON. Probably want to fix this one too…
    Alas, my company is on Hortonworks and I’m guessing there’s no way to install it outside of CDH (?). Cloudera’s features, blogs, and docs always seem better than Hortonworks’.

  6. Ruslan

    That’s awesome. Quick question on this part of the article:
    “To do this, you will need to follow the instructions for Advanced Configuration Snippets and add the following lines to Spark’s configuration:
    if [ -z "${PYSPARK_PYTHON}" ]; then
    export PYSPARK_PYTHON=/opt/cloudera/parcels/Anaconda/bin/python
    fi”

    Do I have to add that in CM to
    “Spark Client Advanced Configuration Snippet (Safety Valve) for spark-conf/spark-env.sh”
    or
    “Spark Service Advanced Configuration Snippet (Safety Valve) for spark-conf/spark-env.sh”
    or both?

    Thank you.

  7. kumar77

    Hi,
    I got the following error when trying to set up the variable:
    PYSPARK_PYTHON=/opt/cloudera/parcels/Anaconda/bin/python spark-submit pyspark_script.py
    /opt/cloudera/parcels/Anaconda/bin/python: can't open file '/etc/profile.d/pyspark_script.py': [Errno 2] No such file or directory

    Need help.
    Rgds,
    Kumar

  8. Mladen Trampic

    Greetings, I have a CDH 5.6 cluster running. I’ve configured the Anaconda parcel and got pyspark from the shell and spark-submit working with it, but I am having difficulty making the HUE Spark notebook work with Anaconda PySpark. Do you have any advice on how to make HUE use the Anaconda parcel when using pyspark in a Spark notebook?
