How-to: Use IPython Notebook with Apache Spark

Categories: How-to Spark

IPython Notebook and Spark’s Python API are a powerful combination for data science.

The developers of Apache Spark have given thoughtful consideration to Python as a language of choice for data analysis. They have developed the PySpark API for working with RDDs in Python, and further support using the powerful IPython shell instead of the built-in Python REPL.

The developers of IPython have invested considerable effort in building the IPython Notebook, a system inspired by Mathematica that allows you to create “executable documents.” IPython Notebooks can integrate formatted text (Markdown), executable code (Python), mathematical formulae (LaTeX), and graphics/visualizations (matplotlib) into a single document that captures the flow of an exploration and can be exported as a formatted report or an executable script. Below are a few pieces on why IPython Notebooks can improve your productivity:

Here I will describe how to set up IPython Notebook to work smoothly with PySpark, allowing a data scientist to document the history of her exploration while taking advantage of the scalability of Spark and Apache Hadoop.

Software Prerequisites

  • IPython: I used IPython 1.x, since I’m running Python 2.6 on CentOS 6. This required me to install a few extra dependencies, like Jinja2, ZeroMQ, pyzmq, and Tornado, to enable the notebook functionality, as detailed in the IPython docs. These requirements apply only to the node on which IPython Notebook (and therefore the PySpark driver) will be running.
  • PySpark: I used the CDH-installed PySpark (1.x) running through YARN-client mode, which is our recommended method on CDH 5.1. It’s easy to use a custom Spark (or any commit from the repo) through YARN as well. Finally, this will also work with Spark standalone mode.

IPython Configuration

This installation workflow loosely follows the one contributed by Fernando Perez here. This should be performed on the machine where the IPython Notebook will be executed, typically one of the Hadoop nodes.

First create an IPython profile for use with PySpark.
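From a shell, this step is the single command `ipython profile create pyspark`. The sketch below wraps the same step in Python so it can be repeated safely; the helper names are hypothetical, and the profile location assumes IPython's default directory under `~/.ipython`.

```python
import os
import subprocess

PROFILE = "pyspark"


def profile_create_command(name=PROFILE):
    """Argument list for IPython's `profile create` subcommand."""
    return ["ipython", "profile", "create", name]


def profile_dir(name=PROFILE):
    """Default location of the profile directory under ~/.ipython."""
    return os.path.expanduser(os.path.join("~", ".ipython", "profile_" + name))


def create_profile(name=PROFILE):
    """Create the profile only if its directory does not exist yet."""
    if not os.path.isdir(profile_dir(name)):
        subprocess.check_call(profile_create_command(name))
    return profile_dir(name)
```

Calling `create_profile()` on the notebook host runs the command once and returns the profile directory on later calls.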

This should have created the profile directory ~/.ipython/profile_pyspark/. Edit the file ~/.ipython/profile_pyspark/ to have:
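A minimal sketch of such a configuration, assuming the file in question is the profile's notebook configuration file. The three `NotebookApp` options are standard IPython settings; the port number is an arbitrary example, not a required value.

```python
# get_config() is provided by IPython when it loads a profile config file,
# so this fragment is not meant to be run as a standalone script.
c = get_config()

c.NotebookApp.ip = '*'              # listen on all interfaces, not only localhost
c.NotebookApp.open_browser = False  # the server usually runs on a headless node
c.NotebookApp.port = 8880           # assumed example port; pick one that is free on your cluster
```

Binding to all interfaces makes the server reachable from other machines, which is one reason to add the password prompt described next.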

If you want a password prompt as well, first generate a password for the notebook app:

and set the following in the same .../ file you just edited:
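The two password steps above might look like the following sketch. It assumes the hash is kept in `nbpasswd.txt` inside the profile directory and uses `IPython.lib.passwd`, which prompts for a password and returns a salted hash; the helper names and the `hasher` parameter are hypothetical.

```python
import os

PWDFILE = "~/.ipython/profile_pyspark/nbpasswd.txt"


def store_password_hash(path=PWDFILE, hasher=None):
    """Prompt for a password and save its salted hash to `path`."""
    if hasher is None:
        # IPython.lib.passwd() prompts on the terminal and returns the hash.
        from IPython.lib import passwd as hasher
    path = os.path.expanduser(path)
    with open(path, "w") as handle:
        handle.write(hasher() + "\n")
    os.chmod(path, 0o600)  # readable and writable by the owner only
    return path


def load_password_hash(path=PWDFILE):
    """Read the stored hash back, expanding `~` first."""
    with open(os.path.expanduser(path)) as handle:
        return handle.read().strip()


# In the profile's notebook config file, where `c` is already defined:
# c.NotebookApp.password = open(os.path.expanduser(PWDFILE)).read().strip()
```

Note that `open()` does not expand `~` on its own, so the path goes through `os.path.expanduser` before it is read.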

Finally, create the file ~/.ipython/profile_pyspark/startup/ with the following contents:
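IPython runs every `.py` file in a profile's `startup` directory whenever a kernel starts, which makes it a natural place to put PySpark on the import path and create the SparkContext. The following is a sketch under stated assumptions rather than the exact script: it assumes `SPARK_HOME` points at a Spark installation with the usual `python/` layout, looks up the bundled py4j archive with a glob because its version changes between releases, and uses a hypothetical helper name.

```python
import glob
import os
import sys


def pyspark_sys_paths(spark_home):
    """Paths that must be importable for PySpark: python/ plus the py4j zip."""
    python_dir = os.path.join(spark_home, "python")
    py4j_zips = sorted(glob.glob(os.path.join(python_dir, "lib", "py4j-*-src.zip")))
    return [python_dir] + py4j_zips


spark_home = os.environ.get("SPARK_HOME")
if spark_home:
    # Put PySpark and py4j ahead of anything else on the import path.
    sys.path[:0] = pyspark_sys_paths(spark_home)
    # Run PySpark's own shell bootstrap, which defines `sc` in this namespace.
    shell_py = os.path.join(spark_home, "python", "pyspark", "shell.py")
    exec(compile(open(shell_py).read(), shell_py, "exec"))
else:
    sys.stderr.write("SPARK_HOME is not set; skipping PySpark setup\n")
```

Skipping with a warning instead of raising is a design choice of this sketch: the kernel still starts when Spark is not configured, only without `sc`.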

Starting IPython Notebook with PySpark

IPython Notebook should be run on a machine from which PySpark would normally be run, typically one of the Hadoop nodes.

First, make sure the following environment variables are set:
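As an illustration, the sketch below assumes the two variables involved are `SPARK_HOME` and `PYSPARK_SUBMIT_ARGS`. It assembles a YARN-client option string from standard `spark-submit` flags and reports which variables are still missing. The path and resource numbers are hypothetical placeholders, and the helper names are made up for this example.

```python
def build_submit_args(master, num_executors, executor_memory, executor_cores):
    """Assemble a spark-submit style option string for PYSPARK_SUBMIT_ARGS."""
    return " ".join([
        "--master", master,
        "--num-executors", str(num_executors),
        "--executor-memory", executor_memory,
        "--executor-cores", str(executor_cores),
    ])


def missing_variables(environ, required=("SPARK_HOME", "PYSPARK_SUBMIT_ARGS")):
    """Return the names in `required` that are unset or empty in `environ`."""
    return [name for name in required if not environ.get(name)]


# Hypothetical values; adjust the path and resources to your own cluster.
example_env = {
    "SPARK_HOME": "/opt/cloudera/parcels/CDH/lib/spark",
    "PYSPARK_SUBMIT_ARGS": build_submit_args("yarn-client", 4, "2g", 2),
}
```

In a shell these would be two `export` lines. Calling `missing_variables(os.environ)` just before launching the notebook is a quick way to catch a forgotten setting.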

Note that you must also set any other environment variables needed to run Spark the way you want. For example, the settings above are consistent with running the CDH-installed Spark in YARN-client mode. If you wanted to run your own custom Spark, you could build it, put the JAR on HDFS, and set the SPARK_JAR environment variable along with any other necessary parameters. For example, see here for running a custom Spark on YARN.

Finally, decide from what directory to run the IPython Notebook. This directory will contain the .ipynb files that represent the different notebooks that can be served. See the IPython docs for more information. From this directory, execute:
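The launch itself is the command `ipython notebook --profile=pyspark`, run from the chosen directory. A small sketch of the same step from Python, with hypothetical helper names, makes the role of the working directory explicit:

```python
import subprocess


def notebook_command(profile="pyspark"):
    """Argument list that starts the notebook server with the given profile."""
    return ["ipython", "notebook", "--profile=" + profile]


def launch_notebook(notebook_dir, profile="pyspark"):
    """Start the server with `notebook_dir` as its working directory."""
    return subprocess.call(notebook_command(profile), cwd=notebook_dir)
```

The server serves the `.ipynb` files found in `notebook_dir`, so a dedicated directory keeps the notebook list tidy.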

Note that if you just want to serve the notebooks without initializing Spark, you can start IPython Notebook using a profile that does not include the startup script.

Example Session

At this point, the IPython Notebook server should be running. Point your browser to the notebook server’s host and configured port, which should open up the main access point to the available notebooks. This should look something like this:

This will show the list of possible .ipynb files to serve. If it is empty (because this is the first time you’re running it) you can create a new notebook, which will also create a new .ipynb file. As an example, here is a screenshot from a session that uses PySpark to analyze the GDELT event data set:

The full .ipynb file can be obtained as a GitHub gist.

The notebook itself can be viewed (but not executed) using the public IPython Notebook Viewer.

Learn more about Spark’s role in an enterprise data hub (EDH) here.

Uri Laserson (@laserson) is a data scientist at Cloudera.


23 responses on “How-to: Use IPython Notebook with Apache Spark”

  1. Steve Anton

    Great post! I’m trying to follow along, but it seems like my Spark Context gets launched in local mode. Does this happen to you as well? I’m running PySpark 0.9.0 on CDH 5.0.

    1. Uri Laserson (@laserson) Post author

      Thanks, Steve! I have not experienced that problem, but Spark configuration can be tricky. I would suggest moving this to the spark-user mailing list. Separately, the SparkContext gets instantiated when the script is run in the file that you’ve created. It’s influenced by the PYSPARK_SUBMIT_ARGS argument, which perhaps is not being set correctly?

  2. Nam

    Thanks for the great post! I tried it for my own use. However, I was wondering if it is possible to let multiple users access Spark via ipython notebook. It seems we cannot run multiple ipynb servers at the same time due to the SparkContext restriction, right? Then, is it ok to have multiple ipynb sessions running on the same ipynb server?

    1. Uri Laserson (@laserson) Post author

      I am pretty sure you can run multiple servers. They just have to be on different ports. I am also pretty sure that one server can serve multiple IPython Notebooks at the same time. If I understand correctly, every notebook that is opened will start an IPython kernel, with its own SparkContext. The only issue is whether you can run multiple SparkContexts at the same time. This is definitely possible if you’re using YARN. Standalone mode is probably a problem, as I believe it will take all the resources on the cluster.

  3. Nam

    @Uri: I have Spark in standalone mode cluster, and that’s why I can only have 1 SparkContext at a time. Your reference is so helpful, I will try YARN soon. Thanks!

    @Steve: you may want to try launching the ipynb server using the following:

    IPYTHON_OPTS="notebook --pylab inline --profile pyspark" /path/to/pyspark --master

  4. Hari Sekhon

    Funny timing, I wrote a python script to handle PySpark integrated IPython Notebook setup for my users just 2 weeks back.

    A couple points looking at this blog post today:

    1. It seems to be executing in LOCAL mode rather than YARN mode
    2. --master doesn’t seem to work in PYSPARK_SUBMIT_ARGS, although it does as an arg to pyspark (using 1.0.2)

    You can clone the git repo below and run ‘’ to try:


    Hari Sekhon

  5. Hari Sekhon

    Ah, I see looking back at the comments I’m not the only person that has found this. If you call pyspark, beware that it resets PYSPARK_SUBMIT_ARGS="".

    Nam – yes you can have multiple, I wrote the script above to give each of my users their own PySpark integrated IPython Notebook – with a password they define at prompt the first time they run the script. It writes the configs and they can see the IP and port to connect to in the output (the script tries to figure out the IP of the interface with default gateway so it’s not reporting

    The script also supports MASTER and PYSPARK_SUBMIT_ARGS environment variables so you can override the options for local vs standalone vs yarn mode or num executors / memory / cpu cores.

    It’s about 7 secs slower to initially start on YARN due to initializing and connecting to a new Application Master. After that it’s about the same for successive requests.

    I also found a library issue with python not finding the pyspark library on cluster nodes. You probably won’t notice if you’re only running on a single node sandbox vm. I’ve written a fix/workaround for that into the script using SPARK_YARN_USER_ENV as well as a few other pathing things like YARN_CONF_DIR and globbing of available py4j-* since that will probably change on you.

    I’d recommend just running that script to handle the setup, otherwise it’s quite tedious and tricky to get right…


    Hari Sekhon

  6. Hari Sekhon

    To be clear that YARN vs LOCAL mode and PYSPARK_SUBMIT_ARGS issue was solved for me by not calling pyspark and using the script I wrote to handle all the setup, tweaks and fixes.


    Hari Sekhon.

    1. Uri Laserson (@laserson) Post author

      Note that in my formulation, you do NOT call pyspark. Rather, you call the regular ipython executable with the pyspark profile. This should make sure the PYSPARK_SUBMIT_ARGS is treated correctly.

  7. Nischal (@nischalhp)

    I am trying to set this up, but I am not able to use SparkContext at all. When I start the ipython notebook, is there something I need to look for?

    Any help would be appreciated.

    1. Uri Laserson (@laserson) Post author

      Would you mind following up on this on the Spark users mailing list? Also, it would be helpful if you gave more details about what specific errors you’re getting.

  8. Arindam Paul

    @NISCHALHP: Which version of Python are you using? Is it 3.4?

    if so, you need to change the print (it should have ‘()’), as shown below,
    python -c 'from IPython.lib import passwd; print (passwd())' > ~/.ipython/profile_pyspark/nbpasswd.txt

    Also check which version of py4j you have in your $SPARK_HOME

    You may have python/lib/

  9. allxone

    I’m running Spark in Yarn client mode on a secure CDH 5.2. Any suggestion to let the notebook use Kerberos to authenticate with Yarn? Would be great also being able to impersonate Kerberos authenticated remote users.


  10. Lukas

    Thank you very much for posting this. I am a great fan of the ipython notebook and this will prove to increase the capabilities in our team a lot.
    I actually went ahead and installed python 2.7 in my user directory to be able to run the latest version of ipython.
    Some notes on the blog post: there is a typo in the configuration script: NoteBook instead of Notebook, also the indentation in the start script is a bit off.

  11. praveen

    What is the difference between the above procedure and the below command. The command starts a notebook and creates a spark context.

    IPYTHON_OPTS="notebook --pylab inline" ./bin/pyspark

  12. Bin Wang

    Thanks for this great post Uri, I have one question related to the environment. My cluster is redhat6.5 where python2.6 comes as default. Does that mean that I have to use iPython1.X so it is compatible with Python2.6… I am wondering if there is an easy way to use the latest Anaconda Python which is 2.7, does that mean I also have to install Anaconda Python on every node and make it the default python interpreter(I have a feeling it will totally screw up the existing environment)?

  13. Raj

    Spark context is not available while I am running IPython notebook.
    Anaconda python 2.7. CDH Spark. Yarn client

    Got this exception message –
    Exception in thread “main” java.lang.UnsupportedClassVersionError: org/apache/hadoop/fs/FSDataInputStream : Unsupported major.minor version 51.0
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClass(
    at Method)
    at java.lang.ClassLoader.loadClass(
    at sun.misc.Launcher$AppClassLoader.loadClass(
    at java.lang.ClassLoader.loadClass(
    at org.apache.spark.deploy.SparkSubmitDriverBootstrapper$.main(SparkSubmitDriverBootstrapper.scala:71)
    at org.apache.spark.deploy.SparkSubmitDriverBootstrapper.main(SparkSubmitDriverBootstrapper.scala)

  14. Arnab Dutta

    Finally set this up on a CDH5 cluster on CentOS. For the latest ipython 4.0, there were issues loading the profile=pyspark, as it was not found when starting the notebook. Then I used this:
    jupyter notebook --no-browser --ip="*"

    This started the ipython notebook on localhost:port.
    Then run this code in the ipython console
    import findspark
    findspark.init()  # adds PySpark to sys.path; must run before importing pyspark
    import pyspark
    sc = pyspark.SparkContext()

    # you may need to pip install findspark

  15. Eric

    I have a Cloudera 5 distribution with Spark 1.3. I installed IPython 1.2.1 to match the Python 2.6.6 on CentOS 6. I followed this tutorial by placing the 2 files (’’ & ‘’; with proper Spark Directory listed) in my management node home directory, then I SSH’d into the management node and first created the environment variable for SPARK_HOME. Then I launched “ipython notebook --profile=pyspark”. However, after launching a Python 2 notebook in the IPython Notebook browser, the “from pyspark import SparkConf, SparkContext” and “sc” commands both returned import errors, not recognizing the libraries.

    I tried a different way by first declaring
    export SPARK_HOME='/opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p898.573/lib/spark'
    export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/build:$PYTHONPATH
    export PYSPARK_DRIVER_PYTHON_OPTS="notebook --profile=pyspark"

    And launching with “pyspark”

    This gave me a more promising error: “ImportError: No module named ‘SocketServer’” when trying to run “from pyspark.context import SparkContext”

    Any ideas?
