How-to: Use IPython Notebook with Apache Spark

IPython Notebook and Spark’s Python API are a powerful combination for data science.

The developers of Apache Spark have given thoughtful consideration to Python as a language of choice for data analysis. They have developed the PySpark API for working with RDDs in Python, and further support using the powerful IPython shell instead of the built-in Python REPL.

The developers of IPython have invested considerable effort in building the IPython Notebook, a system inspired by Mathematica that allows you to create “executable documents.” IPython Notebooks can integrate formatted text (Markdown), executable code (Python), mathematical formulae (LaTeX), and graphics/visualizations (matplotlib) into a single document that captures the flow of an exploration and can be exported as a formatted report or an executable script.

Here I will describe how to set up IPython Notebook to work smoothly with PySpark, allowing a data scientist to document the history of her exploration while taking advantage of the scalability of Spark and Apache Hadoop.

Software Prerequisites

  • IPython: I used IPython 1.x, since I’m running Python 2.6 on CentOS 6. This required me to install a few extra dependencies, like Jinja2, ZeroMQ, pyzmq, and Tornado, to enable the notebook functionality, as detailed in the IPython docs. Note that these requirements apply only to the node on which IPython Notebook (and therefore the PySpark driver) will be running.
  • PySpark: I used the CDH-installed PySpark (1.x) running in YARN-client mode, which is our recommended method on CDH 5.1. It’s easy to use a custom Spark build (or any commit from the repo) through YARN as well, and this setup also works with Spark standalone mode.

IPython Configuration

This installation workflow loosely follows the one contributed by Fernando Perez here. This should be performed on the machine where the IPython Notebook will be executed, typically one of the Hadoop nodes.

First create an IPython profile for use with PySpark.
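
With IPython 1.x, this is a single command:

    ipython profile create pyspark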

This should have created the profile directory ~/.ipython/profile_pyspark/. Edit the file ~/.ipython/profile_pyspark/ipython_notebook_config.py to have:
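
Something along these lines works; the ‘*’ binding, the disabled browser launch, and port 8880 are choices to match the rest of this post, so adjust them for your environment:

    # Configuration for the 'pyspark' notebook profile
    c = get_config()

    # Listen on all interfaces (the server typically runs on a remote Hadoop node),
    # don't try to open a local browser, and serve on port 8880.
    c.NotebookApp.ip = '*'
    c.NotebookApp.open_browser = False
    c.NotebookApp.port = 8880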

If you want a password prompt as well, first generate a password for the notebook app:
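
For the Python 2 setup used here, IPython’s passwd() helper will prompt for a password and write out a salted hash, for example:

    python -c 'from IPython.lib import passwd; print passwd()' > ~/.ipython/profile_pyspark/nbpasswd.txt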

and set the following in the same .../ipython_notebook_config.py file you just edited:
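
Something like the following works; it reads the hash back from the nbpasswd.txt file generated above:

    import os

    # Read the salted password hash written by the passwd() one-liner above.
    PWDFILE = '~/.ipython/profile_pyspark/nbpasswd.txt'
    c.NotebookApp.password = open(os.path.expanduser(PWDFILE)).read().strip()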

Finally, create the file ~/.ipython/profile_pyspark/startup/00-pyspark-setup.py with the following contents:
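
A sketch of that startup file, assuming SPARK_HOME is set in the environment before the notebook server starts (the py4j zip version in the path varies by Spark release, as noted in the comments below):

    import os
    import sys

    # SPARK_HOME must point at the Spark installation; set it before starting IPython.
    spark_home = os.environ.get('SPARK_HOME', None)
    if not spark_home:
        raise ValueError('SPARK_HOME environment variable is not set')

    # Put PySpark and its bundled py4j on the Python path.
    sys.path.insert(0, os.path.join(spark_home, 'python'))
    sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.8.1-src.zip'))

    # Run PySpark's shell.py, which instantiates the SparkContext as `sc`
    # and respects PYSPARK_SUBMIT_ARGS.
    execfile(os.path.join(spark_home, 'python/pyspark/shell.py'))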

Starting IPython Notebook with PySpark

IPython Notebook should be run on the machine from which PySpark will be run, typically one of the Hadoop nodes.

First, make sure the following environment variables are set:
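
For the setup described here they look something like the following; the SPARK_HOME path is a typical CDH parcel location, and the executor count, memory, and cores are just placeholders for whatever resources you want to request:

    export SPARK_HOME='/opt/cloudera/parcels/CDH/lib/spark'
    export PYSPARK_SUBMIT_ARGS='--master yarn --deploy-mode client --num-executors 24 --executor-memory 10g --executor-cores 5'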

Note that you must also set whatever other environment variables you need to get Spark running the way you want. For example, the settings above are consistent with running the CDH-installed Spark in YARN-client mode. If you wanted to run your own custom Spark, you could build it, put the JAR on HDFS, and set the SPARK_JAR environment variable, along with any other necessary parameters. For example, see here for running a custom Spark on YARN.
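
As a sketch, the SPARK_JAR setting might look something like this (the HDFS path is only an illustration):

    export SPARK_JAR=hdfs:///user/spark/share/lib/spark-assembly.jar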

Finally, decide from what directory to run the IPython Notebook. This directory will contain the .ipynb files that represent the different notebooks that can be served. See the IPython docs for more information. From this directory, execute:
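
With the profile created above, that is:

    ipython notebook --profile=pyspark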

Note that if you just want to serve the notebooks without initializing Spark, you can start IPython Notebook using a profile that does not execute the shell.py script in the startup file.
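
For example, starting from the same directory with the default profile serves the same notebooks but never initializes Spark:

    ipython notebook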

Example Session

At this point, the IPython Notebook server should be running. Point your browser to http://my.host.com:8880/, which should open up the main access point to the available notebooks.

This page shows the list of available .ipynb files to serve. If it is empty (because this is the first time you’re running the server), you can create a new notebook, which will also create a new .ipynb file. As an example, I used PySpark from a notebook to analyze the GDELT event data set.
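
Once a notebook is open, the SparkContext created by shell.py is available as sc. A minimal first cell, just as a sanity check (not part of the GDELT analysis itself), might be:

    # `sc` is the SparkContext instantiated by pyspark/shell.py at kernel startup.
    print sc

    # Run a trivial distributed job to confirm that executors are reachable.
    print sc.parallelize(xrange(1000)).map(lambda x: x * 2).take(5)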

The full .ipynb file can be obtained as a GitHub gist.

The notebook itself can be viewed (but not executed) using the public IPython Notebook Viewer.

Uri Laserson (@laserson) is a data scientist at Cloudera.


16 Responses
  • Steve Anton / August 14, 2014 / 12:06 PM

    Great post! I’m trying to follow along, but it seems like my Spark Context gets launched in local mode. Does this happen to you as well? I’m running PySpark 0.9.0 on CDH 5.0.

    • Uri Laserson (@laserson) / August 14, 2014 / 1:29 PM

      Thanks, Steve! I have not experienced that problem, but Spark configuration can be tricky. I would suggest moving this to the spark-user mailing list. Separately, the SparkContext gets instantiated when the shell.py script is run from the 00-pyspark-setup.py startup file that you’ve created. It’s influenced by the PYSPARK_SUBMIT_ARGS environment variable, which perhaps is not being set correctly?

  • Nam / August 14, 2014 / 8:51 PM

    Thanks for the great post! I tried it for my own use. However, I was wondering if it is possible to let multiple users access Spark via IPython Notebook. It seems we cannot run multiple ipynb servers at the same time due to the SparkContext restriction, right? Then, is it OK to have multiple ipynb sessions running on the same ipynb server?

    • Uri Laserson (@laserson) / August 15, 2014 / 12:15 AM

      I am pretty sure you can run multiple servers. They just have to be on different ports. I am also pretty sure that one server can serve multiple IPython Notebooks at the same time. If I understand correctly, every notebook that is opened will start an IPython kernel, with its own SparkContext. The only issue is whether you can run multiple SparkContexts at the same time. This is definitely possible if you’re using YARN. Standalone mode is probably a problem, as I believe it will take all the resources on the cluster.

  • Nam / August 17, 2014 / 11:35 AM

    @Uri: I have a Spark standalone-mode cluster, and that’s why I can only have one SparkContext at a time. Your reply is very helpful; I will try YARN soon. Thanks!

    @Steve: you may want to try launching the ipynb server using the following:

    IPYTHON_OPTS="notebook --pylab inline --profile pyspark" /path/to/pyspark --master

  • Hari Sekhon / August 22, 2014 / 8:53 AM

    Funny timing, I wrote a python script to handle PySpark integrated IPython Notebook setup for my users just 2 weeks back.

    A couple points looking at this blog post today:

    1. It seems to be executing in LOCAL mode rather than YARN mode
    2. --master doesn’t seem to work in PYSPARK_SUBMIT_ARGS, although it does as an arg to pyspark (using 1.0.2)

    You can clone the git repo below and run ‘ipython-notebook-pyspark.py’ to try:

    https://github.com/harisekhon/toolbox

    Regards,

    Hari Sekhon

  • Hari Sekhon / August 22, 2014 / 9:32 AM

    Ah, I see, looking back at the comments, that I’m not the only person who has found this. If calling pyspark, beware that it resets PYSPARK_SUBMIT_ARGS="".

    Nam – yes, you can have multiple. I wrote the script above to give each of my users their own PySpark-integrated IPython Notebook, with a password they define at a prompt the first time they run the script. It writes the configs, and they can see the IP and port to connect to in the output (the script tries to figure out the IP of the interface with the default gateway so that it’s not reporting http://0.0.0.0:8888).

    The script also supports MASTER and PYSPARK_SUBMIT_ARGS environment variables so you can override the options for local vs standalone vs yarn mode or num executors / memory / cpu cores.

    It’s about 7 secs slower to initially start on YARN due to initializing and connecting to a new Application Master. After that it’s about the same for successive requests.

    I also found an issue with Python not finding the pyspark library on cluster nodes. You probably won’t notice it if you’re only running on a single-node sandbox VM. I’ve written a fix/workaround for that into the script using SPARK_YARN_USER_ENV, as well as a few other pathing things like YARN_CONF_DIR and globbing of the available py4j-*-src.zip, since that will probably change on you.

    I’d recommend just running that script to handle the setup, otherwise it’s quite tedious and tricky to get right…

    Regards,

    Hari Sekhon

  • Hari Sekhon / August 22, 2014 / 9:34 AM

    To be clear, that YARN vs. local mode and PYSPARK_SUBMIT_ARGS issue was solved for me by not calling pyspark and instead using the script I wrote to handle all the setup, tweaks, and fixes.

    Regards,

    Hari Sekhon.

    • Uri Laserson (@laserson) / August 22, 2014 / 10:59 AM

      Note that in my formulation, you do NOT call pyspark. Rather, you call the regular ipython executable with the pyspark profile. This should make sure that PYSPARK_SUBMIT_ARGS is treated correctly.

  • Nischal (@nischalhp) / October 06, 2014 / 10:58 PM

    I am trying to set this up, but I am not able to use the SparkContext at all. When I start the IPython Notebook, is there something I need to look for?

    Any help would be appreciated.

    • Uri Laserson (@laserson) / October 07, 2014 / 12:17 AM

      Would you mind following up on this on the Spark users mailing list? Also, it would be helpful if you gave more details about what specific errors you’re getting.

  • Nischal (@nischalhp) / October 06, 2014 / 10:59 PM

    Also I am running a standalone version of Spark.

  • Arindam Paul / October 07, 2014 / 8:39 PM

    @nischalhp: Which version of Python are you using? Is it 3.4?

    If so, you need to change the print statement (it should have parentheses), as shown below:
    python -c 'from IPython.lib import passwd; print(passwd())' > ~/.ipython/profile_pyspark/nbpasswd.txt

    Also check which version of py4j you have in your $SPARK_HOME: python/lib/py4j-0.8.1-src.zip.

    You may have python/lib/py4j-0.8.2.1-src.zip instead.

  • allxone / October 21, 2014 / 10:35 PM

    I’m running Spark in YARN-client mode on a secure CDH 5.2. Any suggestions for letting the notebook use Kerberos to authenticate with YARN? It would also be great to be able to impersonate Kerberos-authenticated remote users.

    Regards,
    Stefano

  • Lukas / November 20, 2014 / 9:43 AM

    Thank you very much for posting this. I am a great fan of the IPython Notebook, and this will prove to increase the capabilities of our team a lot.
    I actually went ahead and installed Python 2.7 in my user directory to be able to run the latest version of IPython.
    Some notes on the blog post: there is a typo in the configuration script (NoteBook instead of Notebook), and the indentation in the startup script is a bit off.

  • praveen / November 21, 2014 / 3:57 AM

    What is the difference between the above procedure and the command below? The command starts a notebook and creates a SparkContext.

    IPYTHON_OPTS="notebook --pylab inline" ./bin/pyspark
