How-to: Use IPython Notebook with Apache Spark

IPython Notebook and Spark’s Python API are a powerful combination for data science.

The developers of Apache Spark have given thoughtful consideration to Python as a language of choice for data analysis. They have developed the PySpark API for working with RDDs in Python, and further support using the powerful IPython shell instead of the built-in Python REPL.

The developers of IPython have invested considerable effort in building the IPython Notebook, a system inspired by Mathematica that allows you to create "executable documents." IPython Notebooks can integrate formatted text (Markdown), executable code (Python), mathematical formulae (LaTeX), and graphics/visualizations (matplotlib) into a single document that captures the flow of an exploration and can be exported as a formatted report or an executable script. Plenty has been written elsewhere about why IPython Notebooks can improve your productivity.

Here I will describe how to set up IPython Notebook to work smoothly with PySpark, allowing a data scientist to document the history of her exploration while taking advantage of the scalability of Spark and Apache Hadoop.

Software Prerequisites

  • IPython: I used IPython 1.x, since I’m running Python 2.6 on CentOS 6. This required me to install a few extra dependencies, like Jinja2, ZeroMQ, pyzmq, and Tornado, to enable the notebook functionality, as detailed in the IPython docs. These requirements apply only to the node on which IPython Notebook (and therefore the PySpark driver) will be running.
  • PySpark: I used the CDH-installed PySpark (1.x) running in YARN client mode, which is our recommended method on CDH 5.1. It’s easy to use a custom Spark build (or any commit from the repo) through YARN as well. This setup also works with Spark in standalone mode.

IPython Configuration

This installation workflow loosely follows the one contributed by Fernando Perez here. This should be performed on the machine where the IPython Notebook will be executed, typically one of the Hadoop nodes.

First create an IPython profile for use with PySpark.
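With IPython 1.x, a single command should do it (the profile name just needs to match the directory referenced below):

    ipython profile create pyspark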

 

This should have created the profile directory ~/.ipython/profile_pyspark/. Edit the file ~/.ipython/profile_pyspark/ipython_notebook_config.py to have:
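Something like the following should suffice, assuming you want the server to listen on all interfaces on port 8880 (the port used in the example session below) and not try to open a browser on the server machine:

    # ~/.ipython/profile_pyspark/ipython_notebook_config.py
    c = get_config()

    c.NotebookApp.ip = '*'              # listen on all interfaces
    c.NotebookApp.open_browser = False  # headless server; connect from your own browser
    c.NotebookApp.port = 8880           # matches the URL used in the example session below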

 

If you want a password prompt as well, first generate a password for the notebook app:
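One way to do that, assuming Python 2 and IPython’s passwd() helper, is to hash the password and store it in a file inside the profile directory (nbpasswd.txt is just the name used again below):

    python -c 'from IPython.lib import passwd; print passwd()' \
        > ~/.ipython/profile_pyspark/nbpasswd.txt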

 

and set the following in the same .../ipython_notebook_config.py file you just edited:
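For example, reading back the hash generated above (adjust the path if you stored it elsewhere):

    # also in ~/.ipython/profile_pyspark/ipython_notebook_config.py
    import os.path

    PWDFILE = os.path.expanduser('~/.ipython/profile_pyspark/nbpasswd.txt')
    c.NotebookApp.password = open(PWDFILE).read().strip()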

 

Finally, create the file ~/.ipython/profile_pyspark/startup/00-pyspark-setup.py with the following contents:
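A sketch of that startup script: it assumes SPARK_HOME is set in the environment (see the next section), globs for the bundled py4j zip since its version changes between Spark releases, and finishes by executing PySpark’s own shell.py, which is what instantiates the SparkContext (sc) in the kernel:

    # ~/.ipython/profile_pyspark/startup/00-pyspark-setup.py
    import glob
    import os
    import sys

    spark_home = os.environ.get('SPARK_HOME')
    if not spark_home:
        raise ValueError('SPARK_HOME environment variable is not set')

    # put the PySpark package and the bundled py4j on the Python path
    sys.path.insert(0, os.path.join(spark_home, 'python'))
    for zippath in glob.glob(os.path.join(spark_home, 'python/lib/py4j-*-src.zip')):
        sys.path.insert(0, zippath)

    # run PySpark's interactive bootstrap, which creates the SparkContext `sc`
    execfile(os.path.join(spark_home, 'python/pyspark/shell.py'))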

 

Starting IPython Notebook with PySpark

IPython Notebook should be run on a machine from which you would normally run PySpark, typically one of the Hadoop nodes.

First, make sure the following environment variables are set:
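For the CDH-installed Spark in YARN client mode, they look roughly like the following; the paths assume the CDH parcel layout, the executor sizing is only a placeholder to tune for your cluster, and (as the comments below note) some Spark 1.x releases only honor --master when passed to pyspark directly:

    # where the CDH parcel installs Spark
    export SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark
    # client-side Hadoop/YARN configuration
    export HADOOP_CONF_DIR=/etc/hadoop/conf
    # the options you would normally pass to bin/pyspark; picked up by shell.py
    export PYSPARK_SUBMIT_ARGS="--master yarn-client --num-executors 8 --executor-memory 2g --executor-cores 2"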

 

Note that you must also set whatever other environment variables you need to get Spark running the way you want. For example, the settings above are consistent with running the CDH-installed Spark in YARN client mode. If you want to run your own custom Spark build, you can build it, put the assembly JAR on HDFS, and set the SPARK_JAR environment variable, along with any other necessary parameters. For example, see here for running a custom Spark on YARN.
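Sketching that route (all names and paths here are hypothetical; adjust them for your build and cluster):

    # push your locally built assembly JAR to HDFS once...
    hadoop fs -put assembly/target/scala-2.10/spark-assembly-1.0.2-hadoop2.3.0.jar /user/me/
    # ...and point YARN applications at it
    export SPARK_JAR=hdfs:///user/me/spark-assembly-1.0.2-hadoop2.3.0.jar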

Finally, decide from what directory to run the IPython Notebook. This directory will contain the .ipynb files that represent the different notebooks that can be served. See the IPython docs for more information. From this directory, execute:
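With the profile in place, this comes down to starting the regular ipython executable with that profile (note that you do not call pyspark itself):

    ipython notebook --profile=pyspark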

 

Note that if you just want to serve the notebooks without initializing Spark, you can start IPython Notebook using a profile that does not execute the shell.py script in the startup file.
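For instance, launching with the default profile skips the startup script (and therefore the SparkContext) entirely:

    ipython notebook   # default profile: no 00-pyspark-setup.py, no SparkContext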

Example Session

At this point, the IPython Notebook server should be running. Point your browser to http://my.host.com:8880/, which should open the main dashboard of available notebooks.

This will show the list of available .ipynb files to serve. If it is empty (because this is the first time you’re running it), you can create a new notebook, which will also create a new .ipynb file. As an example, I put together a session that uses PySpark to analyze the GDELT event data set.
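As a quick sanity check in a fresh notebook, a first cell can poke at the sc context that shell.py created; the HDFS path here is purely hypothetical:

    # `sc` is the SparkContext instantiated by pyspark/shell.py at kernel startup
    events = sc.textFile('hdfs:///user/me/gdelt/events.tsv')  # hypothetical path
    print events.take(1)   # pull back one record to confirm the executors respond
    print events.count()   # total number of lines in the file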

The full .ipynb file can be obtained as a GitHub gist.

The notebook itself can be viewed (but not executed) using the public IPython Notebook Viewer.

Uri Laserson (@laserson) is a data scientist at Cloudera.


9 Responses
  • Steve Anton / August 14, 2014 / 12:06 PM

    Great post! I’m trying to follow along, but it seems like my Spark Context gets launched in local mode. Does this happen to you as well? I’m running PySpark 0.9.0 on CDH 5.0.

    • Uri Laserson (@laserson) / August 14, 2014 / 1:29 PM

      Thanks, Steve! I have not experienced that problem, but Spark configuration can be tricky. I would suggest moving this to the spark-user mailing list. Separately, the SparkContext gets instantiated when the shell.py script is run by the 00-pyspark-setup.py startup file that you’ve created. It’s influenced by the PYSPARK_SUBMIT_ARGS environment variable, which perhaps is not being set correctly?

  • Nam / August 14, 2014 / 8:51 PM

    Thanks for the great post! I tried it for my own use. However, I was wondering if it is possible to let multiple users access Spark via IPython Notebook. It seems we cannot run multiple ipynb servers at the same time due to the SparkContext restriction, right? Then, is it OK to have multiple ipynb sessions running on the same ipynb server?

    • Uri Laserson (@laserson) / August 15, 2014 / 12:15 AM

      I am pretty sure you can run multiple servers. They just have to be on different ports. I am also pretty sure that one server can serve multiple IPython Notebooks at the same time. If I understand correctly, every notebook that is opened will start an IPython kernel, with its own SparkContext. The only issue is whether you can run multiple SparkContexts at the same time. This is definitely possible if you’re using YARN. Standalone mode is probably a problem, as I believe it will take all the resources on the cluster.

  • Nam / August 17, 2014 / 11:35 AM

    @Uri: I have a Spark cluster in standalone mode, and that’s why I can only have one SparkContext at a time. Your reference is so helpful, I will try YARN soon. Thanks!

    @Steve: you may want to try launching the ipynb server using the following:

    IPYTHON_OPTS="notebook --pylab inline --profile pyspark" /path/to/pyspark --master

  • Hari Sekhon / August 22, 2014 / 8:53 AM

    Funny timing; I wrote a Python script to handle PySpark-integrated IPython Notebook setup for my users just two weeks back.

    A couple points looking at this blog post today:

    1. It seems to be executing in LOCAL mode rather than YARN mode
    2. --master doesn’t seem to work in PYSPARK_SUBMIT_ARGS, although it does as an arg to pyspark (using 1.0.2)

    You can clone the git repo below and run ‘ipython-notebook-pyspark.py’ to try:

    https://github.com/harisekhon/toolbox

    Regards,

    Hari Sekhon

  • Hari Sekhon / August 22, 2014 / 9:32 AM

    Ah, I see looking back at the comments that I’m not the only person who has found this. If calling pyspark, beware that it resets PYSPARK_SUBMIT_ARGS="".

    Nam: yes, you can have multiple. I wrote the script above to give each of my users their own PySpark-integrated IPython Notebook, with a password they define at a prompt the first time they run the script. It writes the configs, and they can see the IP and port to connect to in the output (the script tries to figure out the IP of the interface with the default gateway so it’s not reporting http://0.0.0.0:8888).

    The script also supports MASTER and PYSPARK_SUBMIT_ARGS environment variables so you can override the options for local vs standalone vs yarn mode or num executors / memory / cpu cores.

    It’s about 7 secs slower to initially start on YARN due to initializing and connecting to a new Application Master. After that it’s about the same for successive requests.

    I also found a library issue with Python not finding the pyspark library on cluster nodes. You probably won’t notice it if you’re only running on a single-node sandbox VM. I’ve written a fix/workaround for that into the script using SPARK_YARN_USER_ENV, as well as a few other pathing things like YARN_CONF_DIR and globbing of the available py4j-*-src.zip, since that will probably change on you.

    I’d recommend just running that script to handle the setup, otherwise it’s quite tedious and tricky to get right…

    Regards,

    Hari Sekhon

  • Hari Sekhon / August 22, 2014 / 9:34 AM

    To be clear, that YARN vs. LOCAL mode and PYSPARK_SUBMIT_ARGS issue was solved for me by not calling pyspark and using the script I wrote to handle all the setup, tweaks, and fixes.

    Regards,

    Hari Sekhon.

    • Uri Laserson (@laserson) / August 22, 2014 / 10:59 AM

      Note that in my formulation, you do NOT call pyspark. Rather, you call the regular ipython executable with the pyspark profile. This should make sure the PYSPARK_SUBMIT_ARGS is treated correctly.
