How-to: Install Apache Zeppelin on CDH

Our thanks to Karthik Vadla and Abhi Basu, Big Data Solutions engineers at Intel, for permission to re-publish the following (which was originally available here).

Data science is not a new discipline. However, with the growth of big data and the adoption of big data technologies, the demand for better-quality data has grown exponentially. Today data science is applied to every facet of life: product validation through fault prediction, genome sequence analysis, personalized medicine through population studies and a Patient 360 view, credit-card fraud detection, improving customer experience through sentiment analysis and purchase patterns, weather forecasting, detecting cyber or terrorist attacks, aircraft maintenance that uses predictive analytics to repair critical parts before they fail, and many more. Every day, data scientists detect patterns in data and provide actionable insights that influence organizational change.

The data scientist’s work broadly involves the acquisition, cleanup, and analysis of data. Being a cross-functional discipline, this work involves communication, collaboration, and interaction with other individuals, both internal and possibly external to your organization. This is one reason why the “notebook” features in data analysis tools are gaining popularity: they ease organizing, sharing, and interactively working with long workflows. IPython Notebook is a great example but is limited to the Python language. Apache Zeppelin (incubating at the time of this writing) is a new web-based notebook that enables interactive, data-driven analytics and visualization, with the added bonus of supporting multiple languages, including Python, Scala, Spark SQL, Hive, Shell, and Markdown. Zeppelin also provides Apache Spark integration by default, making use of Spark’s fast, in-memory, distributed data-processing engine to accomplish data science at lightning speed.

This post demonstrates how easy it is to install the Apache Zeppelin notebook on CDH (for dev/test purposes only; this is not a supported configuration). We assume familiarity with Linux (especially CentOS) commands, installation, and configuration.

System Setup and Configuration

Components

Listed below are the specs of our test Hadoop cluster.


Installed Hardware


Installed Software

These installation commands are specific to CentOS. If you do not log in as ‘root’, you must use sudo for all of the commands.

  • Update CentOS packages (yum update).
  • Install the latest version of Java, preferably 1.7 or later (yum install java-1.8.0-openjdk-devel).
  • Install Git (yum install git).
  • Install Node.js and npm (yum install nodejs npm).
  • Bower (installed via npm).
  • Install Apache Maven – refer to these steps for installation.
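
Taken together, the package installs look roughly like this (a sketch, assuming Node.js and npm are available from your configured repositories; on stock CentOS this typically requires EPEL):

    sudo yum update
    sudo yum install -y java-1.8.0-openjdk-devel git nodejs npm
    # Bower is distributed through npm
    sudo npm install -g bower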

Important Note: When you are working in a corporate environment, you need to set the proxies for Git, npm, and Bower individually, along with Maven.

Setting Proxies
  • For Git
  • For npm
  • For Bower
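
Hedged examples of each setting; proxy.example.com:8080 is a placeholder for your corporate proxy:

    # Git
    git config --global http.proxy http://proxy.example.com:8080
    git config --global https.proxy http://proxy.example.com:8080

    # npm
    npm config set proxy http://proxy.example.com:8080
    npm config set https-proxy http://proxy.example.com:8080

    # Bower: add to ~/.bowerrc
    {
      "proxy": "http://proxy.example.com:8080",
      "https-proxy": "http://proxy.example.com:8080"
    }

    # Maven: add a <proxy> entry to ~/.m2/settings.xml
    <proxies>
      <proxy>
        <id>corporate</id>
        <active>true</active>
        <protocol>http</protocol>
        <host>proxy.example.com</host>
        <port>8080</port>
      </proxy>
    </proxies>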
Building Zeppelin Binaries
  • Download and extract the latest version of Apache Zeppelin from GitHub.
  • Now cd into the extracted incubator-zeppelin-master directory.
  • The current versions of CDH, Hadoop, and Spark are: CDH 5.4.0, Spark 1.3.0, and Hadoop 2.6.0.
  • Run the Maven command to build Zeppelin locally, OR the Maven command to build Zeppelin for YARN (all Spark queries are then tracked in the YARN history). Example commands are sketched after this list.

    Profiles included:

    -Pspark-1.3: installs Spark framework support for Zeppelin

    -Ppyspark: installs all configurations required to run the pyspark interpreter in Zeppelin

    -Phadoop-2.6: installs Hadoop version support for Zeppelin
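
A sketch of the two build commands, based on the invocation quoted in the comments below; adjust -Dhadoop.version to match your exact CDH release (2.6.0-cdh5.4.0 here is an assumption), and note that every flag uses a plain hyphen-minus:

    # Local build
    mvn clean package -Pspark-1.3 -Ppyspark -Phadoop-2.6 -Dhadoop.version=2.6.0-cdh5.4.0 -DskipTests

    # YARN build (Spark queries tracked in the YARN history)
    mvn clean package -Pspark-1.3 -Ppyspark -Phadoop-2.6 -Pyarn -Dhadoop.version=2.6.0-cdh5.4.0 -DskipTests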

Once the build is successful, continue with the configuration.

General Configuration of Zeppelin
  • To access the Hive metastore, copy hive-site.xml from the HIVE_HOME/conf folder into the ZEPPELIN_HOME/conf folder (where HIVE_HOME and ZEPPELIN_HOME refer to the install locations of the respective software).
  • In the ZEPPELIN_HOME/conf folder, duplicate zeppelin-env.sh.template and rename it to zeppelin-env.sh.
  • In the ZEPPELIN_HOME/conf folder, duplicate zeppelin-site.xml.template and rename it to zeppelin-site.xml.
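
For example, assuming HIVE_HOME and ZEPPELIN_HOME are set in your shell:

    cp $HIVE_HOME/conf/hive-site.xml $ZEPPELIN_HOME/conf/
    cd $ZEPPELIN_HOME/conf
    cp zeppelin-env.sh.template zeppelin-env.sh
    cp zeppelin-site.xml.template zeppelin-site.xml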
YARN Configuration of Zeppelin

If you have built the binaries for YARN, set the master property for the Spark interpreter, i.e., master=yarn-client, via the Zeppelin UI (Interpreter tab).

  • In the Zeppelin conf directory, edit the zeppelin-env.sh file: uncomment the export HADOOP_CONF_DIR line and specify the directory containing the yarn-site.xml file (e.g., export HADOOP_CONF_DIR=/etc/hadoop/conf).
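
A minimal zeppelin-env.sh sketch for the YARN case (the MASTER line mirrors the yarn-client interpreter setting above):

    export HADOOP_CONF_DIR=/etc/hadoop/conf
    export MASTER=yarn-client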

Start Zeppelin: ./bin/zeppelin-daemon.sh start

(Note: Sometimes you may not be able to run the above command. In that case, make all the scripts in the bin folder executable with the following command:

chmod -R 777 bin/)

After this, try the previous command again to start Zeppelin.

And now you can access your notebook at http://localhost:8080 or http://host.ip.address:8080.

Stop Zeppelin: ./bin/zeppelin-daemon.sh stop

Testing

    1. Start the Zeppelin application (./bin/zeppelin-daemon.sh start) and access http://localhost:8080 (or the IP address of the node it is installed on).
    2. If you already have data in the Apache Hive metastore that is accessible via hive commands locally, you can test Zeppelin against it. Use the %hive interpreter to access the Hive metastore and list all available databases. In this example we already have some public genome databases available in our Hive metastore. If you do not have any data in your Hive metastore, you may want to load some data before starting this test, or skip to Step 4. Now, type these commands in the notebook:
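
      For example (a minimal snippet; show databases is the standard HiveQL listing statement):

      %hive
      show databases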

      The code snippet is echoed back and the code-execution output is displayed.

    3. To display the tables in a specific database, such as "wellderly", type these commands in the notebook:
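
      For example:

      %hive
      show tables in wellderly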

Again, the code snippet is echoed back and the code-execution output is displayed.

    4. Download the test dataset (education.csv) and place it in an HDFS location. Using the Scala interpreter, register a table based on the .csv file in HDFS, with a code snippet like the one below. Note: The Scala (Spark) interpreter is the default, so unlike with Hive (%hive), no interpreter prefix needs to be specified in Zeppelin.
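
      A minimal Scala sketch of the registration step; the HDFS path and the two-column schema are assumptions, since the post does not show education.csv’s actual layout:

      // Register education.csv as a temp table (hypothetical schema)
      import sqlContext.implicits._
      val raw = sc.textFile("hdfs:///tmp/education.csv")  // assumed upload location
      case class Education(name: String, score: Double)
      val education = raw.map(_.split(","))
                         .map(f => Education(f(0), f(1).trim.toDouble))
                         .toDF()
      education.registerTempTable("education")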

After that, run the command below:
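
For example, a representative query against the table registered above, using the %sql interpreter:

%sql
select * from education limit 10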

You have now installed and configured Zeppelin correctly and tested the installation successfully. Documentation for Zeppelin is available here.

Sharing a Notebook

    1. If you want to share these notebook results with another user, you can simply send the URL of your notebook to that user. (That user must have access to the server node and cluster on which you created your notebook.) That user can not only view all your queries but also run them to view the results.
    2. If you want to share only the results without any queries (report-mode), please follow these steps:
      1. Go to the top-right corner of the Zeppelin window, where you will see a dropdown list after the settings icon.
      2. Change it from default to report. In this mode, only the results can be viewed, without the queries.
      3. Copy the URL and share it with others (who have access to the server node and cluster).

      4. Three modes are available to share your notebooks:
        1. Default – In this mode, the notebook can be edited by anyone who has access to it (edit queries and re-run them to display different results).
        2. Simple – This mode is similar to default; the only difference is that the available options are hidden and become visible only when you hover your mouse over a cell. This mode gives a cleaner view of the results when shared.
        3. Report – When this mode is enabled, only the final results are visible (read-only). The notebook cannot be edited.

Conclusion

Apache Zeppelin is clearly still in the incubator stage, but it shows promise as a notebook that is not tied to a particular platform, tool, or programming language. Our intent here was to demonstrate how you can install Apache Zeppelin on your own system and start experimenting with its many capabilities. In the future, we want to use Zeppelin for exploratory data analysis and also write more interpreters for it to improve its visualization capabilities, e.g., by incorporating Google Charts and similar tools.

10 responses on “How-to: Install Apache Zeppelin on CDH”

  1. Karthik Vadla

    Hi Alex,

    Another way is to use the latest version of Maven, 3.3.
    That should fix this.

    Thanks
    Karthik Vadla

      1. Santhosh Nair

        You can share Hive/Impala metadata with the Spark sqlContext and execute queries against them using standard Spark SQL. If you need Impala queries specifically, JDBC might be the choice.

  2. malouke

    Hello,
    Thank you for the post, but I have some issues with the install.
    I tried to use Zeppelin on CDH 5.5.1 (a corporate server with 15 nodes, on YARN). I did as you propose:
    mvn clean package -Pspark-1.5 -Ppyspark -Dhadoop.version=2.6.0-cdh5.5.1 -Phadoop-2.6 -Pyarn -DskipTests (ok, works!)
    and set the env variables:
    export HADOOP_CONF_DIR=/etc/hadoop/conf:/etc/hive/conf
    export HADOOP_HOME=/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p1254.1026/bin/../lib/hadoop
    export SPARK_HOME=/opt/cloudera/parcels/CDH-5.5.1-1.cdh5.5.1.p1254.1026/lib/spark
    export MASTER=yarn-client
    export SPARK_SUBMIT_OPTIONS="--conf spark.driver.port=54321 --conf spark.fileserver.port=54322 --conf spark.blockManager.port=54323 --deploy-mode client --master yarn --num-executors 2 --executor-memory 2g"
    export JAVA_HOME=/usr/lib/jvm/j2sdk1.7-oracle
    export PYSPARK_PYTHON=/opt/cloudera/extras/python27/bin/python
    export PYSPARK_DRIVER_PYTHON=/opt/cloudera/extras/python27/bin/python
    export PATH=$PATH:/opt/cloudera/extras/python27/bin/

    In the YARN manager I see Running, but the only interpreter I can use is Spark, like this:
    %spark
    sc (ok, works)

    I want to use pyspark for my job with %pyspark, but I get an error like “%pyspark not value”, and %sql gives the same error.
    Please help. Thanks in advance.

  3. Brandon Strader

    The line:
    mvn clean package -Pspark-1.3 -Ppyspark -Dhadoop.version=2.6.0-cdh5.4.2 -Phadoop-2.6 -Pyarn –DskipTests

    The switch in “–DskipTests” actually uses a dash (longer; not on keyboards) instead of a hyphen-minus (the key to the right of the 0 key).

  4. Rahul Jain

    I was having a problem with the mvn option below; it was not picking up the Cloudera repo, and the HBase version was not compatible with CDH 5.4.1:
    mvn clean package -Pspark-1.3 -Ppyspark -Dhadoop.version=2.6.0-cdh5.4.1 -Phadoop-2.6 -DskipTests

    Everything started working after I made three changes:
    1) Updated the HBase version in hbase/pom.xml:
    1.0.0-cdh5.4.1
    2.6.0-cdh5.4.1
    2) Changed -DskipTests to -Dmaven.test.skip=true, as suggested by Alex Ott
    3) Included the vendor-repo profile in my mvn call to use the Cloudera repo:
    mvn clean package -Pvendor-repo -Pspark-1.3 -Ppyspark -Dhadoop.version=2.6.0-cdh5.4.1 -Phadoop-2.6 -Dmaven.test.skip=true

    Thanks
    Rahul Jain

  5. Jeff Turn

    Thanks go to Rahul for his suggested command. Similarly, using -Pvendor-repo helped me build my binaries successfully. Looking forward to Zeppelin’s maturity!