How-to: Use HUE’s Notebook App with SQL and Apache Spark for Analytics

Categories: How-to Hue Spark

This post from the HUE team about using HUE (the open source web GUI for Apache Hadoop), Apache Spark, and SQL for analytics was initially published in the HUE project’s blog.

Apache Spark is growing in popularity, and HUE contributors are working to make it accessible to even more users, specifically by creating a web interface that lets anyone with a browser type some Spark code and execute it. A Spark submission REST API was built for this purpose and can also be used directly by developers.

In a previous post, we demonstrated how to use HUE’s Search app to seamlessly index and visualize trip data from Bay Area Bike Share and use Spark to supplement that analysis by adding weather data to our dashboard. In this post, we’ll use HUE’s Notebook app to take a deeper look at the peak usage of the Bay Area Bike Share (BABS) system.

To start, download the latest data set from http://www.bayareabikeshare.com/datachallenge. This post uses the data from August 2013 through February 2014.

Importing CSV Data with the Metastore App

The BABS data set consists of four CSV files with data for stations, trips, rebalancing (availability), and weather. Using HUE’s Metastore import wizard, we can easily import these data sets and create tables whose schema is inferred from the CSV header.

hue-spark-f1

hue-spark-f2

The import wizard also provides the opportunity to override any field names or types, which we’ll do for the Trip data to change the “duration” field from a TINYINT to an INT.

hue-spark-f3
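To see why the override matters: TINYINT is a one-byte signed integer, so it can only hold values from -128 to 127, while trip durations can be far larger. The short sketch below checks whether a set of values fits a given integer type. The sample durations are made up, and the idea that the wizard guesses a type from the first few rows is an assumption on our part, not something documented here.

```python
# Signed ranges of the standard SQL integer types used by Hive and Impala.
INT_RANGES = {
    "TINYINT": (-2**7, 2**7 - 1),
    "SMALLINT": (-2**15, 2**15 - 1),
    "INT": (-2**31, 2**31 - 1),
}

def fits(type_name, values):
    """Return True if every value is inside the range of type_name."""
    low, high = INT_RANGES[type_name]
    return all(low <= v <= high for v in values)

# Hypothetical durations: the first rows are short, later rows are not.
first_rows = [63, 70, 71, 77, 83]
all_rows = first_rows + [765, 1036, 1083]

tinyint_ok_for_sample = fits("TINYINT", first_rows)  # the sample alone looks safe
tinyint_ok_for_all = fits("TINYINT", all_rows)       # the full column is not
int_ok_for_all = fits("INT", all_rows)

print(tinyint_ok_for_sample, tinyint_ok_for_all, int_ok_for_all)
```

If a column were typed from a sample like first_rows, later values would not fit, which is why widening the field to INT up front is the safer choice.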

Interactive Analysis with a Hadoop Notebook

Now that we’ve imported the data into our cluster, we can create a new Notebook to perform our data crunching. To start, we’ll run some quick exploration queries using Impala.

Let’s find the top 10 most popular start stations based on the trip data:

hue-spark-f4

Once our results are returned, we can easily visualize this data; a bar graph works nicely for a simple COUNT..GROUP BY query.

hue-spark-f5
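The exact query is not reproduced in the text, so here is a minimal sketch of a COUNT ... GROUP BY query of that shape. It runs against an in-memory SQLite database from Python only so that it is self-contained; in the notebook, the SQL would go into an Impala snippet. The table name trips, the column names, and the sample rows are assumptions for illustration, not the actual schema.

```python
import sqlite3

# In-memory stand-in for the imported trip table (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (startstation TEXT, endstation TEXT, duration INTEGER)")
conn.executemany("INSERT INTO trips VALUES (?, ?, ?)", [
    ("San Francisco Caltrain (Townsend at 4th)", "Market at Sansome", 600),
    ("San Francisco Caltrain (Townsend at 4th)", "2nd at Townsend", 240),
    ("San Francisco Caltrain (Townsend at 4th)", "Market at Sansome", 660),
    ("Harry Bridges Plaza (Ferry Building)", "Embarcadero at Sansome", 300),
    ("Harry Bridges Plaza (Ferry Building)", "2nd at Townsend", 480),
    ("Market at Sansome", "2nd at Townsend", 420),
])

# Count trips per start station and keep the ten busiest.
TOP_START_STATIONS = """
SELECT startstation, COUNT(*) AS num_trips
FROM trips
GROUP BY startstation
ORDER BY num_trips DESC
LIMIT 10
"""

top_stations = conn.execute(TOP_START_STATIONS).fetchall()
for station, num_trips in top_stations:
    print(station, num_trips)
```

The result has one label column and one numeric column, which is the shape a bar graph needs.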

It seems that the San Francisco Caltrain (Townsend at 4th) was by far the most common start station. Let’s determine which end stations, for trips starting from the SF Caltrain Townsend station, were the most popular. We’ll fetch the latitude and longitude coordinates so that we can visualize the results on a map.

hue-spark-f6

The map visualization indicates that the most popular trips starting from the SF Caltrain station are in fairly close proximity to the station, with most of the destinations being clustered around the Financial District and SOMA.
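A query of this kind joins the trip table to the station table to pick up coordinates for each end station. The sketch below shows one way to write it, again using in-memory SQLite from Python as a stand-in for an Impala snippet. The tables, the column names (including lat and lng), and the coordinates are placeholders we made up for illustration.

```python
import sqlite3

CALTRAIN = "San Francisco Caltrain (Townsend at 4th)"

# Hypothetical schemas and sample rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (startstation TEXT, endstation TEXT)")
conn.execute("CREATE TABLE stations (name TEXT, lat REAL, lng REAL)")
conn.executemany("INSERT INTO stations VALUES (?, ?, ?)", [
    ("Harry Bridges Plaza (Ferry Building)", 37.795, -122.394),
    ("Market at Sansome", 37.789, -122.400),
])
conn.executemany("INSERT INTO trips VALUES (?, ?)", [
    (CALTRAIN, "Harry Bridges Plaza (Ferry Building)"),
    (CALTRAIN, "Harry Bridges Plaza (Ferry Building)"),
    (CALTRAIN, "Market at Sansome"),
    ("Market at Sansome", "Harry Bridges Plaza (Ferry Building)"),
])

# Count trips per end station for one start station, with coordinates
# attached so the result can be plotted on a map.
END_STATIONS = """
SELECT s.name, s.lat, s.lng, COUNT(*) AS num_trips
FROM trips t
JOIN stations s ON s.name = t.endstation
WHERE t.startstation = ?
GROUP BY s.name, s.lat, s.lng
ORDER BY num_trips DESC
"""

end_stations = conn.execute(END_STATIONS, (CALTRAIN,)).fetchall()
for name, lat, lng, num_trips in end_stations:
    print(name, lat, lng, num_trips)
```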

Long Running Queries with Hive

For longer-running SQL queries, or queries that require use of Apache Hive’s built-in functions, we can add a Hive snippet to our notebook to perform this analysis.

Let’s say we wanted to dig further into the trip data for the SF Caltrain station and find the total number of trips and average duration (in minutes) of those trips, grouped by hour.

Since the trip data stores startdate as a STRING, we’ll need to apply some string manipulation to extract the hour within an inline subquery. The outer query will then aggregate the count of trips and the average duration for each hour.

Since this query produces several numeric dimensions, we can visualize the results using a scatterplot graph, with the hour as the x-axis, the number of trips as the y-axis, and the average duration as the size of each point.

hue-spark-f7
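The inner/outer query structure described above can be sketched as follows. This is an illustration rather than the exact Hive query: it runs on in-memory SQLite from Python, so the string functions (substr, instr) are SQLite's and would need to be swapped for their Hive equivalents in a Hive snippet. We also assume that startdate looks like '8/29/2013 8:05' and that duration is stored in seconds; the sample rows are made up.

```python
import sqlite3

CALTRAIN = "San Francisco Caltrain (Townsend at 4th)"

# Hypothetical trip rows: station, start date as a string, duration in seconds.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (startstation TEXT, startdate TEXT, duration INTEGER)")
conn.executemany("INSERT INTO trips VALUES (?, ?, ?)", [
    (CALTRAIN, "8/29/2013 8:05", 600),
    (CALTRAIN, "8/29/2013 8:40", 300),
    (CALTRAIN, "8/30/2013 17:15", 900),
    ("Market at 4th", "8/29/2013 8:10", 120),
])

# Inner query: cut the hour out of the date string (the text between the
# space and the first colon). Outer query: aggregate per hour.
TRIPS_BY_HOUR = """
SELECT hour, COUNT(*) AS num_trips, AVG(duration) / 60.0 AS avg_minutes
FROM (
    SELECT CAST(substr(startdate,
                       instr(startdate, ' ') + 1,
                       instr(startdate, ':') - instr(startdate, ' ') - 1)
                AS INTEGER) AS hour,
           duration
    FROM trips
    WHERE startstation = ?
) AS t
GROUP BY hour
ORDER BY hour
"""

trips_by_hour = conn.execute(TRIPS_BY_HOUR, (CALTRAIN,)).fetchall()
for hour, num_trips, avg_minutes in trips_by_hour:
    print(hour, num_trips, avg_minutes)
```

Each result row carries three numbers (hour, trip count, average minutes), which map onto the x-axis, y-axis, and point size of the scatterplot.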

Let’s add another Hive snippet to analyze an hour-by-hour breakdown of availability at the SF Caltrain Station:

We’ll visualize the results as a line graph, which indicates that the bike availability tends to fall starting at 6am and is regained around 6pm.

hue-spark-f8
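An hour-by-hour availability breakdown like this one boils down to averaging the number of available bikes per hour of the day. Here is a minimal sketch, once more on in-memory SQLite from Python instead of Hive. The rebalancing table layout, the station id 70, the timestamp format ('2013/08/29 05:00:01'), and the sample rows are all assumptions for illustration.

```python
import sqlite3

STATION_ID = 70  # hypothetical id for the SF Caltrain station

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE rebalancing (station_id INTEGER, bikes_available INTEGER, time TEXT)"
)
conn.executemany("INSERT INTO rebalancing VALUES (?, ?, ?)", [
    (70, 15, "2013/08/29 05:00:01"),
    (70, 17, "2013/08/30 05:00:02"),
    (70, 6, "2013/08/29 09:00:01"),
    (70, 4, "2013/08/30 09:00:01"),
    (2, 10, "2013/08/29 05:00:01"),
])

# With a fixed-width timestamp, the hour is the two characters at position 12.
AVAILABILITY_BY_HOUR = """
SELECT CAST(substr(time, 12, 2) AS INTEGER) AS hour,
       AVG(bikes_available) AS avg_bikes
FROM rebalancing
WHERE station_id = ?
GROUP BY hour
ORDER BY hour
"""

availability = conn.execute(AVAILABILITY_BY_HOUR, (STATION_ID,)).fetchall()
for hour, avg_bikes in availability:
    print(hour, avg_bikes)
```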

Robust Data Analysis with PySpark

At a certain point, your data analysis may exceed the limits of relational analysis with SQL or require a more expressive, full-fledged API.

HUE’s Spark notebooks allow users to mix exploratory SQL analysis with custom Scala, Python (PySpark), and R code that uses the Spark API.

For example, we can open a PySpark snippet, load the trip data directly from the Hive warehouse, and apply a sequence of filter, map, and reduceByKey operations to determine the average number of trips starting from the SF Caltrain Station:

hue-spark-f9
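To make the filter, map, and reduceByKey sequence concrete without needing a cluster, here is a plain-Python analogue of that pipeline. It is not PySpark code: reduce_by_key is a small local helper we wrote to mimic what RDD.reduceByKey does, and the comments name the RDD call each step corresponds to. The CSV layout, the sample lines, and the choice to average trips per day are our assumptions.

```python
STATION = "San Francisco Caltrain (Townsend at 4th)"

# Hypothetical CSV lines: trip id, duration, start date, start station.
lines = [
    "4576,63,8/29/2013 14:13,San Francisco Caltrain (Townsend at 4th)",
    "4607,70,8/29/2013 14:42,San Francisco Caltrain (Townsend at 4th)",
    "4130,71,8/29/2013 10:16,Market at 4th",
    "4251,77,8/29/2013 11:29,San Francisco Caltrain (Townsend at 4th)",
    "4299,83,8/30/2013 12:02,San Francisco Caltrain (Townsend at 4th)",
]

def reduce_by_key(pairs, func):
    """Local stand-in for RDD.reduceByKey: fold the values of each key with func."""
    acc = {}
    for key, value in pairs:
        acc[key] = func(acc[key], value) if key in acc else value
    return acc

rows = map(lambda line: line.split(","), lines)              # like rdd.map
caltrain = filter(lambda row: row[3] == STATION, rows)       # like rdd.filter
day_ones = map(lambda row: (row[2].split(" ")[0], 1), caltrain)  # (day, 1) pairs
trips_per_day = reduce_by_key(day_ones, lambda a, b: a + b)  # like rdd.reduceByKey

# Average number of trips per day that start at the station.
average_trips = sum(trips_per_day.values()) / len(trips_per_day)
print(trips_per_day, average_trips)
```

In an actual PySpark snippet the same lambdas would be passed to the corresponding RDD methods, with the lines coming from the table's files in the Hive warehouse rather than a Python list.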

The video version of this tutorial is available below:

Conclusion

As you can see, HUE’s Notebook app enables easy interactive data analysis and visualization with a powerful mix of tools. To learn more about the Spark Notebook work, read about Livy, the Spark REST job server, and come see us at the upcoming Spark Summit in Amsterdam! The current version is in beta, and v1 is under consideration for a future CDH release; give us your thoughts in the comments!

Stay tuned for a number of exciting improvements to the notebook app, and as usual feel free to comment on the hue-user list or @gethue!


10 responses on “How-to: Use HUE’s Notebook App with SQL and Apache Spark for Analytics”

    1. Romain

      Yes, cf. the questions above: the Livy server has been there since Hue 3.8/CDH5.4, but it is in beta and blacklisted in CDH. CDH 5.5 has a better version, but upstream Hue is recommended until CDH5.7 (~Q1-Q2 2016).

    1. Romain

      Livy uses the same --proxy-user option as spark-submit and the other Spark shells, so each notebook is executed with the privileges of its user. This is one of the nice advantages of Livy: it supports impersonation for proper security.
