Demo: Analyzing Data with Hue and Hive
In the first installment of the demo series about Hue — the open source Web UI that makes Apache Hadoop easier to use — you learned how file operations are simplified via the File Browser application. In this installment, we’ll focus on analyzing data with Hue, using Apache Hive via Hue’s Beeswax and Catalog applications (based on Hue 2.3 and later).
The Yelp Dataset Challenge provides a good use case. This post explains, through a video and tutorial, how you can get started doing some analysis and exploration of Yelp data with Hue. The goal is to find the coolest restaurants in Phoenix!
Dataset Challenge with Hue
The video below shows how the “business” and “review” datasets are cleaned and then converted into Hive tables before being queried with SQL.
Now, let’s step through a tutorial based on this demo. The queries and scripts are available on GitHub.
Getting Started & Normalization
- Retrieve the data and extract it.
tar -xvf yelp_phoenix_academic_dataset.tar
cd yelp_phoenix_academic_dataset
wget https://raw.github.com/romainr/yelp-data-analysis/master/convert.py

yelp_phoenix_academic_dataset$ ls
convert.py notes.txt READ_FIRST-Phoenix_Academic_Dataset_Agreement-3-11-13.pdf yelp_academic_dataset_business.json yelp_academic_dataset_checkin.json yelp_academic_dataset_review.json yelp_academic_dataset_user.json
- Convert it to TSV.
chmod +x convert.py
./convert.py
- The following column headers will be printed by the above script.
["city", "review_count", "name", "neighborhoods", "type", "business_id", "full_address", "state", "longitude", "stars", "latitude", "open", "categories"]
["funny", "useful", "cool", "user_id", "review_id", "text", "business_id", "stars", "date", "type"]
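The conversion step can be sketched roughly as follows. This is not the actual convert.py from the repository; it is a minimal illustration that assumes each input line holds one JSON object, and the names `clean_value` and `json_lines_to_tsv` are hypothetical. It flattens every record into one tab-separated row and replaces tabs and newlines inside values so that each record stays on a single line.

```python
import json


def clean_value(value):
    """Render one JSON value as a single-line, tab-free string."""
    if isinstance(value, (list, dict)):
        # Nested values (for example "categories") are kept as compact JSON text.
        value = json.dumps(value)
    text = str(value)
    return text.replace("\t", " ").replace("\r", " ").replace("\n", " ")


def json_lines_to_tsv(lines, columns=None):
    """Convert an iterable of JSON-object lines into (columns, tsv_rows).

    If no column order is given, the key order of the first record is used.
    """
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if columns is None:
            columns = list(record.keys())
        rows.append("\t".join(clean_value(record.get(col, "")) for col in columns))
    return columns, rows


if __name__ == "__main__":
    # Made-up record, only to show the shape of the output.
    sample = ['{"name": "Some Diner", "stars": 4.5, "categories": ["Restaurants"]}']
    header, tsv_rows = json_lines_to_tsv(sample)
    print(header)
    print(tsv_rows[0])
```

Printing the column list once, as the demo does, is what makes it easy to paste the headers into Hue when the tables are created.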
Create the Tables
Next, create the Hive tables with the “Create a new table from a file” screen in the Catalog app or Beeswax “Tables” tab.
Upload the data files yelp_academic_dataset_business_clean.json and yelp_academic_dataset_review_clean.json (tab-separated, despite the .json extension). Hue then guesses the tab separator and lets you name each column of the tables. (Tip: in Hue 2.3, you can paste the column names in directly.)
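How Hue detects the separator is not covered in this post. Purely as an illustration of the idea, the sketch below guesses a delimiter by looking for a candidate character that appears the same non-zero number of times on every sampled line. The helper name `guess_delimiter` and the candidate list are assumptions made for this example, not Hue code.

```python
def guess_delimiter(sample_lines, candidates=("\t", ",", ";", "|")):
    """Return the candidate that splits every line into the same number of
    fields, preferring the one that yields the most fields; None if no
    candidate is consistent."""
    best, best_count = None, 0
    for delim in candidates:
        # Collect how often the delimiter occurs on each non-empty line.
        counts = {line.count(delim) for line in sample_lines if line}
        if len(counts) == 1:
            count = counts.pop()
            if count > best_count:
                best, best_count = delim, count
    return best


if __name__ == "__main__":
    # Made-up lines: commas appear inconsistently, tabs consistently.
    lines = ["Phoenix\t12\tJoe's, the Diner", "Phoenix\t3\tCafe"]
    print(repr(guess_delimiter(lines)))
```

Free-text fields such as review text often contain commas, which is one reason a tab-separated export is convenient for this dataset.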
You can then see the table and browse it.
Open up Hue’s Hive editor (Beeswax) and run one of these queries:
Top 25: businesses with the most reviews
SELECT name, review_count
FROM business
ORDER BY review_count DESC
LIMIT 25
Top 25: coolest restaurants
SELECT r.business_id, name, SUM(cool) AS coolness
FROM review r JOIN business b
ON (r.business_id = b.business_id)
WHERE categories LIKE '%Restaurants%'
GROUP BY r.business_id, name
ORDER BY coolness DESC
LIMIT 25
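To see what the join and aggregation in the “coolest restaurants” query compute, the same statement can be tried on a handful of rows outside the cluster. The sketch below uses Python's built-in sqlite3 module as a stand-in for Hive. The sample rows are invented for illustration and do not come from the Yelp dataset, only the columns the query touches are created, and the function name `coolest_restaurants` is hypothetical.

```python
import sqlite3

# Same statement as in the tutorial; SQLite accepts it unchanged.
QUERY = """
SELECT r.business_id, name, SUM(cool) AS coolness
FROM review r JOIN business b
ON (r.business_id = b.business_id)
WHERE categories LIKE '%Restaurants%'
GROUP BY r.business_id, name
ORDER BY coolness DESC
LIMIT 25
"""


def coolest_restaurants(businesses, reviews):
    """Load toy rows into an in-memory database and run the query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE business (business_id TEXT, name TEXT, categories TEXT)")
    conn.execute("CREATE TABLE review (business_id TEXT, cool INTEGER)")
    conn.executemany("INSERT INTO business VALUES (?, ?, ?)", businesses)
    conn.executemany("INSERT INTO review VALUES (?, ?)", reviews)
    rows = conn.execute(QUERY).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    # Invented sample data: two restaurants and one non-restaurant.
    businesses = [
        ("b1", "Taco Place", '["Mexican", "Restaurants"]'),
        ("b2", "Book Shop", '["Books"]'),
        ("b3", "Noodle Bar", '["Restaurants"]'),
    ]
    reviews = [("b1", 2), ("b1", 3), ("b2", 10), ("b3", 1)]
    for row in coolest_restaurants(businesses, reviews):
        print(row)
```

The book shop has the highest “cool” total in this toy data but is filtered out by the `LIKE '%Restaurants%'` condition, which shows that the category filter is applied before the per-business votes are summed and ranked.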
Now let your imagination run wild and execute some of your own queries!
Note: This demo is about quick data analytics and exploration. Running more machine-learning-oriented jobs like the Yelp Examples would deserve a separate blog post on how to run MrJob. Hue users would need to create an Apache Oozie workflow with a Shell action (see below). Note that a ‘mapred’ user would need to be created first in the User Admin.
Running MrJob Wordcount example in the Oozie app with a Shell action
As you can see, getting started with data analysis is simple with the interactive Hive query editor and Table browser in Hue.
Moreover, all the SELECT queries can also be run in Hue’s Cloudera Impala application for a real-time experience. You would need more data than this sample for a fair comparison, but the improved interactivity is noticeable.
In upcoming episodes, you’ll see how to use Apache Pig for doing a similar data analysis, and how Oozie can glue everything together in schedulable workflows.
Thank you for watching, and hurry up: only one month remains before the end of the Yelp contest!
Romain Rigaux is a Software Engineer working on the Platform team.