Demo: Analyzing Data with Hue and Hive

In the first installment of the demo series about Hue — the open source Web UI that makes Apache Hadoop easier to use — you learned how file operations are simplified via the File Browser application. In this installment, we’ll focus on analyzing data with Hue, using Apache Hive via Hue’s Beeswax and Catalog applications (based on Hue 2.3 and later).

The Yelp Dataset Challenge provides a good use case. This post explains, through a video and tutorial, how you can get started doing some analysis and exploration of Yelp data with Hue. The goal is to find the coolest restaurants in Phoenix!

Dataset Challenge with Hue

The demo below demonstrates how the “business” and “review” datasets are cleaned and then converted to a Hive table before being queried with SQL.

 

Now, let’s step through a tutorial based on this demo. The queries and scripts are available on GitHub.

Getting Started & Normalization

First, get the dataset from the Yelp Challenge webpage. Then, clean the data using this script.

  1. Retrieve the data and extract it.

     

  2. Convert it to TSV.

     

  3. The following column headers will be printed by the above script.

     

Create the Tables

Next, create the Hive tables with the “Create a new table from a file” screen in the Catalog app or Beeswax “Tables” tab.


Creating a new table

Upload the data files yelp_academic_dataset_business_clean.json and yelp_academic_dataset_review_clean.json. Hue will then guess the tab separator and then lets you name each column of the tables. (Tip: in Hue 2.3, you can paste the column names in directly.)


Naming columns

You can then see the table and browse it.


Browsing the table

Queries

Open up Hue’s Hive editor (Beeswax) and run one of these queries:

Top 25: business with most of the reviews

 

Top 25: coolest restaurants

 


Query editor with SQL syntax highlighting and auto-complete
 


Watch the query runs
 


See the results with an infinite scroll

 

Now let your imagination run wild and execute some of your own queries!

Note: This demo is about doing some quick data analytics and exploration. Running more machine learning oriented jobs like the Yelp Examples would deserve a separate blog post on how to run MrJob. Hue users would need to create an Apache Oozie workflow with a Shell action (see below). Notice that a ‘mapred’ user would need to be created first in the User Admin.

Running MrJob Wordcount example in the Oozie app with a Shell action

What’s Next

As you can see, getting started with data analysis is simple with the interactive Hive query editor and Table browser in Hue.

Moreover, all the SELECT queries can also be performed in Hue’s Cloudera Impala application for a real-time experience. Obviously, you would need more data than the sample for doing a fair comparison but the improved interactivity is noticeable.

In upcoming episodes, you’ll see how to use Apache Pig for doing a similar data analysis, and how Oozie can glue everything together in schedulable workflows.

Thank you for watching and hurry up, only one month before the end of the Yelp contest!

Romain Rigaux is a Software Engineer working on the Platform team.

Filed under:

8 Responses
  • epb / September 06, 2013 / 7:32 AM

    Is this still working with the current versions? In the query

    SELECT name, review_count
    FROM business
    ORDER BY review_count DESC
    LIMIT 25

    the “order by” modifier had no effect (results returned in random order). I used all the current versions of CDH, Hive, Cloudera Manager, etc. on an EC2 instance following the above instructions.

  • Romain Rigaux / September 06, 2013 / 8:03 AM

    Yes this is working.

    Could you share the result page?
    Also, SELECT * from business?

    Maybe coninue the discusion on http://groups.google.com/a/cloudera.org/group/hue-user

  • Koh / October 01, 2013 / 1:58 AM

    Hi,

    I encountered 2 issues while following the above example.

    1. When uploading the yelp_academic_dataset_review_clean.json file, it was renamed with a .tmp extension. As such i could not successfully create the table. Is it due to the file size? Mine is about 170mb.

    2. When performing the query Top 25: business with most of the reviews, the result i had were all null values

    Pls advise.

    Thanks you!

  • Koh / October 01, 2013 / 2:00 AM

    Just to add,

    I’m using the cloudera quickstart vm 4.3.0

  • Romain Rigaux / October 01, 2013 / 8:57 AM

    1. I think it is a knonw issue on the VM. The current workaround is to upload it with the ‘hadoop fs -put’ command: https://groups.google.com/a/cloudera.org/forum/#!searchin/hue-user/upload$20tmp

    2. This is probably due to this bug: https://issues.cloudera.org/browse/HUE-1303. A simple 2-line fix is provided. Could you try it?

    In the meantime I am looking when the CDH4.4 VM will be released.

  • Rengarajan / October 04, 2013 / 11:13 PM

    Hi,

    I could not successfully install Hue with all the latest from Apache website. Although, I use hadoop 1.x.x, I can run rest of the eco system, such Hbase, hive etc. This is possibly due to python related issues.

    Since the demo illustrates the power of hive as well, I used Hive shell to continue. Here are
    some observations.

    1) Business table creation and the loading of the data followed by the execution of the query was smooth :)

    2) While creating the “review” table I followed the list produced by the “convert.py” and the table creation failed. A little reading of the possible keywords of Hive showed that “date” is a reserved keyword and caused the failure. Once the field name was changed, I could create the table.

    Loaded the data successfly.

    However, execution of the query resulted in an error message. Following is the CLI output. Any idea how to fix this.

    SELECT r.business_id, name, SUM(cool) AS coolness FROM review r JOIN business b ON (r.business_id = b.business_id) WHERE categories LIKE ‘%Restaurants%’ GROUP BY r.business_id, name ORDER BY coolness DESC LIMIT 25;
    FAILED: SemanticException [Error 10014]: Line 1:121 Wrong arguments ”%Restaurants%”: No matching method for class org.apache.hadoop.hive.ql.udf.UDFLike with (array, string). Possible choices: _FUNC_(string, string)

    thanks
    Renga

  • Rengarajan / October 05, 2013 / 12:30 AM

    Hi,

    My bad. Some level of debugging and googling pointed to the definition of business table. I had defined the “categories” field as ARRAY instead of STRING.

    Once I created “business” table with the modified definition (after dropping the existing table) and loaded the data, the join query started working just fine. It took couple of minutes for 3 jobs and displayed the desired result at the end.

    Sorry to have bothered. Anyway thanks.

    Next task. Let me fix the “hue” installation issues and try the same using hue.

  • Romain Rigaux / October 07, 2013 / 12:29 PM

    Glad to hear and let us know how it goes!

Leave a comment


+ seven = 9