Grouping Related Trends with Hadoop and Hive

Categories: Community General Hadoop Hive

(guest blog post by Pete Skomoroch)

In a previous post, I outlined how to build a basic trend tracking site called with Cloudera’s Distribution for Hadoop and Hive.  TrendingTopics uses Hadoop to identify the top articles trending on Wikipedia and displays related news stories and charts.  The data powering the site was pulled from an Amazon EBS Wikipedia Public Dataset containing 8 months of hourly pageview logfiles.  In addition to the pageview logs, the EBS data volume also includes the full text content and link graph for all articles.  This post will use that link graph data to build a new feature for our site: grouping related articles together under a single “lead trend” to ensure the homepage isn’t dominated by a single news story.

Finding Related Trends Using Wikipedia Link Graph Data

For several weeks this summer, the death of Michael Jackson dominated the news and drove a large number of pageviews on Wikipedia.   The hourly data in the following chart was downloaded from during the last week in June:

This burst of interest in the Jackson family, Michael’s albums, and his medical conditions pushed most topics unrelated to MJ off the front page of our site:

Ideally, TrendingTopics would continue to show a range of stories when this type of event happens.  Automated news sites like Techmeme or Google News display clusters of rising articles across a number of topics to provide a better mix of content on the homepage.   For example, Techmeme shows a ranked list of related blog posts that link back to the lead article for each story:

We can build a simple version of this functionality with Hadoop by combining the article trend estimates we computed in the previous post with the Wikipedia link graph.  Wikipedia provides a periodic database dump which includes a file with the outgoing pagelinks for each article.  The June data dump includes a 12GB pagelink file named “enwiki-20090618-pagelinks.sql” containing ~ 13K SQL insert statements:

The first step in our data preparation uses a Hadoop Streaming job to convert the Mediawiki SQL insert format into a tab delimited text file ready for further processing with Hive.  The job uses a Python mapper called “” to convert the SQL insert statements into a tab delimited format.  The “” script uses a regular expression to extract the graph edges from each SQL statement and emits tab-delimited record if the page_id belongs to namespace 0 (this indicates the page is an article).

You can test this script at the unix command line on a single machine before running it on Hadoop.  Here we pipe a sample of the file into the mapper script and then examine the results with grep.  The first column of the resulting output is the Wikipedia page_id and the second column contains other Wikipedia article titles linked to by that page_id.

We ran this script on the entire pagelinks file using a temporary Cloudera Hadoop cluster on Amazon EC2, and saved the output of the conversion process in an Amazon S3 folder called “links”.  To work with these tab delimited link files on a new Hadoop cluster, we use distcp on the master node to pull the output data into HDFS from S3.   We also pull in an archived export of our trendingtopics “pages” data which contains the page_id -> title mapping and monthly trend score for each article:

We will use a set of Hive queries on the link graph data to generate the ranked list of top trending articles linking back to each Wikipedia.  After our data is in hdfs, we can start the Hive CLI on the Hadoop master node:

In the Hive shell, we create a “links” table and load the raw data from hdfs:

Now we have the “outlink” data for each page in Hive, but we want to reverse the keys so that we can find all trending page titles which link back to a given page_id.  We can map the original page_id’s to article titles and pull in the trend data using a Hive JOIN:

A second join is used to convert the original outlink titles to page_ids for use in our web application’s MySQL database:

The final stage of processing involves generating an array of the top trending backlink urls for each Wikipedia page_id.  Later on, we can export the entire lookup table “backlinks_reduced” to MySQL for use in the trendingtopics Rails app.  We use a Hive transform written in Python to rank the backlinks and concatenate the top 10 titles into a comma delimited text string for each page_id:

The streaming mapper and reducer we used are shown below:

To examine the results, we run a join query against the pages table to find the top trending backlinks for the Wikipedia article on “Swine Flu”:

This looks like a reasonable list of related pages.  Here are a few more trending topics from the same time period:

At this point, we have what looks like a reasonable approach for generating related links with a low amount of effort.  The Rails application can apply filters to remove titles like “August_15” from each list and display the related links around each trend.

Here are a few ideas for next steps with the Wikipedia link graph data and Hadoop:

  • Show related articles by using the backlink ids as a feature vector to compute article similarity
  • Use Pagerank to filter articles displayed in trends (see Running PageRank on Wikipedia)
  • Compute TFIDF weights for Wikipedia articles and incorporate it into the similarity calculation

If you are interested in digging deeper into the link data with Python and Hadoop, you should also check out Viraj Bhat and Jake Hofman’s talk “Cool Development Projects at Yahoo!: Automatic Tuning and Social Graph Analysis” next week at Hadoop World NYC.  I’ll also giving a talk that afternoon called “Building Data Intensive Apps: A closer look at”


6 responses on “Grouping Related Trends with Hadoop and Hive

  1. Lalit Kapoor

    Pete, this is a great post. I am glad that you organized it so well. Thanks for sharing your work. This is a really cool project to replicate if you want to learn a bit and get your hands dirty.

  2. Pete Skomoroch


    That final SELECT into backlinks_reduced took 330 seconds, which ran the streaming code on 2,445,704 rows. The example Hive selects from backlinks_reduced take around 160 seconds. For use in a web application, I would export those final key-value pairs to MySQL and index by page_id or load them into a non-RDBMS datastore. Hive is best for offline batch operations where you need to run against all the data or a certain partition.


  3. Pete Skomoroch

    Small correction found when looking at the query timings: the related links displayed are actually from another table that finds the “mutual links”, pages that are both outlinks and backlinks. That requirement gives a more narrow set of backlink pages that are also pointed to by the target page:

    SELECT x,y,z
    FROM backlinks JOIN LINKS ON (links.page_id = backlinks.page_id)
    WHERE links.pl_title = backlinks.bl_title;