Using Apache Hadoop to Measure Influence
- by Binh Tran and Hiral Patel
- May 15, 2011
Klout’s goal is to be the standard for influence. The advent of social media has created a huge number of measurable relationships. On Facebook, people have an average of 130 friends. On Twitter, the average number of followers ranges from 300+ to 1000+. With each relationship comes a different source of data. This has created A LOT of noise and an attention economy. Influence has the power to drive this attention.
When a company, brand, or person creates content, our goal is to measure the actions on that content. We want to measure every view, click, like, share, comment, retweet, mention, vote, check-in, recommendation, and so on. We want to know how influential the person who *acted* on that content is. We want to know the actual meaning of that content. And we want to know all of this, over time.
Measuring influence is a bit like trying to measure an emotion like hate or jealousy. It’s really hard and takes a boatload of data.
A huge part of what we do is develop machine learning models that make sense of this data. On top of that, there’s an endless amount of this data and we need a platform to ingest, prepare, and analyze it.
The two biggest platforms are Facebook and Twitter, but social media hardly ends there: there’s LinkedIn, Foursquare, Path, YouTube, Quora, and many others. This presents the challenge of creating models for each platform and building data analysis platforms that can handle unstructured data.
To handle this at Klout, we’ve turned to open source technologies such as Apache Hadoop. Specifically, we turned to Cloudera’s CDH3 distribution due to ease of installation and availability of enterprise support.
Twitter was the natural choice for our first network to analyze, thanks to the open nature of its data as well as the simple set of actions you can take on Twitter, such as a mention or a retweet.
However, as our models matured, Twitter itself kept growing. As of this post, our Twitter cluster has the following stats:
- 75 million people scored daily
- 4 billion graph edges scored daily
- 48 million people are influenced by or influence an average of 27 people
- We derive hundreds of thousands of distinct topics in which 14 million users are influential
- On average, 5 topics per user, derived using NLP and semantic analysis
- For topics, we analyze 3 months of mentions and retweets, currently over 6 billion of them
Twitter Analytics Overview
From the Twitter firehose, data is written to disk in buffered chunks. A MapReduce job handles the task of preparing the firehose data into the different buckets needed for each of the workflows. These workflows serve different products, from bot and spam detection to scoring to topic extraction.
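As a rough illustration of that bucketing step, here is a minimal plain-Java sketch that groups raw records by action type into per-workflow buckets. The tab-separated record format and the names here are hypothetical, not our actual job, which runs as a MapReduce job over the firehose dumps:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Illustrative sketch only: group raw firehose records into the buckets
// each downstream workflow consumes. A record is assumed (hypothetically)
// to look like "actionType\tpayload".
class FirehoseBucketer {
    public static Map<String, List<String>> bucket(List<String> records) {
        return records.stream()
                .map(r -> r.split("\t", 2))
                .filter(parts -> parts.length == 2)
                .collect(Collectors.groupingBy(
                        parts -> parts[0],                    // key: action type
                        Collectors.mapping(parts -> parts[1], // value: payload
                                Collectors.toList())));
    }
}
```

In the real pipeline the grouping key would be emitted by the mappers and the bucketing would happen in the shuffle, but the logical transformation is the same.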
Many of our MapReduce jobs are written in Java, but we also rely on Pig Latin for tasks such as simple joins, population aggregates, and statistics.
Oozie is used to coordinate the different workflow components. To serve data both internally and externally, we dump out raw CSV files or load the data into HBase, which interfaces with our load-balanced API servers.
Twitter Scoring Workflow
We use a machine learning and statistics-based approach to scoring. The model currently has over 35 features. The scoring workflow consists of several Oozie jobs, many of which perform feature extraction. The final jobs of the workflow feed all of the features into the scoring model, which produces the scores.
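To make that final step concrete, here is a deliberately simplified sketch: a plain weighted sum over hypothetical feature names and weights. Our actual model is machine-learned with over 35 features and is not reproduced here:

```java
import java.util.Map;

// Illustrative sketch only: combine extracted features into a single score.
// The feature names and weights are hypothetical stand-ins, not the model.
class ScoringModel {
    private final Map<String, Double> weights;

    ScoringModel(Map<String, Double> weights) {
        this.weights = weights;
    }

    // Weighted sum over whatever features the upstream jobs extracted;
    // features the model has no weight for contribute nothing.
    public double score(Map<String, Double> features) {
        double s = 0.0;
        for (Map.Entry<String, Double> e : features.entrySet()) {
            s += weights.getOrDefault(e.getKey(), 0.0) * e.getValue();
        }
        return s;
    }
}
```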
We’ve experimented with Mahout in the past and we will be using more of it in the future.
Having a highly available API is one of our key goals. However, when we refresh 75 million scores plus metadata daily, it becomes challenging to flip a switch and make all the new data available at once. This led us to run multiple HBase clusters. When one cluster is loading data, the load-balanced API servers, which are aware of each cluster’s status, switch to the cluster that is not loading. This also mitigates any performance issues caused by region splits and minor and major compactions, and it has helped us cope with instabilities caused by cloud instances in unpredictable states.
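The switching logic can be sketched as follows; the cluster names and the loading flag are hypothetical stand-ins for the status the API servers actually track:

```java
import java.util.List;

// Illustrative sketch only: route reads to whichever HBase cluster is not
// currently loading a fresh batch of scores.
class ClusterRouter {
    static final class Cluster {
        final String name;
        volatile boolean loading;

        Cluster(String name, boolean loading) {
            this.name = name;
            this.loading = loading;
        }
    }

    // Pick the first cluster that is serving (i.e., not loading) traffic.
    public static Cluster pickServing(List<Cluster> clusters) {
        for (Cluster c : clusters) {
            if (!c.loading) {
                return c;
            }
        }
        throw new IllegalStateException("no cluster available to serve reads");
    }
}
```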
That said, we are in the process of building out our own servers and racks at a nearby facility. We have also had issues where the edit logs for our NameNodes got corrupted due to server instabilities. This is where Cloudera came to our rescue. We initially had to apply patches manually and build hadoop-core JARs ourselves to resolve such problems, but with Cloudera’s Distribution including Apache Hadoop and the help of their expert Solution Architects, this is no longer an issue. We are now able to focus our resources on our products.