Hadoop World 2010 Tweet Analysis
Neil Kodner, an independent consultant, is the guest author of this post. At Hadoop World 2010, Neil's spur-of-the-moment decision to capture the #hw2010 streaming Twitter feed inspired the analysis that follows.
During the Hadoop World 2010 keynote, a majority of attendees were typing away on their laptops as Mike Olson and Tim O’Reilly dazzled the audience. Many of these laptop users appeared to be tweeting as the keynote was taking place. Since I have more than a passing interest in Twitter, Hadoop, and text mining, I thought it would be a great idea to track and store everyone’s Hadoop World tweets.
A bit about myself: I’ve been using Hadoop for a little over a year, mostly for personal projects, on a 4-node cluster consisting of an iMac, a Mac Mini server, and two laptops – commodity hardware of the truest form. On this cluster I have nearly 800 million tweets, and I use a combination of Pig, Hive, and Python Streaming to find interesting things within the data. Some results may be found on my site or posted to my Twitter account.
During the keynote, I quickly created an Amazon Micro EC2 instance, tapped into the Twitter Streaming API, and began downloading tweets containing the hashtag #hw2010.
After filtering out a few Halloween tweets (get it? #hw2010?), about 1,500 tweets remained, respectable for a one-day event. Here are some key findings from the data:
The most prolific tweeter (twitter-er?) was ‘disruptive engineer’ Mathias Herberts (@herberts) with an astonishing 214 tweets, 14% of all #hw2010 tweets. Out of Mathias’ 214 tweets, 205 were his own tweets and 9 were retweets. The next most prolific tweeters were Ryu Kobayashi (@ryu_kobayashi) with 54 tweets and Chris Shain (@chrisshain) with 39.
Among users with at least 10 tweets, the most frequent tweeter was Yukio Uematsu (@alfyukio), whose MTBT (mean time between tweets) was 5.6 minutes over his 28 tweets. Wayne Eckerson (@weckerson) was next with a tweet every 7 minutes, followed by Chris Shain, whose 39 tweets came just under ten minutes apart.
Christopher Gillett wrote the tweet with the greatest number of retweets (14). His tweet about all Hadoop World sessions ending with “we’re hiring” also happens to be the tweet with the greatest longevity, having had its final retweet nearly 41 hours after it was originally posted. Taking Christopher’s follower count plus the follower counts of all who retweeted his “we’re hiring” tweet, this message was potentially seen by over 15,000 people. The second-most retweeted tweet came from Chad Metcalf (@metcalfc), quoting Cloudera CEO Mike Olson’s “You no longer have to load the gun and hand it to Oracle.” It was retweeted 9 times and was potentially seen in over 5,000 Twitter streams.
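As a rough illustration of the MTBT figure, the sketch below takes the time between a user's first and last tweet and divides it by the number of gaps between tweets. The post does not say exactly how MTBT was computed, so this definition, the function name `mean_time_between_tweets`, and the `(screen_name, datetime)` input shape are all assumptions.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def mean_time_between_tweets(tweets, min_tweets=10):
    """Return {user: mean gap in minutes} for users with enough tweets.

    `tweets` is an iterable of (screen_name, datetime) pairs. This input
    shape is assumed for the sketch; it is not taken from the post.
    """
    times_by_user = defaultdict(list)
    for user, created_at in tweets:
        times_by_user[user].append(created_at)

    threshold = max(min_tweets, 2)  # at least two tweets are needed for one gap
    result = {}
    for user, times in times_by_user.items():
        if len(times) < threshold:
            continue
        times.sort()
        # The mean of consecutive gaps equals (last - first) / (count - 1).
        span_minutes = (times[-1] - times[0]).total_seconds() / 60
        result[user] = span_minutes / (len(times) - 1)
    return result
```

The minimum-tweet threshold matters: without it, a user who tweeted twice in one minute would top the list.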
Cloudera employees seemed to be spreading the Hadoop World excitement and took the top three positions on the retweets list. Patrick Hunt (@phunt) retweeted 29 tweets, Cloudera Founder/CTO Amr Awadallah (@awadallah) retweeted 15, and Josh Patterson (@jpatanooga) sent 14 retweets.
Many of the Hadoop World 2010 tweets made reference to other users. The most frequently mentioned users were @cloudera with 58 mentions, Tim O’Reilly (@timoreilly) with 47, and Philip ‘Flip’ Kromer (@mrflip) with 32.
The hashtags that were used along with #hw2010 were no surprise – #hadoop, #hbase, and #bigdata ranked first, second, and third among the #hw2010 sibling hashtags. Further down the list were #cloudera, #rstats, and #lovemyceo. ‘Hadoop’ itself was mentioned in just under a third of the tweets, ‘data’ in nearly a quarter, ‘Cloudera’ in just over ten percent, and ‘Hadoop World’ in just under ten percent. Other frequently mentioned topics were ‘HBase’ with 159 mentions, ‘Twitter’ with 85, ‘Facebook’ with 65, ‘Flume’ with 41, ‘Hive’ with 39, and ‘analytics’ with 22.
Moving on to the social graph of everyone who used the #hw2010 hashtag: I distilled all of the tweets into a list of Twitter user IDs. With a huge assist from Tony Hirst (@psychemedia), who isn’t nearly as constrained by the Twitter API rate limits as I am, a file was created showing all of the user-to-user relationships. The file, which I promptly loaded into Gephi, represented a directed network containing 411 nodes and 4,664 edges. The network’s diameter, the longest of the shortest paths between any two nodes, was 9, and the average path length, or average shortest-path distance between any two nodes, was just below 3.
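Counting sibling hashtags can be sketched in a few lines of Python. This is a minimal illustration rather than the Pig, Hive, or streaming job behind the numbers above; the function name `sibling_hashtags`, the regular expression, and the choice to count each hashtag once per tweet are assumptions.

```python
import re
from collections import Counter

# Simplified hashtag pattern: '#' followed by word characters (assumed).
HASHTAG = re.compile(r"#(\w+)")


def sibling_hashtags(texts, primary="hw2010"):
    """Count hashtags that appear in the same tweet as `primary`."""
    counts = Counter()
    for text in texts:
        # Lowercase and de-duplicate so each tag counts once per tweet.
        tags = {tag.lower() for tag in HASHTAG.findall(text)}
        if primary in tags:
            tags.discard(primary)
            counts.update(tags)
    return counts
```

Calling `sibling_hashtags(texts).most_common(3)` on the tweet texts would then give the three most frequent sibling hashtags.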
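For readers without Gephi, the two graph statistics can be approximated with a breadth-first search from every node. The sketch below assumes the relationship file has been parsed into `(follower, followed)` pairs and measures only pairs that are actually reachable from one another. Whether Gephi treats unreachable pairs the same way is not something the post states, and the function names are hypothetical.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest hop counts from `source` to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def diameter_and_average_path_length(edges):
    """Return (diameter, average path length) over reachable ordered pairs."""
    adj = {}
    nodes = set()
    for src, dst in edges:
        adj.setdefault(src, set()).add(dst)
        nodes.update((src, dst))

    longest, total, pairs = 0, 0, 0
    for node in nodes:
        for target, d in bfs_distances(adj, node).items():
            if target != node:
                longest = max(longest, d)
                total += d
                pairs += 1
    return longest, (total / pairs if pairs else 0.0)
```

With 411 nodes and 4,664 edges, running one search per node is cheap, so nothing more elaborate is needed at this scale.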
Out of the 411 users, 168 of them follow Cloudera, 143 follow Tim O’Reilly, 137 follow Cloudera’s Jeff Hammerbacher (@hackingdata), 95 follow Twitter’s Kevin Weil (@kevinweil), and 89 follow Cloudera’s Mike Olson (@mikeolson).
On the flipside, Jeff Hammerbacher follows the greatest number of #hw2010 tweeters, with 135; Ray George (@rgeorge28) follows 94, Cloudera’s Omer Trajman (@otrajman) follows 84, and Youngwoo Kim (@youngwookim) follows 76. Interestingly enough, only 4 #hw2010 tweeters follow Youngwoo Kim back. I’m assuming that’s because Youngwoo probably reads English better than the rest of us read Korean!
Jeff Hammerbacher was the overall most-connected user in terms of the social graph with an average distance to all other nodes of 1.784. Omer Trajman and Ray George were next, both having an average distance of just over 2.
Finally, in terms of importance within the network itself, Tim O’Reilly led in PageRank, the probability of arriving at a given node, followed by Cloudera, Jeff Hammerbacher, and Kevin Weil.
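To make the PageRank idea concrete, here is a minimal power-iteration sketch over the same kind of `(follower, followed)` edge list. The rankings in this post came from Gephi, not from this code; the damping factor of 0.85, the fixed iteration count, and the even redistribution of rank from accounts that follow nobody are common defaults assumed here.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Approximate PageRank for a directed graph given as (src, dst) pairs."""
    nodes = set()
    out_links = {}
    for src, dst in edges:
        nodes.update((src, dst))
        out_links.setdefault(src, set()).add(dst)
    if not nodes:
        return {}

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outgoing edges is spread evenly.
        dangling = sum(rank[node] for node in nodes if node not in out_links)
        base = (1 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        # Each node passes its damped rank equally to the nodes it follows.
        for src, targets in out_links.items():
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank
```

Because an edge points from follower to followed, rank flows toward accounts that many well-followed users follow, which fits Tim O’Reilly and Cloudera landing at the top.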
Based on my findings with the Hadoop World Twitter stream, the event was well received by its attendees and generated much publicity not only for Hadoop but also for Cloudera. Based purely on the number of Twitter mentions, the two most engaging sessions were the keynote featuring Mike Olson and Tim O’Reilly, followed by Flip Kromer’s breakout session, “Millionfold Mashups”.
My key takeaway from Hadoop World 2010 was an appreciation for how essential HBase is to the Hadoop ecosystem. HBase was mentioned in more than one out of every ten tweets, and I can now see why!