Neil Kodner, an independent consultant, is the guest author of this post. At Hadoop World 2010, Neil made a spur-of-the-moment decision to capture the #hw2010 streaming Twitter feed, and the analysis that follows grew out of it.
During the Hadoop World 2010 keynote, a majority of attendees were typing away on their laptops as Mike Olson and Tim O’Reilly dazzled the audience. Many of these laptop users appeared to be tweeting as the keynote was taking place. Since I have more than a passing interest in Twitter, Hadoop, and text mining, I thought it would be a great idea to track and store everyone’s Hadoop World tweets.
A bit about myself: I’ve been using Hadoop for a little over a year, mostly for personal projects, on a 4-node cluster consisting of an iMac, a Mac Mini server, and two laptops – commodity hardware in the truest sense. On this cluster I have nearly 800 million tweets, and I use a combination of Pig, Hive, and Python Streaming to find interesting things within the data. Some results may be found on my site or posted to my Twitter account.
During the keynote, I quickly created an Amazon EC2 Micro instance, tapped into the Twitter Streaming API, and began downloading tweets containing the hashtag #hw2010.
After filtering out a few Halloween tweets (get it? #hw2010?), about 1,500 tweets remained, respectable for a one-day event. Here are some key findings from the data:
The most prolific tweeter (twitter-er?) was disruptive engineer Mathias Herberts (@herberts) with an astonishing 214 tweets, 14% of all #hw2010 tweets. Of Mathias’s 214 tweets, 205 were his own and 9 were retweets. The next most prolific tweeters were Ryu Kobayashi (@ryu_kobayashi) with 54 tweets and Chris Shain (@chrisshain) with 39.
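Per-user counts like these can be produced with a small tally. The sketch below is illustrative only: it assumes each tweet is a (user, text) pair and that a retweet is any tweet whose text starts with "RT @", which may not match how the original analysis separated retweets from original tweets.

```python
from collections import Counter

def tweet_counts(tweets):
    """Tally original tweets and retweets per user.

    `tweets` is an iterable of (user, text) pairs. Both the record layout and
    the "RT @" prefix rule are assumptions made for this example.
    """
    originals, retweets = Counter(), Counter()
    for user, text in tweets:
        if text.lstrip().upper().startswith("RT @"):
            retweets[user] += 1
        else:
            originals[user] += 1
    return originals, retweets

# Made-up sample data, not tweets from the event.
sample = [
    ("alice", "Keynote is starting #hw2010"),
    ("alice", "RT @bob: HBase talk was great #hw2010"),
    ("bob", "HBase talk was great #hw2010"),
]
originals, retweets = tweet_counts(sample)
totals = originals + retweets  # total tweets per user
```

Calling `totals.most_common(3)` would then list the three most prolific users.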
Among users with at least 10 tweets, the most frequent tweeter was Yukio Uematsu (@alfyukio), whose MTBT (mean time between tweets) was 5.6 minutes over his 28 tweets. Wayne Eckerson (@weckerson) was next with a tweet every 7 minutes, followed by Chris Shain with 39 tweets, just under ten minutes apart.
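One plausible way to compute MTBT is to average the gaps between a user's consecutive tweets. The sketch below assumes (user, timestamp) pairs as input; the exact definition used for the numbers above is not stated, so treat this as one reasonable reading rather than the original method.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def mean_time_between_tweets(tweets, min_tweets=10):
    """Return {user: mean gap in minutes} for users with enough tweets.

    `tweets` is an iterable of (user, datetime) pairs; this layout is an
    assumption for illustration.
    """
    times_by_user = defaultdict(list)
    for user, ts in tweets:
        times_by_user[user].append(ts)

    mtbt = {}
    for user, times in times_by_user.items():
        # At least two tweets are needed for a gap to exist.
        if len(times) < max(min_tweets, 2):
            continue
        times.sort()
        # The mean of consecutive gaps equals (last - first) / (n - 1).
        span_minutes = (times[-1] - times[0]).total_seconds() / 60
        mtbt[user] = span_minutes / (len(times) - 1)
    return mtbt

# Made-up data: one user tweeting every 5 minutes, ten times.
start = datetime(2010, 1, 1, 9, 0)  # arbitrary start time
sample = [("alice", start + timedelta(minutes=5 * i)) for i in range(10)]
result = mean_time_between_tweets(sample)
```

Sorting the resulting dictionary by value, ascending, gives the ranking of most frequent tweeters.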
Christopher Gillett wrote the tweet with the greatest number of retweets (14). His tweet about all Hadoop World sessions ending with “we’re hiring” also happens to be the tweet with the greatest longevity, having had its final retweet nearly 41 hours after it was originally posted. Taking Christopher’s follower count plus the follower counts of all who retweeted his “we’re hiring” tweet, this message was potentially seen by over 15,000 people. The second-most retweeted tweet came from Chad Metcalf (@metcalfc), quoting Cloudera CEO Mike Olson’s “You no longer have to load the gun and hand it to Oracle.” It was retweeted 9 times and was potentially seen in over 5,000 Twitter streams.
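The "potentially seen by" figure is a sum of follower counts for the author and every retweeter. A minimal sketch follows, with a hypothetical `follower_counts` lookup; because followers of different accounts overlap, the result is an upper bound on the number of distinct people reached.

```python
def potential_reach(author, retweeters, follower_counts):
    """Sum follower counts for the author and everyone who retweeted.

    `follower_counts` maps a username to its follower count (a hypothetical
    lookup for this example). Each account is counted once even if it
    retweeted more than once; unknown accounts contribute zero.
    """
    accounts = {author, *retweeters}
    return sum(follower_counts.get(account, 0) for account in accounts)

# Made-up follower counts, not figures from the event.
follower_counts = {"alice": 100, "bob": 50, "carol": 25}
reach = potential_reach("alice", ["bob", "carol", "bob"], follower_counts)
```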
Cloudera employees seemed to be spreading the Hadoop World excitement and took the top three positions on the retweets list. Patrick Hunt (@phunt) retweeted 29 tweets, Cloudera Founder/CTO Amr Awadallah (@awadallah) retweeted 15, and Josh Patterson (@jpatanooga) sent 14 retweets.
Many of the Hadoop World 2010 tweets made reference to other users. The most frequently mentioned users were @cloudera with 58 mentions, Tim O’Reilly (@timoreilly) with 47, and Philip “Flip” Kromer (@mrflip) with 32.
The hashtags used along with #hw2010 were no surprise – #hadoop, #hbase, and #bigdata ranked first, second, and third, respectively, among the #hw2010 sibling hashtags. Further down the list were #cloudera, #rstats, and #lovemyceo. Hadoop itself was mentioned in just under a third of the tweets, data was mentioned in nearly a quarter of the tweets, Cloudera was mentioned in just over ten percent, and Hadoop World came in at just under ten percent. Other frequently mentioned topics were HBase with 159 mentions, Twitter with 85, Facebook with 65, Flume with 41, Hive with 39, and analytics with 22.
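Mention and sibling-hashtag counts of this kind can be extracted with regular expressions. The following is a rough sketch over plain tweet text; the patterns are simplified assumptions (for example, they would also match the part after "@" in an email address) and are not the rules Twitter itself uses.

```python
import re
from collections import Counter

HASHTAG = re.compile(r"#(\w+)")
MENTION = re.compile(r"@(\w+)")

def sibling_hashtags(texts, event_tag="hw2010"):
    """Count hashtags that appear in the same tweet as the event hashtag."""
    counts = Counter()
    for text in texts:
        # Use a set so a tag repeated within one tweet is counted once.
        tags = {tag.lower() for tag in HASHTAG.findall(text)}
        if event_tag in tags:
            counts.update(tags - {event_tag})
    return counts

def mention_counts(texts):
    """Count @username mentions across all tweets, case-insensitively."""
    counts = Counter()
    for text in texts:
        counts.update(name.lower() for name in MENTION.findall(text))
    return counts

# Made-up sample tweets for illustration.
texts = [
    "Great keynote #hw2010 #hadoop @cloudera",
    "#HW2010 #hbase #hadoop rocks",
    "unrelated #hadoop",
]
siblings = sibling_hashtags(texts)
mentions = mention_counts(texts)
```

Here `siblings` counts #hadoop twice and #hbase once, since the third tweet lacks the event hashtag.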
Moving on to the social graph of everyone who used the #hw2010 hashtag: I distilled all of the tweets into a list of Twitter user IDs. With a huge assist from Tony Hirst (@psychemedia), who isn’t nearly as constrained by the Twitter API rate limits as I am, a file was created showing all of the user-to-user relationships. The file, which I promptly loaded into Gephi, represented a directed network containing 411 nodes and 4,664 edges. The network’s diameter, the greatest shortest-path distance between any two nodes, is 9, and the average path length, or average distance between any two nodes, is just below 3.
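For readers without Gephi, diameter and average path length can be approximated with a breadth-first search from every node. This sketch assumes the graph is given as an adjacency dictionary and considers only ordered pairs where a directed path exists; Gephi's own handling of unreachable pairs may differ.

```python
from collections import deque

def bfs_distances(adj, source):
    """Shortest hop counts from `source` to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist

def diameter_and_average_path_length(adj):
    """Return (diameter, average path length) over reachable ordered pairs.

    `adj` maps each user to the users they follow (an assumed layout).
    """
    nodes = set(adj) | {v for targets in adj.values() for v in targets}
    longest, total, pairs = 0, 0, 0
    for source in nodes:
        for target, d in bfs_distances(adj, source).items():
            if target == source:
                continue
            longest = max(longest, d)
            total += d
            pairs += 1
    average = total / pairs if pairs else 0.0
    return longest, average

# Toy directed cycle a -> b -> c -> a.
adj = {"a": ["b"], "b": ["c"], "c": ["a"]}
diameter, avg_len = diameter_and_average_path_length(adj)  # (2, 1.5)
```

Running one search per node is fine for a graph of a few hundred nodes like this one, though it would be slow on very large networks.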
Out of the 411 users, 168 follow Cloudera, 143 follow Tim O’Reilly, 137 follow Cloudera’s Jeff Hammerbacher (@hackingdata), 95 follow Twitter’s Kevin Weil (@kevinweil), and 89 follow Cloudera’s Mike Olson (@mikeolson).
On the flip side, Jeff Hammerbacher follows the greatest number of #hw2010 tweeters, with 135; Ray George (@rgeorge28) follows 94, Cloudera’s Omer Trajman (@otrajman) follows 84, and Youngwoo Kim (@youngwookim) follows 76. Interestingly enough, only 4 #hw2010 tweeters follow Youngwoo Kim back. I’m assuming that’s because Youngwoo probably reads English better than the rest of us read Korean!
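In graph terms, "who is followed most" is in-degree and "who follows most" is out-degree. A small sketch, assuming the relationships file has been read into (follower, followed) pairs; that edge direction is an assumption about the file's layout.

```python
from collections import Counter

def degree_counts(edges):
    """Return (in_degree, out_degree) Counters for a follower graph.

    `edges` is an iterable of (follower, followed) pairs; duplicate pairs
    are ignored.
    """
    in_degree, out_degree = Counter(), Counter()
    for follower, followed in set(edges):
        out_degree[follower] += 1
        in_degree[followed] += 1
    return in_degree, out_degree

def followed_back_by(edges, user):
    """Return the accounts that `user` follows and that follow `user` back."""
    edge_set = set(edges)
    return {v for (u, v) in edge_set if u == user and (v, u) in edge_set}

# Toy graph: a follows b and c, b follows c, c follows a.
edges = [("a", "c"), ("b", "c"), ("c", "a"), ("a", "b")]
in_degree, out_degree = degree_counts(edges)
mutual = followed_back_by(edges, "a")  # {"c"}
```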
Jeff Hammerbacher was the most-connected user overall in the social graph, with an average distance to all other nodes of 1.784. Omer Trajman and Ray George were next, both having an average distance of just over 2.
Finally, in terms of importance within the network itself, Tim O’Reilly led in PageRank, roughly the probability that a random walk over the graph arrives at a given node, followed by Cloudera, Jeff Hammerbacher, and Kevin Weil.
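PageRank can be sketched as a simple power iteration. The version below is a generic textbook formulation, not Gephi's implementation: the damping factor of 0.85, the uniform handling of nodes with no outgoing edges, and the adjacency-dictionary input are all assumptions made for this example.

```python
def pagerank(adj, damping=0.85, iterations=100, tol=1e-10):
    """Compute PageRank scores by power iteration.

    `adj` maps each user to the users they follow, so rank flows from a
    follower to the accounts it follows. Scores sum to 1.
    """
    nodes = sorted(set(adj) | {v for targets in adj.values() for v in targets})
    n = len(nodes)
    if n == 0:
        return {}
    # Drop duplicate targets while keeping order.
    out_links = {u: list(dict.fromkeys(adj.get(u, ()))) for u in nodes}
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outgoing edges is spread evenly.
        dangling = sum(rank[u] for u in nodes if not out_links[u])
        new_rank = {u: (1.0 - damping) / n + damping * dangling / n for u in nodes}
        for u in nodes:
            if out_links[u]:
                share = damping * rank[u] / len(out_links[u])
                for v in out_links[u]:
                    new_rank[v] += share
        delta = sum(abs(new_rank[u] - rank[u]) for u in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank

# Toy graph: a and b both follow c, and c follows a.
scores = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
ranking = sorted(scores, key=scores.get, reverse=True)  # ["c", "a", "b"]
```

In the toy graph, c ranks highest because two accounts follow it, and a outranks b because it is followed by the highly ranked c.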
Based on my findings from the Hadoop World Twitter stream, the event was well received by its attendees and generated much publicity not only for Hadoop but also for Cloudera. Based purely on the number of Twitter mentions, the two most engaging sessions were the keynote featuring Mike Olson and Tim O’Reilly, followed by Flip Kromer’s breakout session, “Millionfold Mashups”.
My key takeaway from Hadoop World 2010 was an appreciation for how essential HBase is to the Hadoop ecosystem. HBase was mentioned in more than one out of every ten tweets, and I can now see why!