A Summer Internship with Cloudera
As the summer comes to a close it is time to thank and congratulate all of our summer interns for all their hard-work and contributions. This post was written by Lisa Chen, as she describes her Cloudera experience, and the development of Sledgehammer.
When I first came to Cloudera, I only had a very vague notion of what I’d be working on during my eleven week internship. Alex Loddengard, who was my mentor and guidance throughout the internship, had told me early on that I’d probably be building a tool used for community status reporting and lead generation. The details of what I would be implementing was left up to me to decide.
So I began asking around within the company, for what people might like to see. I quickly learned that Jeff Hammerbacher, a co-founder of Cloudera, sends emails regularly to the sales team regarding potential customer leads at all hours of the day. The general consensus was that, it would be really cool if I could build something that automates some of what Jeff does, by crawling a few of the sites that he frequents. (Of course, nothing I could build in 11 weeks or so would even come close to the extent of the work Jeff actually does, but I could try, right?)
From there, I had some idea of what I wanted to build, and next I needed to figure out how to build it. In the first two weeks of my internship, I attended two training sessions on Hadoop and development by Sarah, read about half of Tom’s book, and learned how to use git, Hadoop, Flume, and write map-reduce code.
With all the new knowledge I learned, I spent the next several weeks implementing the first stages of the project, until I eventually had something that could run without crashing every other time, and I could submit my code for review. When reviews came back to me, I made changes as needed, committed the code, and moved onto something else. Most of the time, it meant back to step one: figuring out what I wanted to build next. It was like an infinite cycle, and even now, Sledgehammer still has a long ways to go to maximize it’s full potential.
So, what does Sledgehammer actually do? Well, right now it polls data from the Apache Hadoop and Hbase archives and sub-archives, and outputs a list of potential customers in comma separated value format, in order of potential. And this potential is based off a scoring system that I’ll go into with more depth later. The results list is integrate-able with Sales Force. It also creates two other reports, one each for mbox files (from the Apache mailing lists), and twitter. All these reports together provide a rich mass of data that can be back-referenced, or used for other purposes. It’s really easy to look up more information about someone if you want to know more about them by googling some part or all of their provided info.
Scoring is based off of whatever the user provides, and how active they are where, and the age of their activity. For example, if someone uses their company email address to send mail to the relevant Apache mailing lists, they’ll get +50 points for providing company information. Someone that uses twitter gets +15 points for each of the following fields, if provided : biography, website, location. If someone has both a company name and a twitter username, that means that this person is active in both places, and gets a huge bonus of 2500 points, to propel them higher on the priority list. A person also gets 3 points for every tweet, and 3 for each mail sent, and the total of everything is the final score that they are ranked by. Of course right now, the results are far from ideal. HDFS has data from 2007 until the present from the Apache mailing lists, and from Twitter, it only has data from about five weeks ago, when I learned how to use Flume. That means right now priority is skewed heavily in favor of mailing list users, but over time it should even out, and more people will have a more complete information set in the final report.
Now that I’ve covered the basics of Sledgehammer and what it does, I’ll elaborate more on the technical parts of the project. Sledgehammer currently basically works in four distinct phases: locate a source, retrieve data into HDFS, parse and process the retrieved data, and output the results in a consumable format. I’ve currently written 19 JAVA source files for it up to date, and the entire project can be built from the root directory of the repository with ant. In total, the project consists of 3 map-reduce jobs, and one custom made crawler that fetches mbox files, and Twitter feeds are fetched with Flume. Sledgehammer is currently running on the customers.sf.cloudera.com cluster, which is also accessible through HUE.
The MboxCrawler that I wrote for Sledgehammer is a tool that finds and downloads all relevant mbox archives to a location I specify in HDFS. It begins at the http://mail-archives.apache.org/mod_mbox/, the general mailing list archive home page, where it picks out and builds the URLs of Hadoop and Hbase related archives and sub-archives. From there it polite queries the source code of each of the sub-archives into a String object, and parses the String to build the URL for the actual .mbox file for each archive Finally it downloads the source of the .mbox files into HDFS. The MboxCrawler avoids redundancy each run-through, by comparing the size of an existing file in HDFS with the size of the new file it may or may not want to pull, and if the new file is larger, the crawler will replace the existing file with the new one. Otherwise, it just skips it.
Twitter is polled much more simply. All I had to do, was figure out which tags I wanted (and I picked Hadoop, Hbase, and Cloudera), and configure Flume to poll from their RSS feed. Flume is amazing! All I had to do was specify the source sink, which was the URL of the twitter RSS feed, and the collector sink, which was HDFS, and it automatically pulled the JSON file straight into HDFS for me. I then set up a cron job, to have Flume fetch the feeds for me hourly, automatically.
After I had all the data I wanted collected, it was time to figure out how to pick out only what I wanted, and discard everything else. I began writing map-reduce jobs to do this, and in the process, ran into various problems that would inevitably end up crashing my job. One of the biggest obstacles in my way, was figuring out how to ideally parse mbox files. For some reason, every other Apache mailing list is structured slightly differently, and sometimes two emails in the same mbox file could be formatted differently. This made parsing a nightmare. I ended up writing code that would parse almost every type of possible format, and if any one email tried to crash the job, that email would simply be caught and discarded, and would increase a ‘skipped’ counter. The end result worked out pretty well, and while about 1% of the results still end up with failed formatting, it wasn’t consequential enough to dwell on.
For Twitter, I originally thought that all I had to do was use the JSON parsers that were readily available to me, and I’d be done. Unfortunately that wasn’t the case. JSON escaping isn’t exactly the greatest thing ever, and every time I ran into an out of place curly bracket, the entire job would crash. What I ended up doing was completely removing all unwanted curly brackets, at the risk of formatting failure, and ignoring the tweets that failed.
That wasn’t the end of fetching information from Twitter however. From the JSON files that Flume pulled for me, I could get a user’s username and tweets, but not their profile information, which was the most valuable to me. I also couldn’t fetch their profiles in the map-reduce process, because hitting Twitter with 10 queries simultaneously per second was probably going to get me banned from their site. To implement polite querying, I had the map-reduce job write out a list of usernames and their respective tweets, and when the job was done, read them into two different queues. I then built each user’s profile URL from their username, and, with a timed delay between queries, fetched each user’s profile separately, parsed it, and queued it into a third queue. Finally when all three of my queues were full, I had the MR driver dequeue each item into a report in HDFS. I should probably also note here, that map-reduce really dislikes non-English characters. I was forced to remove all users with non-English characters anywhere in their profile or tweets.
With preliminary reports generated, and ready to go, I sought out more feedback for next steps. I had about two weeks left in my internship at this point, and that was still plenty of time to add cool, additional features on top of what I already had. After hearing from several people, I decided to merge the results from the two separate sources into one final output report.
I wrote a third map-reduce job to parse both mbox and twitter reports, and which wrote the information into a custom writable. The reduce part of my MR job then combined all the mapped custom writables into one, which then generated a score when asked to, based off the information it had, and wrote a .csv file to HDFS.
The last step was to get the recently created .csv file to sort itself in order of score. I wrote a custom JAVA bean object that separated the score for each line/person from everything else (which got stored in a String), and wrote a custom CompareTo() method for the bean. Arrays.sort() then took care of everything else, and I had the merge map-reduce job’s driver write the new, sorted array back out to HDFS, and that’s about all that Sledgehammer currently does.
There are so many unimplemented possibilities that, had I more time, would definitely try to add to Sledgehammer. Amongst all the feedback I got, these things stood out to me the most. I think it’d be really nice to have a third source, from blogs, also feeding the Sledgehammer information database. I’d also like to see it automatically cross-referencing Google or Linked-In, or maybe even both. A cool, non-lead generation related add-on could be to use the existing data and run a status/activity analysis tool through it and generate a report related to that. Also, sorting the current output by company activity might also be something that could be looked into. There’s just so much data, all in one place, that people could do really cool things with it – possibly even at a hackathon.
But anyway, winding down, I should probably mention that Sledgehammer is far from perfect right now, and even the initial setup process could use a little more work. Installation and deployment setup varies from system to system, and the provided README may not cover every possible installation issue. There are also several minor inconveniences required of the user, before everything will run properly, and those are specified in the README.
All in all, I had a great time working on this project, and with all the brilliant people I’ve had the pleasure to meet at Cloudera. It was an amazing experience, and I can honestly say that I learned so much more than I’d ever expected to, and loved every minute of it.