Using Apache Hadoop to Annotate Billions of Web Documents with Semantics
Welcome to the first guest post on the Cloudera blog. The other day, we saw Toby from Swingly tweeting about using Apache Hadoop to process millions of other tweeters’ tweets. We were curious, and Toby put together a great writeup about how they use Hadoop to crunch data. We have a few other guest posts in the pipeline, but if you are doing something really fun with Hadoop and want to share, we’d love to hear from you. Get in touch! -Christophe
How can you run hundreds of memory-intensive annotation tasks across billions of web documents to build a sweet semantic search engine before the sun goes nova? Use Hadoop.
Now the slightly longer version: I’m a software engineer working for a brand-new start-up called Swingly. We are based in Northern Dallas and are building a new semantic search engine that, given a question, can find answers on the open web (that includes tons of standard web documents as well as large structured data sources!).
Oh and before I forget, standard stuff: These are my views and don’t represent the views of any company, my cat, or my father’s brother’s nephew’s cousin’s former roommate.
Now down to the details. Our goal is to build the largest searchable semantic index of web content ever created (emphasis on the semantic part there). To do this, we’re annotating terabytes of documents with information extraction and text analysis tools. The linguistic cues extracted by this process form the backbone of our search index. This heavy use of NLP technology on a large scale is what really sets Swingly’s approach to semantic search apart from our competition: instead of merely mining data from Wikipedia, we’re actually finding and managing billions of new facts each day.
We currently extract more than 500 kinds of named entities and hundreds of types of relations, attributes, and events from unstructured text. This is no small job: while these extraction tools are great resources, it takes a heck of a lot of computing power to actually deploy them in a large-scale operation! Here’s where Apache Hadoop comes in.
Our Hadoop cluster has about 100 cores and runs 24/7, so we wanted to make the best possible use of this hardware. We made some minor configuration tweaks to suit our needs. Specifically, we have adjusted the default number of map and reduce tasks, as well as set up multiple secondary name nodes for added data protection.
Our processing task consists of loading the compressed raw text data (e.g. HTML and PDF) into the DFS and starting a sequence of jobs over that data. Intermediate steps include HTML zoning, content filtering, and named entity recognition. We use standard Hadoop sequence files throughout this process, although we have implemented our own “Writable” classes for serializing our extracted metadata.
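For readers who have not written one, here is a minimal sketch of what a custom Writable-style class can look like. It is not Swingly's actual code: the class name EntityAnnotation and its three fields are hypothetical, chosen only for illustration. To keep the example runnable without Hadoop on the classpath, the class does not declare "implements Writable", but its write and readFields methods have the same signatures as Hadoop's org.apache.hadoop.io.Writable interface, which is built on the standard java.io.DataOutput and java.io.DataInput types.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical metadata record: one named-entity mention inside a document.
// In a real job this class would also declare
// "implements org.apache.hadoop.io.Writable".
final class EntityAnnotation {
    private String entityType = "";
    private int start;
    private int end;

    // Writable types need a no-arg constructor so the framework can
    // create an empty instance and then call readFields on it.
    EntityAnnotation() {}

    EntityAnnotation(String entityType, int start, int end) {
        this.entityType = entityType;
        this.start = start;
        this.end = end;
    }

    // Serialize the fields in a fixed order.
    public void write(DataOutput out) throws IOException {
        out.writeUTF(entityType);
        out.writeInt(start);
        out.writeInt(end);
    }

    // Deserialize the fields in exactly the same order as write.
    public void readFields(DataInput in) throws IOException {
        entityType = in.readUTF();
        start = in.readInt();
        end = in.readInt();
    }

    String getEntityType() { return entityType; }
    int getStart() { return start; }
    int getEnd() { return end; }

    // Convenience helper for round-trip checks outside of Hadoop.
    byte[] toBytes() {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            write(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    static EntityAnnotation fromBytes(byte[] bytes) {
        EntityAnnotation a = new EntityAnnotation();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            a.readFields(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return a;
    }

    public static void main(String[] args) {
        EntityAnnotation copy = fromBytes(new EntityAnnotation("PERSON", 10, 25).toBytes());
        System.out.println(copy.getEntityType() + " " + copy.getStart() + "-" + copy.getEnd());
    }
}
```

The one rule that matters is that readFields consumes fields in the same order that write produces them; a round trip through toBytes and fromBytes is a cheap way to check that before running a full job.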
There were a few extra considerations we had to address. For one, some of our linguistic processes are in a state of flux as we experiment with new algorithms and tweak existing ones, so some of the data periodically needs to be reprocessed. To avoid reprocessing all of our data from scratch, we save the intermediate output between steps so that we can rerun only a specific stage and save valuable machine time.
Another factor we had to consider was reliability. Our first runs would often crash or stall due to software problems or memory issues. We have set a number of timeouts and fault tolerance parameters throughout our Hadoop configuration code to allow tasks to fail gracefully without interrupting an entire process (e.g. using SkipBadRecords).
Another relatively recent improvement was to permanently store the necessary jar files on the DFS. This became necessary because the myriad NLP tools we use created significant jar bloat, with our job jars exceeding 500 MB in size. Needless to say, such bloat added frustrating overhead during testing and bug fixing, so we found a workaround: with some classloader voodoo we automatically copy the necessary jars to the DFS and use Hadoop’s distributed cache mechanism to put them on the classpath. There are also a few extra tweaks, like computing checksums to avoid unneeded copying of jars.
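The checksum tweak is easy to picture in code. The sketch below shows only the decision logic, under assumptions that are mine rather than a description of our production setup: the class name JarSync and its methods are hypothetical, the checksum is a plain CRC32 from the Java standard library, and the checksums of the jars already on the DFS are assumed to be available as a simple name-to-checksum map. The actual copying to the DFS and the distributed cache registration are left out.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.zip.CRC32;

// Hypothetical helper: decides which local jars need to be (re)copied.
final class JarSync {

    // CRC32 of the jar contents; assumed good enough for change detection.
    static long checksum(byte[] content) {
        CRC32 crc = new CRC32();
        crc.update(content, 0, content.length);
        return crc.getValue();
    }

    // Returns the names of local jars that are missing remotely or whose
    // remote checksum differs from the local one, in iteration order.
    static List<String> jarsToCopy(Map<String, byte[]> localJars,
                                   Map<String, Long> remoteChecksums) {
        List<String> stale = new ArrayList<>();
        for (Map.Entry<String, byte[]> jar : localJars.entrySet()) {
            Long remote = remoteChecksums.get(jar.getKey());
            if (remote == null || remote.longValue() != checksum(jar.getValue())) {
                stale.add(jar.getKey());
            }
        }
        return stale;
    }

    public static void main(String[] args) {
        // Stand-in jar contents; real code would read the jar files.
        Map<String, byte[]> local = new LinkedHashMap<>();
        local.put("unchanged.jar", "v1".getBytes(StandardCharsets.UTF_8));
        local.put("rebuilt.jar", "v2".getBytes(StandardCharsets.UTF_8));

        Map<String, Long> remote = new LinkedHashMap<>();
        remote.put("unchanged.jar", checksum("v1".getBytes(StandardCharsets.UTF_8)));
        remote.put("rebuilt.jar", checksum("v1".getBytes(StandardCharsets.UTF_8)));

        System.out.println(jarsToCopy(local, remote));
    }
}
```

With hundreds of megabytes of mostly unchanging third-party jars, a check like this means a typical test run only pays for copying the one or two jars that were actually rebuilt.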
With regard to my tweet about loading Twitter statuses (how we hooked up with Cloudera): We’re actually using the same extraction tools we’ve deployed for the open web data to extract semantic information from the Twitter stream. This effort is just getting off the ground, but we’re already finding tons of interesting trending and zeitgeist information from this process.
Let me know if there are any questions or necessary elaborations on any of this! It’s a pleasure to share our exciting work with the community, and we hope to go live soon. You can find my contact details at http://www.toluju.com/