When most people first hear about data science, it’s usually in the context of how prominent web companies work with very large data sets in order to predict clickthrough rates, make personalized recommendations, or analyze UI experiments. The solutions to these problems require expertise with statistics and machine learning, and so there is a general perception that data science is intimately tied to these fields. However, in my conversations at academic conferences and with Cloudera customers,
Part 1 of this post covered how to convert and store email messages for archival purposes using Apache Hadoop, and outlined how to perform a rudimentary search through those archives. But, let’s face it: for search to be of any real value, you need robust features and a fast response time. To accomplish this we use Solr/Lucene-type indexing capabilities on top of HDFS and MapReduce.
Before getting into indexing within Hadoop,
Last month at the Web 2.0 Summit in San Francisco, Cloudera CEO Mike Olson presented some work the Cloudera Data Science Team did to analyze adverse drug events. We decided to share more detail about this project because it demonstrates how to use a variety of open-source tools R, Gephi, and Cloudera’s Distribution Including Apache Hadoop (CDH) to solve an old problem in a new way.
This post will explore a specific use case for Apache Hadoop, one that is not commonly recognized, but is gaining interest behind the scenes. It has to do with converting, storing, and searching email messages using the Hadoop platform for archival purposes.
Most of us in IT/Datacenters know the challenges behind storing years of corporate mailboxes and providing an interface for users to search them as necessary. The sheer volume of messages, the content structure and its complexity,