Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interactive


Tuesday, November 8th, 2011


We successfully adopted Hadoop as the web analytics platform at CBS Interactive, processing one billion weblog records daily from hundreds of web site properties. After introducing Lumberjack, the Extraction, Transformation, and Loading (ETL) framework we built on Python and Hadoop Streaming (currently under review for open-source release), I will talk about web metrics processing on Hadoop, focusing on weblog harvesting, parsing, dimension look-up, sessionization, and loading into a database. Since migrating processing from a proprietary platform to Hadoop, we have achieved robustness, fault tolerance, and scalability, along with a significant reduction in processing time to meet our SLA (more than six hours saved so far).
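The sessionization step mentioned above is typically done in a streaming reducer: the shuffle/sort phase delivers each user's hits in timestamp order, and a new session starts whenever the gap between consecutive hits exceeds an inactivity threshold. The sketch below illustrates this idea in plain Python; the 30-minute gap, the field layout, and all names are assumptions for illustration, not Lumberjack's actual API.

```python
import sys
from itertools import groupby

# Assumed inactivity threshold (a common industry default, not from the talk).
SESSION_GAP = 30 * 60  # seconds

def sessionize(records, gap=SESSION_GAP):
    """Assign session IDs to (user_id, timestamp, url) records.

    Assumes records arrive sorted by (user_id, timestamp), as Hadoop
    Streaming's shuffle/sort would deliver them to a reducer.
    """
    out = []
    for user, hits in groupby(records, key=lambda r: r[0]):
        session = 0
        prev_ts = None
        for _, ts, url in hits:
            if prev_ts is not None and ts - prev_ts > gap:
                session += 1  # inactivity gap closes the old session
            out.append((user, f"{user}-{session}", ts, url))
            prev_ts = ts
    return out

if __name__ == "__main__":
    # Reducer-style driver: tab-separated lines on stdin,
    # already sorted by user and timestamp by the framework.
    recs = []
    for line in sys.stdin:
        user, ts, url = line.rstrip("\n").split("\t")
        recs.append((user, int(ts), url))
    for row in sessionize(recs):
        print("\t".join(map(str, row)))
```

Running this as the reduce step of a streaming job (with the mapper emitting `user \t timestamp \t url`) yields one session-tagged row per hit, ready for the database-loading step.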
