We successfully adopted Hadoop as the web analytics platform, processing one Billion weblogs daily from hundreds of web site properties at CBS Interactive. After I introduce Lumberjack, the Extraction,Transformation and Loading framework we built based on python and streaming, which is under review for Open-Source release, I will talk about web metrics processing on Hadoop, focusing on weblog harvesting, parsing, dimension look-up, sessionization, and loading into a database. Since migrating processing from a proprietary platform to Hadoop, we achieved robustness, fault-tolerance and scalability, and significant reduction of processing time to reach SLA (over six hours reduction so far).
To view the resource, please fill out this registration form.
We believe strongly in user privacy.