Evolution of Hadoop Ecosystem: AOL Advertising Experience

Categories: CDH, Data Ingestion, General, Guest, Use Case

Pero works on research and development in new technologies for online advertising at AOL Advertising R&D in Palo Alto. Over the past 4 years he has been the Chief Architect of the R&D distributed ecosystem, comprising more than a thousand nodes in multiple data centers. He also led large-scale contextual analysis, segmentation and machine learning efforts at AOL, Yahoo and Cadence Design Systems, and has published patents and research papers in these areas.

A critical premise for the success of online advertising networks is the ability to collect, organize, analyze and use large volumes of data for decision making. Given the online orientation and dynamic nature of these networks, it is critical that these processes be automated to the largest extent possible.

Specifically, the success of advertising technology and its impact on revenue are directly proportional to its capability to use large amounts of data to compute the proper impression value given the unique circumstances of each ad-serving event, such as the characteristics of the impression, the ad, and the user, as well as the content and context. As a general rule, more data results in more accurate predictions.

In addition to Optimization, Reporting and Analytics provide indispensable feedback to our internal Business and Sales teams, helping us acquire new commitments from external customers and expand current ones.

At AOL, we started large-scale data collection more than 4 years ago and went from using heavily sampled data sets to being able to process full serving logs. We have been using Apache Hadoop since version 0.14 as part of an R&D effort, and recently moved to the Cloudera CDH3 distribution. Gradually, we introduced more systems and technologies to our ecosystem around Hadoop.

We chose Hadoop for several reasons:

  • Ability to store, organize and process large data sets
  • Great flexibility with data formats
  • MapReduce offers a flexible data-processing paradigm and works well with changing data
  • Excellent cost-volume/price-performance point, which proved very important in early proof-of-concept stages
  • Fault tolerance built into the system via distributed computation and data redundancy
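The flexibility of the MapReduce paradigm can be illustrated with a minimal sketch. Plain Python stands in here for an actual Hadoop job, and the serving-log format and field names are hypothetical examples, not AOL's actual schema:

```python
from collections import defaultdict

# Hypothetical serving-log lines: "timestamp<TAB>campaign_id<TAB>event"
LOG_LINES = [
    "1300000001\tcamp_A\timpression",
    "1300000002\tcamp_B\timpression",
    "1300000003\tcamp_A\tclick",
    "1300000004\tcamp_A\timpression",
]

def map_phase(line):
    """Map: emit (campaign_id, 1) for each impression event."""
    ts, campaign, event = line.split("\t")
    if event == "impression":
        yield campaign, 1

def reduce_phase(key, values):
    """Reduce: sum the per-campaign counts."""
    return key, sum(values)

def run_job(lines):
    # Shuffle: group the mapped pairs by key, as Hadoop does between phases
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_phase(line):
            groups[key].append(value)
    return dict(reduce_phase(k, v) for k, v in groups.items())

print(run_job(LOG_LINES))  # {'camp_A': 2, 'camp_B': 1}
```

Because only the map and reduce functions encode the log format, a schema change touches a few lines of user code rather than the framework, which is what makes the paradigm work well with changing data.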
[Figure: line graph of AOL's cluster size (nodes) and aggregate disk space (TB)]

Figure 1. Growth of Hadoop cluster

[Figure: growth of AOL's sampling rate]

Figure 2. Growth in sampling rate

We show the growth of our Hadoop clusters in Figure 1, and the increase in the sampling rate in Figure 2. Between the 3rd and 4th iterations we switched to disks four times larger and used 4-8 times more cores per node. The increase in the total number of CPUs was even more pronounced, as we found we needed more processing power for newly developed processing flows. During the initial stages, growing the sampling rate was the primary goal. As the number of processing pipelines increased, so did the output data volume, and we also added more external data flows. These two trends drove the increase in total storage space and processing power beyond full log samples between stages 4 and 5. Note that factors like the business environment and team growth also had a significant impact on the pace of cluster upgrades.

At the same time, we grew the ecosystem around Hadoop to encompass other infrastructure and computational components such as databases, caching and high-performance computing clusters. As our Hadoop clusters increased in size, these surrounding clusters grew correspondingly to store and process larger data sets.

The main reason for the qualitative shift between the 3rd and 4th iterations was the move from R&D to a production environment. With the involvement of additional teams, we faced several challenges that Cloudera helped us with:

  • Specifying and executing operational requirements
  • Cluster setup
  • Staff training
  • Introducing other indispensable parts of the Hadoop ecosystem, such as robust data flows (Flume), monitoring and instrumentation
  • Ensuring that long-term vision and execution are aligned with Hadoop roadmap
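As an illustration of the robust data flows mentioned above, a Flume agent that tails a serving log and delivers it to HDFS can be configured along these lines. This is a sketch in the Flume NG properties format; the agent name, paths and channel sizing are illustrative, not our production settings:

```properties
# Illustrative Flume agent: tail a local serving log into HDFS
agent1.sources = logsrc
agent1.channels = memch
agent1.sinks = hdfssink

# Source: follow the serving log as it is written (hypothetical path)
agent1.sources.logsrc.type = exec
agent1.sources.logsrc.command = tail -F /var/log/adserver/serving.log
agent1.sources.logsrc.channels = memch

# Channel: in-memory buffer between source and sink
agent1.channels.memch.type = memory
agent1.channels.memch.capacity = 10000

# Sink: roll files into date-partitioned HDFS directories
agent1.sinks.hdfssink.type = hdfs
agent1.sinks.hdfssink.hdfs.path = hdfs://namenode/flume/serving/%Y-%m-%d
agent1.sinks.hdfssink.hdfs.useLocalTimeStamp = true
agent1.sinks.hdfssink.channel = memch
```

The channel decouples log production from HDFS writes, so short sink outages do not stall the serving hosts.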

The last point is especially important, as we see Hadoop as an ever-evolving data processing platform. We see ourselves as a contributor and partner in this process: through the recently introduced Cloudera Customer Council, we participate in discussions and working groups. For us, this is a great learning experience that also provides ample opportunities to contribute to an important technology that is changing the way we do business.