This post was contributed by Jonathan Seidman from Orbitz. Jonathan is a Lead Engineer on the Intelligent Marketplace/Machine Learning team at Orbitz Worldwide. You can hear more from Jonathan at Hadoop World October 12th in NYC.
Orbitz Worldwide (NYSE:OWW) is composed of a global portfolio of online consumer travel brands including Orbitz, Cheaptickets, The Away Network, ebookers and HotelClub. Additionally, the company operates business-to-business services: Orbitz Worldwide Distribution provides third parties such as Amtrak, Delta, LAN, KLM, Air France and a number of other leading airlines with hotel booking capabilities, and Orbitz for Business provides corporate travel services to a number of Fortune 100 clients. The Orbitz Worldwide sites process millions of searches and transactions every day, which not surprisingly results in hundreds of gigabytes of log data per day. Not all of that data necessarily has value, but much of it does. Unfortunately, storing and processing all of that data in our existing data warehouse infrastructure is impractical because of expense and space considerations.
Apache Hadoop was selected to provide a solution to the problem of long-term storage and processing of these large quantities of unstructured and semi-structured data. We deployed our first Hadoop clusters in late 2009 running Cloudera’s Distribution for Hadoop (CDH), and in early 2010 deployed Hive to provide structure and SQL-like access to Hadoop data. In the short time since our initial deployment we’ve seen Hadoop rapidly adopted as a component in a wide range of applications across the organization due to its power, ease of use, and suitability for solving big data problems.
One of the applications that Hadoop facilitates is an effort to improve the hotel search results. Currently, when a user performs a hotel search on the Orbitz site the ranking of the search results returned (at least for larger markets) is influenced by a set of parameters manually tuned by an administrator. This leads to the question: can we use automation to optimize the ranking of hotels in order to increase bookings? In other words, can we identify consumer preferences in order to determine the best performing hotels to display to users, thus leading to more bookings? Further, for markets that are too small to be manually managed, can we implement a method to automatically rank hotel search results?
To answer this question, it was decided to turn to machine learning techniques, specifically using a trained classifier to determine a ranking of hotels that more closely follows consumer preferences. Performing this analysis requires having data on consumer interactions when shopping for hotels. Fortunately, we have a rich source of this session data in web analytics logs that are collected as users browse the sites. Unfortunately, although parts of this data are loaded into the data warehouse, it turned out that the specific fields we require are not loaded because of space restrictions. Our only alternative was to turn to the raw logs to extract the required fields. Just to further complicate things, the available archive of these logs only went back several days – not nearly enough data to perform the required analysis.
Hadoop of course provided a solution to the storage problem by providing a repository where we could download and archive logs. The next step was to extract the data we needed from the raw logs. We began with a set of shell and Perl scripts that were run manually to serially process logs on the local file system. This process worked fine for a while, but as the size of the data grew it became obvious that it wouldn’t scale. Once again Hadoop provided a solution. Since we were already storing the logs in HDFS, by moving the most time-consuming portions of the data extraction into MapReduce, we were able to dramatically decrease processing time. A test run against a small subset of data showed a more than fourfold improvement for the MapReduce processing vs. the scripts. Now that we’ve accumulated several terabytes of data the performance disparity would be even more dramatic, assuming we even had access to a storage system large enough to hold all of the data for manual processing.
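To make the extraction step more concrete, here is a minimal sketch of a map-side field extractor in the style of a Hadoop Streaming mapper, written in Python. It is an illustration only, not the job described above: the tab-separated log layout, the `hotel_id` and `position` parameter names, and the `extract` and `run_mapper` helpers are all assumptions invented for this example.

```python
import sys
from urllib.parse import urlsplit, parse_qs

# Hypothetical query-string parameters to pull out of each logged request.
WANTED = ("hotel_id", "position")


def extract(line):
    """Return (session_id, hotel_id, position), or None if the line is unusable.

    Assumes a hypothetical layout: timestamp <TAB> session_id <TAB> request URL.
    """
    parts = line.rstrip("\n").split("\t")
    if len(parts) != 3:
        return None
    _timestamp, session_id, url = parts
    params = parse_qs(urlsplit(url).query)
    try:
        values = [params[name][0] for name in WANTED]
    except KeyError:
        # The request did not carry every field we need; skip it.
        return None
    return (session_id, *values)


def run_mapper(lines, out=sys.stdout):
    """Write one tab-separated record per usable input line."""
    for line in lines:
        record = extract(line)
        if record is not None:
            out.write("\t".join(record) + "\n")
```

Under Hadoop Streaming, a script's entry point would call `run_mapper(sys.stdin)`. Because each line is handled independently, the framework can run many copies of the mapper in parallel over different blocks of the logs in HDFS, which is where the speedup over a serial script comes from.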
After the data is extracted through MapReduce, we load the resulting records into a set of Hive tables. Hive allows us to perform ad hoc querying and further analysis of this data, such as:
- Obtaining useful metrics, many of which were unavailable with our existing data stores.
- Creating data exports for further analysis with R scripts, allowing us to derive more complex statistics and visualizations of our data.
- Aggregating data for import into our data warehouse for creation of new data cubes, providing analysts access to data unavailable in existing data cubes.
In addition to assisting with hotel rank optimization, a few examples of other ways Hadoop is being applied at Orbitz Worldwide are:
- Measuring page download performance: using web analytics logs as input, a set of MapReduce scripts is used to derive detailed client-side performance metrics, which allow us to track trends in page download times.
- Searching production logs: an effort is underway to utilize Hadoop to store and process our large volume of production logs, allowing developers and analysts to perform tasks such as troubleshooting production issues.
- Data aggregation for the data warehouse: further exploration is being done to expand the use of Hadoop and Hive as a means to aggregate previously unavailable data for import into our data warehouse, making it available for access by our existing data analysis tools.
- Cache analysis: extraction and aggregation of data to provide input to analyses intended to improve the performance of data caches utilized by our web sites.
Again, these are just a few examples of how Hadoop is being utilized at Orbitz Worldwide, and we’re still just scratching the surface. Each week seems to bring a new team with a big data challenge to be solved by Hadoop, a trend which I expect to continue as more teams discover the possibilities that Hadoop provides to store and process data.
I’d like to thank my co-workers who have all made significant contributions to the work discussed here, including Rob Lancaster, Ramesh Venkataramaiah, Wai Gen Yee, Steve Hoffman, Matt Haddock and Andrew Yates. Also a big thanks to Vice President of Technology Roger Liew, who was an early and enthusiastic champion of Hadoop.