San Francisco seems to be having an unusually high number of flu cases/searches this April, and the Cloudera Data Science Team has been hit pretty hard. Our normal activities (working on Crunch, speaking at conferences, finagling a job with the San Francisco Giants) have taken a back seat to bed rest, throat lozenges, and consuming massive quantities of orange juice. But this bit of downtime also gave us an opportunity to focus on solving a large-scale data science problem that helps some of the people who help humanity the most: epidemiologists.

## Case-Control Studies

A case-control study is a type of observational study in which a researcher attempts to identify the factors that contribute to a medical condition by comparing a set of subjects who have that condition (the ‘cases’) to a set of subjects who do not have the condition, but otherwise resemble the case subjects (the ‘controls’). Case-control studies are useful for exploratory analysis because they are relatively cheap to perform, and they have led to many important discoveries, most famously the link between smoking and lung cancer.

Epidemiologists and other researchers now have access to data sets that contain tens of millions of anonymized patient records. Tens of thousands of these patient records may include a particular disease that a researcher would like to analyze. In order to find enough unique control subjects for each case subject, a researcher may need to execute tens of thousands of queries against a database of patient records, and I have spoken to researchers who spend *days* performing this laborious task. Although they would like to parallelize these queries across multiple machines, there is a constraint that makes this problem a bit more interesting: each control subject may be matched with at most one case subject. If we parallelize the queries across the case subjects, we need to check that we have not assigned a control subject to multiple cases. If we parallelize the queries across the control subjects, we need to be sure that each case subject ends up with a sufficient number of control subjects. Either way, we still need to query the data an arbitrary number of times to ensure that the matching of cases and controls we come up with is feasible, let alone optimal.

Designing and analyzing a case-control study is a problem for a statistician. *Constructing* a case-control study is a problem for a data scientist.

## Applied Auction Theory

We can think of constructing a case-control study as an assignment problem: we have a bipartite graph, where one set of nodes represents the cases, one set of nodes represents the controls, and the edges between the cases and controls are weighted by the quality of the match between the subjects as determined by the researcher. If a particular case-control pair would not be a suitable match under any circumstances because the patients are not similar enough, there is no edge between them.
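To make the graph concrete, here is a minimal sketch of how the weighted edges might be built for a handful of patients. The matching rule (same sex, age within five years, closer ages score higher), the field names, and the function names `match_weight` and `build_edges` are all assumptions made for illustration; in a real study the researcher defines what counts as a good match.

```python
def match_weight(case, control, max_age_gap=5):
    """Return an integer match quality, or None when the pair is unsuitable.

    Hypothetical rule: subjects must share a sex and be within
    max_age_gap years of each other; closer ages score higher.
    """
    if case["sex"] != control["sex"]:
        return None
    gap = abs(case["age"] - control["age"])
    if gap > max_age_gap:
        return None
    return max_age_gap - gap + 1


def build_edges(cases, controls):
    """Build the sparse bipartite graph as {case_id: {control_id: weight}}.

    Unsuitable pairs simply get no entry, which is the "no edge" rule.
    """
    edges = {}
    for case_id, case in cases.items():
        row = {}
        for control_id, control in controls.items():
            weight = match_weight(case, control)
            if weight is not None:
                row[control_id] = weight
        edges[case_id] = row
    return edges


# Made-up records, for illustration only.
cases = {"case_a": {"age": 60, "sex": "F"}}
controls = {
    "ctl_1": {"age": 62, "sex": "F"},  # close in age: edge with weight 4
    "ctl_2": {"age": 60, "sex": "M"},  # different sex: no edge
    "ctl_3": {"age": 70, "sex": "F"},  # too far apart in age: no edge
}
edges = build_edges(cases, controls)
```

The nested loop compares every case with every control, which is fine for a toy example but is exactly the part that needs to be distributed when there are tens of millions of records.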

Although MapReduce is great for finding compatible case-control pairs and computing the weights we want to assign to those matches, it’s not ideal for the kinds of iterative, graph-based computations that we need to do in order to solve the assignment problem. After we use MapReduce to prepare the input, we turn to Apache Giraph, a Java library that makes it easy to perform fast, distributed graph processing on Apache Hadoop clusters, to assign cases to controls.

Although there are lots of different algorithms for solving the assignment problem, our implementation is based on Bertsekas’ auction algorithm. The core idea of the algorithm is that the case subjects will bid for the right to be matched with control subjects over a series of rounds, with the bids computed based on the edge weights. Assuming that all of the weights are integers, the auction algorithm is guaranteed to converge to an assignment of cases to controls that maximizes the sum of the weights of the matched pairs. Bertsekas’ algorithm is also very easy to parallelize, and has excellent performance on assignment problems that are relatively sparse (i.e., each node is only connected to a small fraction of the total nodes).
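The bidding idea is easier to see in a few lines of code. What follows is a simplified, single-machine sketch of an auction in the style of Bertsekas, not our Giraph implementation. It assumes integer weights, that every case can be matched to a distinct control, and that all prices start at zero; the function name `auction_assignment`, the `max_bids` safety cap, and the sample weights are made up for illustration.

```python
import math


def auction_assignment(weights, max_bids=100_000):
    """Match each case to a distinct control via a sequential auction.

    weights maps case -> {control: integer weight}; missing pairs have no edge.
    Assumes a complete matching of the cases exists.
    """
    # A bid increment below 1 / (number of cases) keeps integer-weight
    # solutions optimal.
    epsilon = 1.0 / (len(weights) + 1)
    prices = {}  # control -> current price (implicitly 0 until bid on)
    owner = {}   # control -> case currently holding it
    unassigned = list(weights)
    bids = 0
    while unassigned:
        bids += 1
        if bids > max_bids:
            raise RuntimeError("no complete matching found within max_bids")
        case = unassigned.pop()
        # Find the best and second-best net value (weight minus price).
        best, best_value, second_value = None, -math.inf, -math.inf
        for control, weight in weights[case].items():
            value = weight - prices.get(control, 0.0)
            if value > best_value:
                best, best_value, second_value = control, value, best_value
            elif value > second_value:
                second_value = value
        if best is None:
            raise ValueError(f"{case} has no candidate controls")
        if second_value == -math.inf:
            second_value = best_value  # single candidate: raise by epsilon only
        # Raise the price just enough that the runner-up becomes competitive.
        prices[best] = prices.get(best, 0.0) + (best_value - second_value) + epsilon
        previous = owner.get(best)
        if previous is not None:
            unassigned.append(previous)  # the outbid case must bid again
        owner[best] = case
    return {case: control for control, case in owner.items()}


# Hypothetical match weights for three cases and three controls.
sample_weights = {
    "case_a": {"ctl_1": 9, "ctl_2": 7},
    "case_b": {"ctl_1": 8, "ctl_2": 3},
    "case_c": {"ctl_2": 6, "ctl_3": 5},
}
matching = auction_assignment(sample_weights)
total = sum(sample_weights[case][control] for case, control in matching.items())
```

On this sample, `case_c` initially wins `ctl_2`, is later outbid by `case_a`, and settles for `ctl_3`, which is the kind of reshuffling that makes the problem awkward to solve with independent parallel queries. In a distributed setting, many cases bid in the same round and each control keeps only its highest bidder.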

## Do It Yourself

Our toolkit for constructing case-control studies is available on Cloudera’s GitHub repository, and is released under the Apache License. To get started, you will need a cluster that has Apache ZooKeeper installed, which is easy to do on local servers using the free edition of Cloudera Manager, or in a cloud environment via the version of Apache Whirr in CDH3. If you are just getting started with Hadoop and run into any issues, Cloudera Support is happy to help.

This work, like a lot of the work we do, started out as a conversation with a Cloudera customer about a challenge they were facing. If you have a data problem, if no one else can help, and if you can provide chicken soup, maybe you can hire the Cloudera Data Science Team.
