Using Apache Hadoop to Find Signal in the Noise: Analyzing Adverse Drug Events


Last month at the Web 2.0 Summit in San Francisco, Cloudera CEO Mike Olson presented some work the Cloudera Data Science Team did to analyze adverse drug events. We decided to share more detail about this project because it demonstrates how to use a variety of open-source tools – R, Gephi, and Cloudera’s Distribution Including Apache Hadoop (CDH) – to solve an old problem in a new way.

Background: Adverse Drug Events

An adverse drug event (ADE) is an unwanted or unintended reaction that results from the normal use of one or more medications. The consequences of ADEs range from mild allergic reactions to death, with one study estimating that 9.7% of adverse drug events lead to permanent disability. Another study showed that each patient who experiences an ADE remains hospitalized for an additional 1-5 days and costs the hospital up to $9,000.

Some adverse drug events are caused by drug interactions, where two or more prescription or over-the-counter (OTC) drugs taken together lead to an unexpected outcome. As the population ages and more patients are treated for multiple health conditions, the risk of ADEs from drug interactions increases. In the United States, roughly 4% of adults older than 55 are at risk for a major drug interaction.

Because clinical trials study a relatively small number of patients, both regulatory agencies and pharmaceutical companies maintain databases to track adverse events that occur after drugs have been approved for market. In the United States, the FDA uses the Adverse Event Reporting System (AERS), where healthcare professionals and consumers may report the details of ADEs they experienced. The FDA makes a well-formatted sample of the reports available for download from its website, to the benefit of data scientists everywhere.


Identifying ADEs is primarily a signal detection problem: we have a collection of events, where each event has multiple attributes (in this case, the drugs the patient was taking) and multiple outcomes (the adverse reactions that the patient experienced), and we would like to understand how the attributes correlate with the outcomes. One simple technique for analyzing these relationships is a 2×2 contingency table:

For All Drugs/Reactions:

              | Reaction = Rj | Reaction != Rj | Total
Drug = Di     | A             | B              | A + B
Drug != Di    | C             | D              | C + D
Total         | A + C         | B + D          | A + B + C + D


Based on the values in the cells of the table, we can compute various measures of disproportionality to find drug-reaction pairs that occur more frequently than we would expect if they were independent.
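To make the idea concrete, here is a minimal Python sketch of two common disproportionality measures computed from a single 2×2 table. The class name, method names, and cell counts are made up for illustration; they are not taken from the AERS data or from our released code.

```python
from dataclasses import dataclass


@dataclass
class ContingencyTable:
    a: int  # reports with drug Di and reaction Rj
    b: int  # reports with drug Di but not reaction Rj
    c: int  # reports with reaction Rj but not drug Di
    d: int  # reports with neither

    @property
    def total(self) -> int:
        return self.a + self.b + self.c + self.d

    def expected(self) -> float:
        """Count we would expect in cell A if drug and reaction were independent."""
        return (self.a + self.b) * (self.a + self.c) / self.total

    def relative_reporting_ratio(self) -> float:
        """Observed count in cell A divided by the count expected under independence."""
        return self.a / self.expected()

    def proportional_reporting_ratio(self) -> float:
        """Reaction rate among reports with the drug vs. reports without it.

        Assumes c > 0; a real pipeline would need to handle empty cells.
        """
        return (self.a / (self.a + self.b)) / (self.c / (self.c + self.d))


# Hypothetical counts, chosen only to exercise the formulas.
table = ContingencyTable(a=20, b=80, c=100, d=9800)
print(table.expected())
print(table.relative_reporting_ratio())
print(table.proportional_reporting_ratio())
```

A ratio well above 1 means the pair is reported together more often than independence would predict. Both ratios become unstable when the counts are small.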

For this project, we analyzed interactions involving multiple drugs, using a generalization of the contingency table method that is described in the paper “Empirical Bayes Screening for Multi-Item Associations” by DuMouchel and Pregibon. Their model computes a Multi-Item Gamma-Poisson Shrinkage (MGPS) estimator for each combination of drugs and outcomes, and gives us a statistically sound measure of disproportionality even if we only have a handful of observations for a particular combination of drugs. The MGPS model has been used for a variety of signal detection problems across multiple industries, such as identifying fraudulent phone calls, performing market basket analyses, and analyzing defects in automobiles.
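The intuition behind shrinkage can be shown with a deliberately simplified sketch. It assumes the observed count follows a Poisson distribution whose mean is the expected count times an unknown ratio, and that the ratio has a single gamma prior with fixed, hypothetical parameters. The full MGPS model fits a more flexible prior to the data, so treat this as an illustration of the idea and not as the estimator from the paper.

```python
def shrunken_ratio(observed: int, expected: float,
                   prior_shape: float = 1.0, prior_rate: float = 1.0) -> float:
    """Posterior mean of the observed/expected ratio under a gamma prior.

    With observed ~ Poisson(ratio * expected) and ratio ~ Gamma(shape, rate),
    the posterior is Gamma(shape + observed, rate + expected). The default
    prior parameters are assumptions chosen for illustration only.
    """
    return (prior_shape + observed) / (prior_rate + expected)


# Two hypothetical combinations with the same raw ratio of 200.
few_reports = shrunken_ratio(observed=2, expected=0.01)
many_reports = shrunken_ratio(observed=200, expected=1.0)
print(few_reports, many_reports)
```

With the assumed prior, two combinations that share a raw ratio of 200 land far apart: the one backed by 2 reports scores roughly 3, while the one backed by 200 reports scores about 100. The estimate moves away from the prior only as evidence accumulates.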

Solving the Hard Problem with Apache Hadoop

At first glance, it doesn’t seem like we would need anything beyond a laptop to analyze ADEs, since the FDA only receives about one million reports a year. But when we begin to examine these reports, we discover a problem that is similar to what happens when we attempt to teach computers to play chess: a combinatorial explosion in the number of possible drug interactions we must consider. Even if we restrict ourselves to pairs of drugs, there are more than 3 trillion potential drug-drug-reaction triples in the AERS dataset, and tens of millions of triples that we actually see in the data. As a result, even when we account for the iterative Expectation Maximization algorithm that we use to fit the MGPS model, the total runtime of our analysis is dominated by the process of counting how often the various interactions occur.

The good news is that MapReduce running on a Hadoop cluster is ideal for this problem. By creating a pipeline of MapReduce jobs to clean, aggregate, and join our data, we can parallelize the counting problem across multiple machines to achieve a linear speedup in our overall runtime. The faster runtime for each individual analysis allows us to iterate rapidly on smaller models and tackle larger problems involving more drug interactions than anyone has ever looked at before.
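The counting step has a natural map/reduce shape, which the following single-machine Python sketch imitates. The function names and the placeholder drug and reaction labels are hypothetical, and the real pipeline also cleans and joins the data before counting.

```python
from collections import Counter
from itertools import combinations, product
from typing import Iterable, Iterator, Tuple

Triple = Tuple[str, str, str]


def map_report(drugs: Iterable[str],
               reactions: Iterable[str]) -> Iterator[Tuple[Triple, int]]:
    """Emit ((drug_a, drug_b, reaction), 1) for every drug pair in one report."""
    # Sorting the de-duplicated drugs gives each pair one canonical ordering.
    drug_pairs = combinations(sorted(set(drugs)), 2)
    for (drug_a, drug_b), reaction in product(drug_pairs, sorted(set(reactions))):
        yield (drug_a, drug_b, reaction), 1


def reduce_counts(pairs: Iterable[Tuple[Triple, int]]) -> Counter:
    """Sum the emitted values per key, as a reducer would after the shuffle."""
    counts: Counter = Counter()
    for key, value in pairs:
        counts[key] += value
    return counts


# Placeholder reports: (drugs taken, reactions experienced).
reports = [
    (["drug_b", "drug_a"], ["reaction_x"]),
    (["drug_a", "drug_b", "drug_c"], ["reaction_x", "reaction_y"]),
]
counts = reduce_counts(
    kv for drugs, reactions in reports for kv in map_report(drugs, reactions)
)
print(counts[("drug_a", "drug_b", "reaction_x")])
```

Because each report is mapped independently and the reduce step only sums values per key, the work can be split across machines without coordination between mappers.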

Visualizing the Results

The output of our analysis is a collection of drug-drug-reaction triples that have very large disproportionality scores. But as we all know, correlation is not causation. These triples are useful leads that should be filtered and evaluated by domain experts and used as the basis for further study using controlled experiments.

With that caveat in mind, our analysis revealed a few drug pairs with surprisingly high correlations with adverse events that did not show up in a search of the academic literature: gabapentin (a seizure medication) taken in conjunction with hydrocodone/paracetamol was correlated with memory impairment, and haloperidol in conjunction with lorazepam was correlated with the patient entering into a coma.

Even with restrictive filters applied to the drug-drug-reaction triples, we still end up with tens of thousands of triples that score high enough to merit further investigation. In addition to looking at individual triples, we can also use graph visualization tools like Gephi to explore the macro-level structure of the data. Gephi has a number of powerful layout algorithms and filtering tools that allow us to impose structure on an undifferentiated mass of data points. Here is a graph in which the vertices are drugs and the thickness of each edge represents the number of high-scoring adverse reactions that feature that pair of drugs:

We can also pan and zoom to different regions of the graph and highlight clusters of drug interactions. Here is a cluster of drugs that are used in treating HIV:

A cluster of HIV-related drugs

And here is a cluster of drugs that are used to fight cancer:

A cluster of cancer-related drugs
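A graph like the ones above can be built from an edge list in which each edge's weight is the number of high-scoring reactions for a pair of drugs. The sketch below writes such a list as CSV with Source, Target, and Weight columns, which Gephi's spreadsheet import can read as an edge table. The function names and the placeholder triples are hypothetical, not part of our released code.

```python
import csv
import io
from collections import Counter
from typing import Iterable, Tuple


def edge_weights(triples: Iterable[Tuple[str, str, str]]) -> Counter:
    """Count high-scoring reactions per unordered pair of drugs."""
    weights: Counter = Counter()
    for drug_a, drug_b, _reaction in triples:
        weights[tuple(sorted((drug_a, drug_b)))] += 1
    return weights


def to_edge_csv(weights: Counter) -> str:
    """Render the weighted edges as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    for (source, target), weight in sorted(weights.items()):
        writer.writerow([source, target, weight])
    return buffer.getvalue()


# Placeholder triples that are assumed to have passed the score filter.
triples = [
    ("drug_a", "drug_b", "reaction_x"),
    ("drug_b", "drug_a", "reaction_y"),
    ("drug_a", "drug_c", "reaction_x"),
]
edge_csv = to_edge_csv(edge_weights(triples))
print(edge_csv)
```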

The combination of Apache Hadoop, R, and Gephi changes the way we think about analyzing adverse drug events. Instead of focusing on a handful of outcomes, we can process all of the events in the data set at the same time. We can try out hundreds of different strategies for cleaning records, stratifying observations into clusters, and scoring drug-reaction tuples, run everything in parallel, and analyze the data at a fraction of the cost of a traditional supercomputer. We can render the results of our analyses using visualization tools that can be used by domain experts to explore relationships within our data that they might never have thought to look for. By dramatically reducing the costs of exploration and experimentation, we foster an environment that enables innovation and discovery.

Open Data, Open Analysis

This project was possible because the FDA’s Center for Drug Evaluation and Research makes a portion of its data open and available to anyone who wants to download it. In turn, we are releasing a well-commented version of the code we used to analyze that data – a mixture of Java, Pig, R, and Python – on the Cloudera GitHub repository under the Apache License. We also contributed the most useful Pig function developed for this project, which computes approximate quantiles for a stream of numbers, to LinkedIn’s DataFu library. We hope to collaborate with the community to improve the models over time and create a resource for students, researchers, and fellow data scientists.

