We at Cloudera are tremendously excited by the power of data to effect large-scale change in the healthcare industry. Many of the projects that our data science team worked on in the past year originated as data-intensive problems in healthcare, such as analyzing adverse drug events and constructing case-control studies. Last summer, we announced that our Chief Scientist Jeff Hammerbacher would be collaborating with the Mt. Sinai School of Medicine to leverage large-scale data analysis with Apache Hadoop for the treatment and prevention of disease. And next week, it will be my great pleasure to host a panel of data scientists and researchers at the Strata Rx Conference (register with discount code SHARON for 25% off) to discuss the meaningful use of natural language processing in clinical care.
Of course, the cost-effective storage and analysis of massive quantities of text is one of Hadoop’s strengths, and Jimmy Lin’s book on text processing is an excellent way to learn how to think in MapReduce. But a close study of how the applications of natural language processing technology in healthcare have evolved over the last few years is instructive for anyone who wants to understand how to use data science in order to tackle seemingly intractable problems.
Lesson 1: Choose the Right Problem
- Collect a lot of dirty, unstructured data.
- Hire a data scientist.
In general, I am wary of people who come to me bearing databases and asking for “insights” into their data. The right way to approach data science is to start with a problem that has a bottom-line impact on your business, and then work backward from the problem towards the analysis and the data needed to solve it. Insights don’t happen in a vacuum – they come with the hard work of analyzing data and building models to solve real problems.
Sometimes, the link between the business problem and the application of data science will be very clear, as in the case of correctly identifying fraudulent credit card transactions. In other cases, there can be multiple steps that separate the business problem from the data science application. For example, a rental service like Netflix is primarily interested in growing and retaining its subscribers. They could have performed an analysis that demonstrated a correlation between the number of movies in a customer’s queue and the probability that the customer will renew their subscription. This analysis might have then motivated Netflix to create a movie recommendation system that helps customers discover movies that they will love. If the users who add recommended movies to their queues are then more likely to renew their subscriptions, then the project has succeeded.
In the case of natural language processing and healthcare, the right problem turned out to be computer-assisted coding (CAC). In healthcare, coding refers to the process of converting the narrative description of the treatments a patient received, including doctors’ notes, lab tests, and medical images, into a series of alphanumeric codes that are used for billing purposes. Medical coding is both very important (if the treatments a patient receives aren’t coded, the insurance company won’t pay for them) and very expensive (medical coders need a lot of training and skill to do the job well). To make matters worse, the coding standards are becoming more complex: the current ICD-9 standard has around 24,000 possible codes, while the next-generation ICD-10 standard will expand to 155,000 codes. Finding ways to use natural language processing to help coders be more efficient is a great problem for a data scientist to tackle: it has a quantifiable impact on the bottom line and there is a strong potential for data analysis and modeling to make a meaningful difference.
Lesson 2: Build a Minimum Viable Data Product
The minimum viable product strategy is also a good way of developing data products: the first model that we use for a problem does not need to crunch massive quantities of data or leverage the most advanced machine learning algorithms. Our primary objective is to create the simplest system that will provide enough utility for its users that they are willing to use it and start providing feedback that we can use to make the product better. At first, this feedback may be explicitly communicated to the data scientists working on the system, who may incorporate it by tuning the system by hand. But if we design the system well, we can use automated and implicit sources of feedback to make improvements in a more scalable fashion.
The first systems for performing computer-assisted coding were similar to the first spam classifiers: they relied almost exclusively on a static set of rules in order to make coding decisions. They also primarily targeted medical coding applications for outpatient treatments, instead of the more complex coding required for inpatient treatments. These early systems weren’t particularly great, but they were useful enough to gather feedback from medical coders about when they failed to identify a code or included one that was not relevant for the problem. As more data was gathered, the static rules could be augmented with statistical models that were capable of adjusting to new information and improving over time.
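To make the idea of a rules-first system that gathers coder feedback concrete, here is a minimal sketch in Python. It is not how any real computer-assisted coding product works: the keyword rules, the function names (suggest_codes, record_feedback), and the sample note are all assumptions made up for illustration, and the three ICD-9 codes are used only as familiar examples.

```python
import re
from collections import Counter

# Hypothetical keyword rules: regex pattern -> ICD-9 code (illustrative only).
RULES = {
    r"\bhypertension\b": "401.9",
    r"\btype 2 diabetes\b": "250.00",
    r"\basthma\b": "493.90",
}


def suggest_codes(note):
    """Return the sorted codes whose pattern appears in the note text."""
    text = note.lower()
    return sorted({code for pattern, code in RULES.items() if re.search(pattern, text)})


def record_feedback(suggested, final, log):
    """Tally codes the rules missed and codes the coder rejected as irrelevant."""
    suggested, final = set(suggested), set(final)
    for code in final - suggested:
        log[("missed", code)] += 1
    for code in suggested - final:
        log[("irrelevant", code)] += 1


feedback_log = Counter()
note = "Patient with hypertension and a family history of asthma."
suggested = suggest_codes(note)

# The (hypothetical) coder keeps only the hypertension code, so the asthma
# suggestion is logged as irrelevant: the static rule cannot tell a family
# history from a diagnosis.
record_feedback(suggested, final=["401.9"], log=feedback_log)
print(suggested)
print(feedback_log)
```

The tallies in feedback_log are the kind of signal that could later be used to prune a bad rule by hand or to train a statistical model that replaces it.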
Lesson 3: The One Who Has the Best Data Wins
Data is like any other kind of capital – it flows to where it is wanted, and it stays where it is well-treated. Good algorithms and good people are critical for any data science project, but there is absolutely no substitute for high-quality data that you can use as inputs for your models. As your models improve, they get used more often to make decisions, receive even more feedback, and are used in a wider variety of situations, which leads to a virtuous cycle and the kind of network effects that we see in winner-take-all markets.
Many of the computer-assisted coding products that are available today are web-based and/or integrated with electronic health record (EHR) systems, which allows them to collect feedback data quickly and reliably as well as take advantage of more information about the patient to improve the automated coding. It also becomes possible to use the feedback from many different medical coders across different healthcare institutions in order to make improvements in the underlying models more quickly.
Data as Platform
For many problems that can be tackled using machine learning, the choice of input features is the most important part of the overall process. Data scientists bridge the gap between messy, unstructured data and the structured inputs required by our algorithms. At scale, the skills required to generate input features are similar to the ones needed to build ETL pipelines for data warehousing applications. You might say that ETL is the wax-on, wax-off of data science.
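As a small illustration of turning unstructured text into structured model inputs, the sketch below builds bag-of-words count vectors from free-text notes using only the Python standard library. The function names (tokenize, build_vocabulary, featurize) and the two sample notes are assumptions invented for this example; a production pipeline would run similar logic at scale, for instance as a MapReduce job.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_vocabulary(notes, min_count=1):
    """Collect tokens that occur in at least min_count notes, in sorted order."""
    doc_counts = Counter(tok for note in notes for tok in set(tokenize(note)))
    return sorted(tok for tok, n in doc_counts.items() if n >= min_count)


def featurize(note, vocabulary):
    """Map a note to a fixed-length vector of token counts."""
    counts = Counter(tokenize(note))
    return [counts[tok] for tok in vocabulary]


# Made-up sample notes, for illustration only.
notes = ["Chest pain, no fever.", "Fever and cough; cough persists."]
vocab = build_vocabulary(notes)
rows = [featurize(n, vocab) for n in notes]

print(vocab)
for row in rows:
    print(row)
```

Every note ends up as a row with the same columns, which is the extract-and-transform step that most learning algorithms need before they can use the text at all.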
One of the reasons that automatic medical coding is such a great problem for data scientists to take on is that solving it well doesn’t just save money and time; it also provides the structured information that we need as inputs for other problems, including the adverse drug event and case-control projects that we have worked on here at Cloudera. We hope that you can join us at Strata Rx next week for the conversation about how to effect change in healthcare via the effective, meaningful use of data.