I met Matthew in New York City about a year ago. We sat in a private conference room and he told me the story of his pharma startup. A small group of researchers had set out to solve the black-box enigma of certain especially vicious cancers. There are so many cancers that their vision was to focus on the most heinous ones. Fast forward to their recent FDA approval of a “Hail Mary” procedure and treatment methodology for stage-four patients with a particular cancer. Almost in the same second, they were acquired by a big pharma company.
From the outside, and from a business perspective, this is a great success story. However, as they continue finding treatments and understanding the progression of these cancers, they now also need to meet much higher expectations on delivery to market. I remember Matthew’s face showing mixed feelings as he explained how the pressure grew exponentially overnight.
Challenges Ahead
The challenges Matthew and his team are facing center on easy, ad-hoc access to a multitude of data sets of various types and from various sources, and on their ability to deliver confident, data-driven outcomes.
Most of their research data is unstructured and highly varied. For example, it includes patient samples and vitals (blood work, blood pressure, temperature, and more), patient information (age, gender, where they have lived, family situation, and other details), and treatment history, most of which currently exists only in paper documents. In addition, they need to capture and store trial patient interview data, also often found on paper. Looking ahead, the team is considering at-home devices that could collect even more information, directly from patients, on a day-to-day basis. The core challenge is this: how can Matthew, within a much larger organization, enable stakeholders and researchers to quickly access relevant data and analyze diverse data together? How can they share insights quickly and easily, prevent duplication of projects or research efforts, and support continued, expedited collaboration?
In any pharma company, one of the largest data problems is variety, and it has remained unsolved for the last 11 years, because:
- Sample and treatment history data is mostly structured, served by analytics engines that use well-known, standard SQL
- Interview notes, patient information, and treatment history are a mixed set of semi-structured and unstructured data, often accessible only through proprietary, or less well-known, techniques and languages
- Anonymized patient data needs to be matched across a multitude of data sources for a 360-degree understanding
- Combining data from structured, semi-structured, and unstructured analytics services is very challenging, often involving programmatic or manual correlation
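To make that last point concrete, here is a minimal, hypothetical Python sketch (all data, field names, and values are invented for illustration) of what programmatic or manual correlation typically looks like: two engines return separate result sets, and someone must stitch them together by hand on a shared key:

```python
# Hypothetical example: manually correlating results from two separate engines.
# rows_sql stands in for a result set from a SQL analytics engine;
# rows_text stands in for hits returned by a separate text-search engine.
rows_sql = [
    {"patient_id": "P001", "blood_value": 4.2, "treatment": "chemo-A"},
    {"patient_id": "P002", "blood_value": 5.1, "treatment": "chemo-B"},
]
rows_text = [
    {"patient_id": "P001", "note_snippet": "patient reports fatigue"},
    {"patient_id": "P003", "note_snippet": "no adverse reaction"},
]

# Manual join: index one result set by the shared key, then merge by hand.
notes_by_patient = {r["patient_id"]: r["note_snippet"] for r in rows_text}
merged = [
    {**row, "note_snippet": notes_by_patient[row["patient_id"]]}
    for row in rows_sql
    if row["patient_id"] in notes_by_patient
]
print(merged)  # only P001 appears in both result sets
```

Every additional source multiplies this glue code, which is exactly the correlation burden described above.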
In Matthew’s case, he needs to provide researchers with anonymized tumor images, nurses’ notes, DNA and genome sequences, blood values, pre-conditions, and previous treatments or medical history – and soon also sensor measurements, and possibly video or audio data, with the increased use of device technology and telemedicine in medical care. This data needs to be seamlessly joined in the analytics he wants to provide to the researchers he will support. The data challenge is just one dimension: Matthew expressed open concern about how they will meet their – now much larger – organization’s needs with limited resources and without extensive funding for additional IT.
- Scale to provide thousands of researchers frictionless interaction with data

How can Matthew support thousands of medical researchers, medical staff, patent lawyers, auditors, and even patients themselves, with an easily accessible, shared knowledge platform? How can he make it easy to see statistics, and run calculations, on discovered commonalities across structured and unstructured data? How can users drill down, in non-technical ways, to quickly interact with data that explains which correlations seem to matter? How can users quickly pull up historical trails – what results were generated, what steps were involved – and compare them against external text data?

- Innovate on serviceability and optimize utilization

How can Matthew create a data-driven digital collaboration place where users (of different security access levels and different technical skill levels) can investigate and explore various types of data sets together, following their line of thought as they iteratively and interactively drill down for more insights? How will Matthew innovate while in constant firefighting mode, with little energy left to spare?

- Protect data and create trust in providers
At the same time, Matthew knows that his patients’ data needs to be secure, governed, and protected – sometimes anonymized and masked. This is essential to allow participating trial patients to feel safe and confident sharing their sensitive data, as well as for the organization to comply with all necessary regulations, audit, and compliance rules.
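A minimal sketch of the kind of anonymization and masking step implied here (a hypothetical illustration, not a production de-identification pipeline; the salt handling and field names are invented): identifiers are replaced with salted one-way hashes so records can still be joined across sources, while direct identifiers are dropped or coarsened:

```python
import hashlib

SALT = "trial-specific-secret"  # hypothetical; in practice, managed securely

def pseudonymize(patient_id: str) -> str:
    """One-way hash so the same patient can be linked across data sets
    without exposing the real identifier."""
    return hashlib.sha256((SALT + patient_id).encode()).hexdigest()[:12]

def mask_record(record: dict) -> dict:
    """Drop direct identifiers; coarsen or keep research-relevant fields."""
    return {
        "pseudo_id": pseudonymize(record["patient_id"]),
        "age_band": f"{(record['age'] // 10) * 10}s",  # e.g. 47 -> '40s'
        "blood_value": record["blood_value"],
    }

record = {"patient_id": "P001", "name": "Jane Doe", "age": 47, "blood_value": 4.2}
print(mask_record(record))  # name is gone; the ID is now a stable pseudonym
```

Because the pseudonym is deterministic, two anonymized data sets derived from the same patient still join on `pseudo_id` – the 360-degree matching requirement mentioned earlier – without revealing who the patient is.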
Until now, this was really hard to do. Traditional systems are siloed, hard to access, and often structured to serve traditional reports. Legacy systems do not scale with the new data needs. So, what system can handle these kinds of challenges? How could Matthew serve all this data, together, in an easily consumable way, without losing focus on his core business: finding a cure for cancer?
The Vision of a Discovery Data Warehouse
A Discovery Data Warehouse is a modern data warehouse that easily allows existing reports and structured data to be augmented with new unstructured data types, and that can flexibly scale with volume and compute needs. New data types such as images, text, DNA sequence data, audio files, and more need to be joined into existing reports and used as relevance-based filters on the structured data. Typically these disparate data sets live in different silos, but through a Discovery Data Warehouse they can be woven together in a common SQL query, forming curated data marts for further easy, ad-hoc exploration.
A Discovery Data Warehouse is cloud-agnostic. Data may live anywhere or be provided by a 3rd party in their chosen, specific cloud environment. Access to valuable data should not be hindered by technology. Hence, it should seamlessly live in a hybrid cloud world, as not all data is approved to move to the cloud.
This type of data warehouse allows queries to federate out to other stores and query engines that provide unstructured data access and matching, and allows the result set to be joined with existing tables. This enables all data to be treated as a single resource, accessed from a single query, without additional programming or duct-tape integrations, and without manually correlating or merging result sets to arrive at a single answer. It is all done using a simple SQL query, a familiar language for any data professional. The Discovery Data Warehouse makes data types ordinarily outside a SQL query engine’s capability easily accessible, thereby extending the value of the language and the reach of the data professional – which is invaluable to organizations such as Matthew’s.
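As a rough illustration of the idea (not Cloudera’s actual engine; the tables, columns, and values are invented), this Python/SQLite sketch mimics federation: results from an unstructured text-match engine are exposed as an ordinary queryable table, so a single SQL statement joins them with structured data instead of merging result sets in application code:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE blood_values (patient_id TEXT, value REAL)")
conn.execute("CREATE TABLE note_matches (patient_id TEXT, relevance REAL)")
conn.executemany("INSERT INTO blood_values VALUES (?, ?)",
                 [("P001", 4.2), ("P002", 5.1), ("P003", 3.9)])
# Imagine these rows were produced by a text-search engine matching
# nurses' notes for "fatigue"; federation makes them joinable as a table.
conn.executemany("INSERT INTO note_matches VALUES (?, ?)",
                 [("P001", 0.92), ("P003", 0.40)])

# One SQL query answers the combined question: structured blood values,
# filtered by relevance scores from the unstructured text match.
rows = conn.execute("""
    SELECT b.patient_id, b.value, n.relevance
    FROM blood_values b
    JOIN note_matches n ON n.patient_id = b.patient_id
    WHERE n.relevance > 0.5
""").fetchall()
print(rows)  # [('P001', 4.2, 0.92)]
```

Contrast this with the hand-merging shown earlier: the join logic lives in one declarative query, so adding another source means adding another joinable table, not another layer of glue code.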
As Augmented Analytics is on the rise, a Discovery Data Warehouse is key not only for pharmaceuticals but for any business that heavily relies on unstructured data: healthcare providers, insurance, government, media, organizations that depend heavily on ML and risk modeling, as well as legal/law enforcement and a variety of auditing services.
A Discovery Data Warehouse is defined by:
- An integrated way to ingest vast, voluminous, and high-speed data – data will come from various sources at various paces and volumes. The Discovery Data Warehouse needs out-of-the-box, flexible ingest frameworks that suit the data variety and volume at hand.
- A rapid way to make vast amounts of text data available to non-technical users – data needs to be easily queried through natural language and support line-of-thought interaction, ad hoc. Not all users are programmatically savvy, yet data needs to be accessible to all users who depend on it.
- The ability to easily combine insights over structured and unstructured data (tables and text foremost, but also voice, images, genome, etc.) – structured SQL engines are not enough to serve the new needs of combining data sets. The Data Warehouse needs to seamlessly extend into the power of other query frameworks and engines.
- A way to meet SLAs and speed up time to market for ad-hoc investigations, experiments, and short-lived projects, with isolated compute and storage resources – the queries are exploratory in nature; once an experiment is started, you may not know all the resources needed. You also don’t want to impact existing production SLAs. Hence, an architecture where compute and storage are separated, and where compute allocations are isolated from each other, is necessary for IT to serve Lines of Business well.
- Auto-scaling capabilities, as one can’t predict in advance what projects or what data sets will take off and become really popular, attracting more users and more experiments – this builds onto cloud-native technologies such as Kubernetes and containerization.
- An out-of-the-box visualization and dashboard service, integrated across multiple compute options, to expedite cross-organization collaboration with a consistent visual language
- Easy-to-generate audit reports and data lineage views – no matter where data lives, through what query engine it was accessed, or who accessed it and when
- A quick and easy way to publish results to others, to accelerate results through active collaboration, even across organizational borders
- An integrated unstructured data match engine to find similar documents, reports, articles, and patents, using natural language and/or in combination with SQL
- A subscription mechanism that notifies researchers when new data becomes available tagged to an ongoing trial, project, or experiment
- A recommendation engine for data sets, to expedite data discovery for researchers interested in a topic
- Security on all possible levels, including but not limited to: data files, tables, rows, indexes, index documents, and attributes, with user authentication, authorization, masking, and data encryption at rest as well as in motion. Most importantly, one shared security model across all data sets and the compute accessing the data
- Ease of deployment and procurement, in hybrid/multi-cloud environments for fast availability of critical resources
- An open API to allow any 3rd party data tooling, especially for ingesting industry-specific, special data formats, and for consumption, as researchers may have their own favorite tools
- A seamless way to apply Machine Learning (ML) to the same data sets, without switching systems and copying data into and out of additional, possibly proprietary formats. This is often the next step after Discovery Data Warehouse based analytics: using ML to define new data sets of interest, and then using those to augment existing reports. ML capabilities and Discovery Data Warehousing go hand in hand.
In Matthew’s pharma above, a self-service Discovery Data Warehouse would help his team tremendously. His team would not have to build the know-how and skills to programmatically integrate different types of data access engines, or rely on external resources. It would enable faster experimentation with easy, protected, and governed access to a variety of data. It would also allow data professionals to self-serve and collaborate through easy dashboarding. Using an out-of-the-box visual layer across data and compute to accelerate insight among non-technical consumers will expedite their research confidence in, and time to market of, new treatments. A Discovery Data Warehouse will allow them to serve their LoB needs quickly, with little headache around provisioning, maintenance, and future scale, while containing costs and meeting SLAs.
Matthew’s company is not alone in this situation. Pretty much any Pharma could use the same solution for their knowledge and discovery layer, and to help fuel more accurate Machine Learning models. The need goes beyond Pharma: any research department in any industry could also take advantage.
We believe Discovery Data Warehousing will be the leading data strategy trend in industries with heavy reliance on unstructured data, including agriculture, aerospace, medicine, legal, law enforcement, quality and audit regulatory services, or government, for years to come.
For Matthew, and our other customers struggling with the challenge of making all their data work for their discovery needs, Cloudera has delivered, and continues to invest in, one of the first fully integrated Discovery Data Warehouses on the market. Stay tuned for the next blog post, which will dive deeper into this topic! Meanwhile, you might want to explore CDP Data Visualization and Cloudera Data Warehouse, serving as a modern data warehouse for thousands of customers around the world and available to try today. As part of my research and work for this blog, I especially want to thank Merv Adrian, Sanjeev Mohan, Mark Ramsey, and Phillip Radley.