When most people first hear about data science, it’s usually in the context of how prominent web companies work with very large data sets in order to predict clickthrough rates, make personalized recommendations, or analyze UI experiments. The solutions to these problems require expertise in statistics and machine learning, and so there is a general perception that data science is intimately tied to these fields. However, in my conversations at academic conferences and with Cloudera customers, I have found that many kinds of scientists, such as astronomers, geneticists, and geophysicists, are working with very large data sets in order to build models that do not involve statistics or machine learning, and that these scientists encounter data challenges that would be familiar to data scientists at Facebook, Twitter, and LinkedIn.
The Practice of Data Science
The term “data science” has been subject to criticism on the grounds that it doesn’t mean anything, e.g., “What science doesn’t involve data?” or “Isn’t data science a rebranding of statistics?” The source of this criticism could be that data science is not a single discipline, but rather a set of techniques used by many scientists to solve problems across a wide array of scientific fields. As DJ Patil wrote in his excellent overview of building data science teams, the key trait of all data scientists is the understanding “that the heavy lifting of [data] cleanup and preparation isn’t something that gets in the way of solving the problem: it is the problem.”
I have found a few more characteristics that apply to the work of data scientists, regardless of their field of research:
- Inverse problems. Not every data scientist is a statistician, but all data scientists are interested in extracting information about complex systems from observed data, and so we can say that data science is related to the study of inverse problems. Inverse problems arise in almost every branch of science, including medical imaging, remote sensing, and astronomy. We can also think of DNA sequencing as an inverse problem, in which the genome is the underlying model that we wish to reconstruct from a collection of observed DNA fragments. Real-world inverse problems are often ill-posed or ill-conditioned, which means that scientists need substantive expertise in the field to choose reasonable regularization conditions that make the problem solvable.
- Data sets that have a rich set of relationships between observations. We might think of this as a kind of Metcalfe’s Law for data sets, where the value of a data set increases nonlinearly with each additional observation. For example, a single web page doesn’t have very much value, but 128 billion web pages can be used to build a search engine. A DNA fragment in isolation isn’t very useful, but millions of them can be combined to sequence a genome. A single adverse drug event could have any number of explanations, but millions of them can be processed to detect suspicious drug interactions. In each of these examples, the individual records have rich relationships that enhance the value of the data set as a whole.
- Open-source software tools with an emphasis on data visualization. One indicator that a research area is full of data scientists is an active community of open source developers. The R Project is a widely known and used toolset that cuts across a variety of disciplines, and has even been used as a basis for specialized projects like Bioconductor. Astronomers have been using tools like AIPS for processing data from radio telescopes and IRAF for data from optical telescopes for more than 30 years. Bowtie is an open source project for performing very fast DNA sequence alignment, and the Crossbow Project combines Bowtie with Apache Hadoop for distributed sequence alignment processing.
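To make the regularization point in the first item above concrete, here is a minimal sketch of one common choice, Tikhonov (ridge) regularization, applied to a tiny ill-conditioned problem. The data, the two-parameter model, and the function names are made up for illustration; real inverse problems in seismology or genomics are far larger and use domain-specific constraints.

```python
def solve_2x2(matrix, rhs):
    """Solve a 2x2 linear system with Cramer's rule."""
    (a, b), (c, d) = matrix
    det = a * d - b * c
    return [(rhs[0] * d - b * rhs[1]) / det, (a * rhs[1] - c * rhs[0]) / det]


def tikhonov_fit(rows, observed, lam):
    """Minimize ||A x - y||^2 + lam * ||x||^2 for a two-parameter model x."""
    # Normal equations: (A^T A + lam * I) x = A^T y
    ata = [[sum(r[i] * r[j] for r in rows) + (lam if i == j else 0.0)
            for j in range(2)] for i in range(2)]
    aty = [sum(r[i] * y for r, y in zip(rows, observed)) for i in range(2)]
    return solve_2x2(ata, aty)


# Toy forward model (hypothetical): the two columns are almost identical,
# so the observations barely distinguish the two parameters.
rows = [[1.0, 1.0], [1.0, 1.0001], [1.0, 0.9999]]
# Observations generated from the true model [1.0, 1.0], plus small noise.
observed = [2.0 + 0.01, 2.0001 - 0.01, 1.9999 + 0.01]

unregularized = tikhonov_fit(rows, observed, lam=0.0)
regularized = tikhonov_fit(rows, observed, lam=0.01)
print("lam=0.0 :", unregularized)
print("lam=0.01:", regularized)
```

Without the penalty term, the small noise is amplified into an estimate that is far from the true model; with a modest penalty, the estimate stays close to it. Choosing the form and strength of that penalty is where field expertise comes in.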
We can use the term “data scientist” as a specialization of “scientist” in the same way that we use the term “theoretical physicist” as a specialization of “physicist.” Just as there are theoretical physicists who work within the various subdomains of physics, such as cosmology, optics, or particle physics, there are data scientists at work within every branch of science.
Data Scientists Who Find Oil: Reflection Seismology
Reflection seismology is a set of techniques for solving a classic inverse problem: given a collection of seismograms and associated metadata, generate an image of the subsurface of the Earth that generated those seismograms. These techniques are primarily used by exploration and production companies in order to locate oil and natural gas deposits, although they were also used to identify the location of the Chicxulub Crater that has been linked to the extinction of the dinosaurs.
Seismic data is collected by surveying an area that is suspected to contain oil or gas deposits. Seismic waves are generated from a source, which is usually an air gun in marine surveys or a machine called a Vibroseis for land-based surveys. The seismic waves reflect back to the surface at the interfaces between rock layers, where an array of receivers records the amplitude and arrival times of the reflected waves as a time series, which is called a trace. The data that is generated from a single source is called a shot or shot record, and a modern seismic survey may consist of tens of thousands of shots and multiple terabytes of trace data.
In order to solve the inversion problem, we take advantage of the geometric relationships between traces that have different source and receiver locations but a common depth point (also known as a common midpoint). By comparing the time it took for the seismic waves to travel between the different source and receiver locations and experimenting with different velocity models for the waves moving through the rock, we can estimate the depth of the common subsurface point from which the waves reflected. By aggregating a large number of these estimates, we can construct a complete image of the subsurface. As we increase the density and the number of traces, we can create higher quality images that improve our understanding of the subsurface geology.
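The idea of trying different velocity models against traces that share a common midpoint can be sketched in a few lines. This is a deliberately simplified illustration, not a production algorithm: it assumes a single flat reflector and a constant velocity, so the two-way travel time at offset x follows t(x)^2 = t0^2 + (x / v)^2, where t0 is the zero-offset time. The function names, offsets, and candidate velocities are all hypothetical.

```python
import math


def travel_time(offset_m, depth_m, velocity_ms):
    """Two-way travel time for a flat reflector and constant velocity."""
    return math.sqrt((2.0 * depth_m / velocity_ms) ** 2
                     + (offset_m / velocity_ms) ** 2)


def estimate_depth(offsets_m, times_s, candidate_velocities):
    """Pick the velocity whose implied zero-offset times agree best."""
    best = None  # (spread, velocity, mean zero-offset time)
    for v in candidate_velocities:
        t0_squared = [t * t - (x / v) ** 2 for x, t in zip(offsets_m, times_s)]
        if min(t0_squared) <= 0.0:
            continue  # this velocity is too slow to explain the arrivals
        t0s = [math.sqrt(s) for s in t0_squared]
        mean_t0 = sum(t0s) / len(t0s)
        # With the right velocity, every trace implies the same t0.
        spread = sum((t0 - mean_t0) ** 2 for t0 in t0s)
        if best is None or spread < best[0]:
            best = (spread, v, mean_t0)
    if best is None:
        raise ValueError("no candidate velocity fits the travel times")
    _, velocity, t0 = best
    return velocity, velocity * t0 / 2.0


# Synthetic gather: reflector at 1500 m, rock velocity 2500 m/s (made-up values).
offsets = [200.0, 600.0, 1000.0, 1400.0, 1800.0]
times = [travel_time(x, 1500.0, 2500.0) for x in offsets]

velocity, depth = estimate_depth(offsets, times, range(1500, 4001, 50))
print(velocity, depth)
```

Each additional trace in the gather adds another constraint on the velocity, which is one way to see why denser surveys yield better images.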
Additionally, seismic data processing has a long history of using open-source software tools that were initially developed in academia and were then adopted and enhanced by private companies. Both the Seismic Unix project, from the Colorado School of Mines, and SEPlib, from Stanford University, have their roots in tools created by graduate students in the late 1970s and early 1980s. Even the most popular commercial toolkit for seismic data processing, SeisSpace, is built on top of an open source foundation, the JavaSeis project.
Hadoop and Seismic Data Processing
Geophysicists have been pushing the limits of high-performance computing for more than three decades; they were early adopters of the first Cray supercomputers as well as the massively parallel Connection Machine. Today, the most challenging seismic data processing tasks are performed on custom compute clusters that take advantage of multiple GPUs per node, high-performance networking, and storage systems built for fast data access.
The data volume of modern seismic surveys and the performance requirements of the compute clusters mean that data from seismic surveys that are not undergoing active processing are often stored offsite on tape. If a geophysicist wants to re-examine an older survey, or study the effectiveness of a new processing technique, they must file a request to move the data into active storage and then consume precious cluster resources in order to process the data.
Fortunately, Apache Hadoop has emerged as a cheap and reliable online storage system for petabyte-scale data sets. Even better, we can move many of the most I/O-intensive steps in seismic data processing into the Hadoop cluster itself, thus freeing precious resources in the supercomputer cluster for the most difficult and urgent processing tasks.
Seismic Hadoop is a project that we developed at Cloudera to demonstrate how to store and process seismic data in a Hadoop cluster. It combines Seismic Unix with Crunch, the Java library we developed for creating MapReduce pipelines. Seismic Unix gets its name from the fact that it makes extensive use of Unix pipes in order to construct complex data processing tasks from a set of simple procedures. For example, we might build a pipeline in Seismic Unix that first applies a filter to the trace data, then edits some metadata associated with each trace, and finally sorts the traces by the metadata that we just edited:
sufilter f=10,20,30,40 | suchw key1=gx,cdp key2=offset,gx key3=sx,sx b=1,1 c=1,1 d=1,2 | susort cdp gx
Seismic Hadoop takes this same command and builds a Crunch pipeline that performs the same operations on a data set stored in a Hadoop cluster, replacing the local susort command with a distributed sort across the cluster using MapReduce. Crunch takes care of figuring out how many MapReduce jobs to run and which processing steps are assigned to the map phase and which are assigned to the reduce phase. Seismic Hadoop also takes advantage of Crunch’s support for streaming the output of a MapReduce pipeline back to the client in order to run the utilities for visualizing data that come with Seismic Unix.
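As a rough illustration of the planning step described above, the sketch below splits a Seismic Unix pipeline into stages and treats each susort as the boundary between a map phase and a reduce phase. It is not Seismic Hadoop's code and does not use Crunch's API; parse_pipeline and plan_jobs are hypothetical names, and the splitting assumes no "|" characters appear inside quoted arguments.

```python
import shlex


def parse_pipeline(command):
    """Split a shell-style pipeline into a list of argument lists."""
    return [shlex.split(part) for part in command.split("|") if part.strip()]


def plan_jobs(stages, sort_command="susort"):
    """Group stages into jobs, using each sort as a shuffle boundary."""
    jobs = []
    current = {"map": [], "sort_keys": None, "reduce": []}
    for argv in stages:
        if argv[0] == sort_command:
            if current["sort_keys"] is not None:
                # A second sort needs its own shuffle, so start a new job.
                jobs.append(current)
                current = {"map": [], "sort_keys": None, "reduce": []}
            current["sort_keys"] = argv[1:]
        elif current["sort_keys"] is None:
            current["map"].append(argv)     # runs before the shuffle
        else:
            current["reduce"].append(argv)  # runs after the shuffle
    if current["map"] or current["reduce"] or current["sort_keys"] is not None:
        jobs.append(current)
    return jobs


command = ("sufilter f=10,20,30,40 | "
           "suchw key1=gx,cdp key2=offset,gx key3=sx,sx b=1,1 c=1,1 d=1,2 | "
           "susort cdp gx")
jobs = plan_jobs(parse_pipeline(command))
for job in jobs:
    print([stage[0] for stage in job["map"]], job["sort_keys"],
          [stage[0] for stage in job["reduce"]])
```

For the example pipeline, this yields a single job in which sufilter and suchw run on the map side and the sort keys cdp and gx drive the shuffle.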
Challenges to Solve Together
Talking to a geophysicist is a little bit like seeing into the future: the challenges they face today are the challenges that data scientists in other fields will be facing five years from now. There are two challenges in particular that I would like the broader community of data scientists and Hadoop developers to be thinking about:
- Reproducibility. Geophysicists have developed tools that make it easy to understand and reproduce the entire history of analyses performed on a particular data set. One of the most popular open source seismic processing toolkits, Madagascar, even chose reproducibility.org as its home page. Reproducible research has enormous benefits in terms of data quality, transparency, and education, and all of the tools we develop should be built with reproducibility in mind.
- Dynamic and resource-aware scheduling of jobs on heterogeneous clusters. MR2 and YARN will unleash a Cambrian explosion of data-intensive processing jobs on Hadoop clusters. Workloads that once consisted only of MapReduce jobs will now include MPI jobs, Spark queries, and BSP-style computations. Different jobs will have radically different resource requirements in terms of CPU, memory, disk, and network utilization, and we will need fine-grained resource controls, intelligent defaults, and robust mechanisms for recovering from task failures across all job types.
It is an unbelievably exciting time to be working on these big data problems. Join us and be part of the solution!