Spotlight: How the National Institutes of Health Advances Genomic Research with Big Data

This week, I’d like to shine a spotlight on the innovative work the National Institutes of Health (NIH) is doing with Big Data in genomic research. Understanding DNA structure and function is a data-intensive, complex, and expensive undertaking. Apache Hadoop makes it more affordable and feasible to process, store, and analyze this data, which is why the NIH is embracing the technology. In fact, it has launched a Big Data center of excellence, called Big Data to Knowledge (BD2K), to accelerate innovation in bioinformatics using Big Data, with the ultimate aim of helping us better understand and control various diseases and disorders.

Bob Gourley (a friend of Cloudera’s who wears many hats, including publisher, CTO of Crucial Point LLC, and GigaOm analyst) recently interviewed Dr. Mark Guyer, the deputy director of the NIH’s National Human Genome Research Institute (NHGRI), about the BD2K effort.

Some key highlights from the interview:

  • The ability to generate genomic sequencing data has improved more than a million-fold since the beginning of the Human Genome Project.
  • Goals of BD2K:
    • To enhance the ability to analyze all types and volumes of data;
    • To maximize the value of the growing volume and complexity of biomedical data;
    • To advance the discipline of data science in the community by helping to develop and disseminate innovative analysis methods, tools and techniques.
  • NIH supports widespread data sharing, including across government, as key to rapid progress.

To read the full interview and for greater context, check out Bob Gourley’s blog.

What other industries and lines of work would benefit from this kind of Big Data Center of Excellence? Add your comments and thoughts here!

Karina Babcock is Cloudera’s Customer Programs & Marketing Manager.