Author Archives: Tom White

Large-Scale Health Data Analytics with OHDSI

Categories: CDH Data Science

Data analytics is increasingly being brought to bear to treat human disease, but as more and more health data is stored in computer databases, one significant challenge is how to perform analyses across these disparate databases. In this post I take a look at the Observational Health Data Sciences and Informatics (or OHDSI, pronounced “Odyssey”) program that was formed to address this challenge, and which today accounts for 1.26 billion patient records collectively stored across 64 databases in 17 countries.

Read more

Understanding how Deep Learning learns to play SET®

Categories: CDH Cloudera Data Science Workbench Data Science

In the past few years, deep learning has seen incredible success in image recognition applications. In this post I examine how to train a convolutional neural network to recognize playing card images from a game called SET®, explore the structure of the model to get some insight into what it is “seeing”, and present a webcam application that uses the deployed model in a near-realtime setting.

SET is a card game where the objective is to find triples of cards,

Read more

Hail: Scalable Genomics Analysis with Apache Spark

Categories: CDH Data Science Spark

Technology-focused discussions about genomics usually highlight the huge growth in DNA sequencing since the beginning of the century, growth that has outpaced Moore’s law and resulted in the $1000 genome. However, future growth is projected to be even more dramatic. In the paper “Big Data: Astronomical or Genomical?”, the authors say it is estimated that “between 100 million and as many as 2 billion human genomes could be sequenced by 2025”,

Read more

How-to: Write Apache Hadoop Applications on OpenShift with Kite SDK

Categories: Cloud Hadoop How-to Kite SDK

The combination of OpenShift and Kite SDK turns out to be an effective one for developing and testing Apache Hadoop applications.

At Cloudera, our engineers develop a variety of applications on top of Hadoop to solve our own data needs (here and here). More recently, we’ve started to look at streamlining our development process by using a PaaS (Platform-as-a-Service) for some of these applications. Having single-click deployment and updates to consistent development environments lets us onboard new developers more quickly,

Read more