This is a guest post by Oliver Guinan, VP Ground Software at Skybox Imaging. Oliver is a 15-year veteran of the internet industry and is responsible for all ground system design, architecture, and implementation at Skybox.
One of the great promises of the big data movement is using networks of ubiquitous sensors to deliver insights about the world around us. Skybox Imaging is attempting to do just that for millions of locations across our planet.
Skybox is developing a low-cost imaging satellite system and a web-accessible big data processing platform that will capture video or images of any location on Earth within a couple of days. The low cost of each satellite opens the possibility of deploying tens of satellites which, operating together, have the potential to image any spot on Earth within an hour.
Skybox satellites are designed to capture light in the harsh environment of outer space. Each satellite captures multiple images of a given spot on Earth. Once the images are transferred from the satellite to the ground, the data needs to be processed and combined to form a single image, similar to those seen within online mapping portals.
With any sensor network, capturing raw data is only the beginning of the story. We at Skybox are building a system to ingest and process the raw data, allowing data scientists and end users to ask arbitrary questions of the data, then publish the answers in an accessible way and at a scale that grows with the number of satellites in orbit. We selected Cloudera to support this deployment.
Processing raw imagery is a complex computer vision task that involves many pixel-level calculations over multiple images. Image Scientists create algorithms in C and C++ to efficiently perform these calculations. Hadoop prefers MapReduce jobs written in Java, so we have developed a proprietary framework called BusBoy to wrap the native algorithms into a standard Hadoop job. This allows our Hadoop engineers to develop efficient storage and publication solutions while our Image Scientists focus on developing better image processing algorithms.
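BusBoy itself is proprietary and not described here, so the following is only a rough sketch of the general idea of wrapping a native program in a Hadoop job. It assumes Hadoop Streaming, which lets a mapper be any script that reads lines on standard input and writes tab-separated key/value records on standard output. The function names and the record layout are hypothetical and chosen for illustration.

```python
import subprocess
import sys


def run_native(command, image_path):
    """Run a native executable on one image and return its stdout, stripped.

    `command` is the argument list for the native tool; the image path is
    appended as the final argument. A non-zero exit status raises
    subprocess.CalledProcessError, which fails the task.
    """
    result = subprocess.run(
        command + [image_path],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def map_lines(lines, command):
    """Yield one 'path<TAB>output' record per non-blank input line.

    Each input line is assumed (for this sketch) to hold a single image path.
    """
    for line in lines:
        image_path = line.strip()
        if not image_path:
            continue  # skip blank input lines
        yield f"{image_path}\t{run_native(command, image_path)}"
```

In a streaming mapper script, the records would be produced with something like `for record in map_lines(sys.stdin, ["./process_image"]): print(record)`, where `./process_image` stands in for a compiled C or C++ algorithm. The appeal of this split is the one described above: the native code stays unaware of Hadoop, and the wrapper stays unaware of the image processing.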
Developing against CDH and using Puppet to manage our deployed extensions and configurations allows Skybox to develop our architecture on our in-house cluster. Once the solution is robust, we have the option to deploy it at scale using Amazon’s EC2 hardware or other scalable computation and storage platforms. We have tested a large number of hardware configurations to validate our scalability assumptions and to determine the right balance between CPU, memory, disk, and network resources. These results inform the purchasing process for our next in-house cluster.
Making all data available on spinning disk allows data scientists to efficiently ask any question of the data. Traditional systems tend to archive older data to tape-based systems, which makes speculative examination of the data prohibitively expensive. The Hadoop ecosystem offers large-scale compute and storage, and Apache Oozie can chain together complex processing jobs that publish their results to accessible, structured storage in Apache Hive and Apache HBase. Together, these allow Skybox to create a sensor network that takes the pulse of the planet 24×7.
About Skybox Imaging
Skybox Imaging is a commercial, remote sensing start-up revolutionizing access to information that describes daily activity on our planet. Founded in 2009 and backed by leading venture firms, the company is designing, manufacturing, and operating the world’s first coordinated constellation of high-resolution microsatellites. With its constellation, Skybox will deliver timely, global imagery and video as well as an analytics platform capable of creating new sources of value from such data. Skybox is headquartered in Mountain View, California, and was named to MIT Technology Review’s “Top 50 Most Innovative Companies” for 2012. For more information, visit www.skyboximaging.com or follow Skybox Imaging on Twitter.