Strategies for Exploiting Large-scale Data in the Federal Government

This is a guest post by Bob Gourley (@bobgourley), a former Defense Intelligence Agency (DIA) CTO.

Like enterprises everywhere, the federal government is challenged with issues of overwhelming data. Thanks to a mature Apache Software Foundation suite of tools and a strong ecosystem around large-scale data storage and analytical capabilities, these challenges are actually turning into tremendous opportunities.

The following characterizes current federal approaches to working with complex data:

  • Federal IT leaders are increasingly sharing lessons learned across agencies. But approaches vary from agency to agency. That’s to be expected; the different agencies pursue different missions and have different problems to solve. Each evaluates the tools that best address its requirements.
  • Nevertheless, federal thought leaders across all agencies are confronted with more data from more sources, and a need for more powerful analytic capabilities, in order to optimize support to their individual missions.
  • For some agencies, mission requirements also have temporal aspects: large-scale distributed analysis over large data sets is often expected to return results almost instantly.
  • Most agencies face challenges that involve combining multiple data sets – some structured, some complex – in order to answer mission questions. For national security missions, this frequently requires combining streaming data with previously-captured records.
  • Federal IT leaders are increasingly seeking automated tools, more advanced models and means of leveraging commodity hardware and open source software to conduct distributed analysis over distributed data stores.
  • Some particularly forward-thinking executives in government are considering ways of enhancing the ability of citizens to contribute to government understanding by use of crowd-sourcing type models. As these concepts move forward additional opportunities in large-scale data analysis will no doubt arise.
  • There is a growing consensus in the federal space that design of next-generation data management architectures must be treated as a discipline. The entire federal community needs an advanced body of knowledge and best practices for the design and use of large-scale, distributed data analysis systems.
  • CIOs in the federal space, like CIOs everywhere, appreciate open source, but they demand commercially supported open source before fielding production systems.
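The fusion pattern in the fourth bullet, combining streaming data with previously captured records, can be sketched in a few lines of plain Python. The record fields, identifiers and sample data below are hypothetical illustrations, not any actual federal dataset:

```python
# Minimal sketch: enrich streaming events with previously-captured records.
# All field names and sample values here are hypothetical.

# Previously-captured (batch) records, keyed by an entity identifier.
archive = {
    "vessel-104": {"flag": "PA", "last_port": "Rotterdam"},
    "vessel-233": {"flag": "LR", "last_port": "Singapore"},
}

def enrich(event):
    """Join one streaming event against its archived record, if any."""
    history = archive.get(event["id"], {})
    return {**event, **history}

# A few incoming stream events; the second has no archived history.
stream = [
    {"id": "vessel-104", "lat": 36.1, "lon": -5.4},
    {"id": "vessel-999", "lat": 25.8, "lon": -80.1},
]

for evt in stream:
    print(enrich(evt))
```

In a Hadoop deployment, the archive side of this join would typically live in HDFS and the join itself would run inside a MapReduce job rather than in a single-process loop, but the logical operation is the same.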

These observations call out for approaches like Cloudera’s Distribution for Apache Hadoop (CDH). By leveraging this commercially supported open source analysis platform, federal enterprises can avail themselves of the most capable analytical technology in industry today, and they can do so with zero barriers to entry. By building new data management solutions on low-cost commodity hardware and CDH, enterprises know they are using the most up-to-date Apache source code, with critical fixes and many additional tools and features from the community included. The most popular projects that complement the Hadoop core are also bundled, and the entire distribution is tested, maintained, and supported the way CIOs expect quality production software to be.

What solutions can we expect federal thought leaders to provide using Hadoop-based distributed analysis systems? I believe we are all about to be surprised by solutions we could never have predicted. But for the sake of dialog, we should look at the fertile, high-quality data stores currently in the federal government and then consider the types of solutions being fielded in the most forward-leaning portions of our economy. That may help us anticipate some of the coming solutions.

The government has many data stores. Consider, for example, government data on weather, climate, the environment, pollution, health, quality of life, the economy, natural resources, energy and transportation. Data on those topics exist in many stores across the federal enterprise. The government also has years of information from research conducted at academic institutions across the nation. Imagine the conclusions that could be drawn from distributed analysis over data stores like these. Imagine the benefits to our citizens’ health, commodity prices, education and employment from better analysis over these data stores.
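As a concrete illustration of distributed analysis over a datastore like these, a Hadoop Streaming job could average temperatures by station across a large weather archive. The comma-separated layout, field positions and station IDs below are assumptions for illustration only:

```python
# Hadoop Streaming-style mapper and reducer, runnable locally as:
#   cat weather.csv | python job.py map | sort | python job.py reduce
# Each input line is assumed to look like: station_id,date,temp_f
import sys

def mapper(lines):
    """Emit station_id<TAB>temperature pairs, one per input record."""
    for line in lines:
        station, _date, temp = line.strip().split(",")
        yield f"{station}\t{temp}"

def reducer(lines):
    """Average temperatures per station (input must be sorted by key)."""
    current, total, count = None, 0.0, 0
    for line in lines:
        station, temp = line.strip().split("\t")
        if station != current:
            if current is not None:
                yield f"{current}\t{total / count:.1f}"
            current, total, count = station, 0.0, 0
        total += float(temp)
        count += 1
    if current is not None:
        yield f"{current}\t{total / count:.1f}"

if __name__ == "__main__" and len(sys.argv) > 1:
    step = mapper if sys.argv[1] == "map" else reducer
    for out in step(sys.stdin):
        print(out)
```

Hadoop Streaming runs exactly this stdin-to-stdout contract at scale: the framework shards the input across mappers, sorts by key, and feeds each reducer its keys in order, so the same script moves from a laptop pipeline to a cluster without modification.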

Now think through the powerful new data analysis solutions being fielded in the commercial sector. Think of Facebook and its 620 million users, and the real-time data architecture supporting them. Think of Twitter and the ability to rapidly pull rising trends and search for meaning over the body of interactions there. Think of LinkedIn and the ability to rapidly track changes in status and find connections to the right researchers, thinkers and partners. Think of Groupon and its ability to serve local users with information relevant to their lives. There are solutions today to large data issues in the commercial space that are directly relevant to solutions in the federal space. What is required now is execution, not engineering.

The Cloudera Distribution for Apache Hadoop plus documentation is available for free download at:

And extensive training and educational material is available at:
