The U.S. Census Enters the Digital Age with Cloudera

2020 brings a new decade and, for the U.S. Census Bureau, a new challenge. The bureau is the leading provider of demographic and economic data for the federal government and the nation. Its largest initiative is the U.S. Census, which is conducted every 10 years and counts every resident of the United States.

For the first time in U.S. history, the census will be conducted primarily online instead of by mail. The 2020 census will count around 330 million people in more than 140 million housing units, generating an unprecedented amount of data that must be collected, stored, secured and interpreted.

To provide the processing capacity needed, bureau leadership established the Census Enterprise Data Lake (EDL) initiative. As the chosen data platform for the 2020 census, Cloudera will help amass that data and derive actionable insights from it. Open-source technology and high-performance cloud infrastructure will transform how census data is processed, along with the value and impact it delivers. These new approaches are designed to streamline data collection, reduce the margin of error and improve quality control.

The data lake “provides a centralized repository to consolidate operational paradata, response data, and cost data from multiple modes of data collection. It provides a single place to analyze all operational data and make informed decisions during operations,” Census CIO Kevin Smith recently noted.

The data platform gives the Census Bureau adaptability and integrates easily with the dozens of other technology vendors involved in producing the census. It draws on Cloudera’s entire technology stack and full range of professional service offerings. Cloudera DataFlow (CDF®) will pull in data and analyze it in real time, while Hortonworks Data Platform (HDP®) serves as the data lake and repository for the data collected.
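As a rough illustration of what “a single place to analyze all operational data” means in practice, the toy sketch below consolidates paradata, response and cost records from several collection modes into one in-memory repository. It is not how CDF or HDP work internally. The `Record` and `DataLake` names, the mode labels and every sample value are hypothetical and invented purely for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    kind: str      # "paradata", "response" or "cost" (the categories named in the quote)
    mode: str      # hypothetical collection-mode label, e.g. "online", "mail", "phone"
    payload: dict  # hypothetical record contents, made up for this sketch


class DataLake:
    """Toy in-memory stand-in for a centralized repository (an assumption,
    not a model of any real Cloudera component)."""

    def __init__(self):
        self._records = []

    def ingest(self, record):
        # Every record lands in the same store, whatever its kind or mode.
        self._records.append(record)

    def count_by(self, field):
        # Count stored records grouped by one attribute ("kind" or "mode").
        counts = defaultdict(int)
        for record in self._records:
            counts[getattr(record, field)] += 1
        return dict(counts)


def build_example_lake():
    # Sample values are invented; they are not census data.
    lake = DataLake()
    lake.ingest(Record("response", "online", {"household_size": 3}))
    lake.ingest(Record("response", "mail", {"household_size": 2}))
    lake.ingest(Record("paradata", "online", {"seconds_on_form": 412}))
    lake.ingest(Record("cost", "phone", {"usd": 1.75}))
    return lake
```

Because everything sits in one store, a single query such as `build_example_lake().count_by("mode")` can look across all record types at once, with no need to join separate per-mode systems.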

The EDL initiative reduces costs and improves efficiency for the Census Bureau, with advantages such as decreased redundancy, faster corrections and improved analysis of responses. It’s a boon for the American public, too, as participants spend less time responding to repetitive questions or fixing errors thanks to the electronic interface that allows them to reuse answers. The EDL also provides enterprise-level security, privacy and policy controls, safeguarding sensitive data and code. 

Finally, the introduction of EDL technology facilitates easier information-sharing within the bureau and with officials at other government agencies, giving data scientists access to census-based insights and better informing future decisions across organizations.

For Smith, the kickoff of the census marks an important milestone, realizing goals he first laid out early in his tenure as Census CIO.

One of his top priorities when arriving in 2016 was to “streamline the way we collect and analyze and disseminate data into secure platforms that offer the flexibility for the end-user … but at the same time secure the data and the platforms into a common maintainable format,” he said in a Federal Times interview. Noting that the bureau’s work spans economic data, demographic data and many different surveys, he added that “a lot of stuff we’re learning and setting up for the 2020 census, based on the scale of the census, can be and will be reused within the rest of the other surveys to collect data in a secure fashion and to maintainable and flexible platforms.”

As the 2020 U.S. Census gets under way, it’s breaking new ground in what’s technologically possible—and it’s also creating a foundation for easier governmentwide adoption of data tools that amplify capabilities and accelerate the mission.

You can read more about the Census Enterprise Data Lake and Cloudera’s role in this recent MeriTalk article or on our customer page.

Shaun Bierweiler
VP and GM, Cloudera Government Solutions