Across the federal government, agencies are struggling to identify, organize, analyze, and act on troves of data. It’s a problem that leaders are working actively to tackle, but they’re racing against immeasurable volumes of data that are generated continuously, in stores both known and unknown.
At the Internal Revenue Service, decades’ worth of data exceeds even the most cutting-edge processing capabilities. By more effectively leveraging its petabytes of current and historical data, the IRS is working to stave off costly fraud and waste, more efficiently deliver on fundamental missions, and better protect taxpayers, including from risks such as identity theft.
Key to harnessing the power of all that data: high-powered artificial intelligence tools, machine learning capabilities and applications capable of rapidly exposing attempts at fraud or identity theft.
The IRS has spent more than a decade working to combat high-cost hazards, including launching a collaborative Identity Theft Tax Refund Fraud Information Sharing and Analysis Center (ISAC) pilot for the 2017 tax-filing season, advancing authentication tools and taking proactive steps in fighting business identity theft.
“The IRS continues to evaluate and expand on successful fraud detection initiatives, while also piloting new fraud detection initiatives,” according to a July 2020 report from the Treasury Inspector General for Tax Administration. “The actions taken on the part of the IRS have been extremely effective in addressing the identity theft epidemic and reducing its negative impact on tax administration.”
Now, the agency is turning a corner in putting its vast quantities of data to work, collaborating with enterprise data engineers in industry to use those data troves to better protect taxpayers.
Pairing High-Tech Keys to Unlock the Power of Data
Through a recent collaboration between Cloudera and NVIDIA, these engineers are tackling massive IRS data bottlenecks by integrating the Cloudera Data Platform cloud infrastructure with NVIDIA’s RAPIDS libraries for Apache Spark 3.0. The combination of the Cloudera cloud infrastructure and NVIDIA-Certified Systems (industry-standard servers accelerated with NVIDIA GPUs) enables faster, easier implementation of AI and machine learning at scale.
In turn, by developing workloads that use Apache Spark and graph analysis, the engineering teams created immense graphs with nodes and edges, connecting individuals to institutions and, subsequently, to larger entities spanning years and decades. AI “bots” and ML algorithms can quickly and repeatedly analyze these graphs to root out anomalies in behavior or patterns that signal potential fraud.
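To make the graph idea concrete, here is a minimal sketch in plain Python. It is not the IRS's method or data: the edge list, the names `build_graph` and `flag_anomalies`, and the "three times the median" threshold are all hypothetical, chosen only to show how linking individuals to institutions can surface an unusual pattern.

```python
from collections import defaultdict
from statistics import median

# Hypothetical edges linking an individual to an institution,
# for example a filer and the bank account receiving a refund.
EDGES = [
    ("filer_1", "account_A"),
    ("filer_2", "account_B"),
    ("filer_3", "account_C"),
    ("filer_4", "account_D"),
    ("filer_5", "account_D"),
    ("filer_6", "account_D"),
    ("filer_7", "account_D"),
    ("filer_8", "account_D"),
]


def build_graph(edges):
    """Map each institution node to the set of individuals connected to it."""
    graph = defaultdict(set)
    for person, institution in edges:
        graph[institution].add(person)
    return graph


def flag_anomalies(graph, factor=3):
    """Return institutions whose connection count exceeds factor x the median.

    The threshold rule is an illustrative assumption, not a real fraud model.
    """
    degrees = {node: len(people) for node, people in graph.items()}
    typical = median(degrees.values())
    return sorted(node for node, d in degrees.items() if d > factor * typical)


if __name__ == "__main__":
    print(flag_anomalies(build_graph(EDGES)))
```

In this toy data, one account receives links from five filers while the others receive one each, so it is the only node flagged. A production workload would run comparable logic over far larger graphs on Spark rather than in a single Python process.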
The result: Data sets of a magnitude that used to take weeks or months to stitch together and analyze (if the IRS could do so on existing machines at all) can now be processed in days, hours, or even minutes. Recent testing on the project demonstrated 10 times faster engineering and data science workflows and a 50 percent reduction in infrastructure costs.
“We need to be able to make accurate decisions at speed while utilizing vast swathes of data. That challenge is ever-evolving as data volumes and velocities continue to increase,” said Joe Ansaldi, IRS/Research Applied Analytics & Statistics Division (RAAS)/Technical Branch Chief. “The Cloudera and NVIDIA integration will empower us to use data-driven insights to power mission-critical use cases such as fraud detection…simply by adding GPUs to mainstream big data servers.”
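On the software side, "adding GPUs" to a Spark 3 cluster is typically a matter of configuration rather than rewriting jobs. The sketch below assembles a `spark-submit` command with settings commonly used to enable the RAPIDS Accelerator plugin. The values, the helper name `spark_submit_args`, and the job file name are assumptions for illustration; they do not describe the IRS deployment, and the accelerator jar must separately be available to Spark.

```python
# Illustrative settings only; the values are assumptions, not a tuned deployment.
RAPIDS_CONF = {
    # Load the RAPIDS Accelerator SQL plugin into Spark.
    "spark.plugins": "com.nvidia.spark.SQLPlugin",
    # Allow supported SQL/DataFrame operations to run on the GPU.
    "spark.rapids.sql.enabled": "true",
    # Request one GPU per executor via Spark 3 resource scheduling.
    "spark.executor.resource.gpu.amount": "1",
    # Let four tasks share each executor's GPU.
    "spark.task.resource.gpu.amount": "0.25",
}


def spark_submit_args(conf, app):
    """Build a spark-submit argument list from a dict of Spark settings."""
    args = ["spark-submit"]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    args.append(app)
    return args


if __name__ == "__main__":
    # "fraud_graph_job.py" is a hypothetical application name.
    print(" ".join(spark_submit_args(RAPIDS_CONF, "fraud_graph_job.py")))
```

Because the settings live in configuration, the same job can in principle run with or without GPU acceleration by toggling `spark.rapids.sql.enabled`.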
With a shared, broadly beneficial goal of detecting fraudulent tax behavior and shutting down misuse of the system, Cloudera and NVIDIA are helping the IRS transform its volumes of data into actionable tools that better safeguard the American public, the agency’s constituents, and its critical missions. In the process, the collaboration, including the IRS’s role as a design partner, is uncovering new functionalities of tools and technologies and re-engineering software, architectures, approaches, and timelines.