How a modern data platform supports government fraud detection

November 15-21 marks International Fraud Awareness Week – but for many in government, that’s every week. From bogus benefits claims to fraudulent network activity, fraud in all its forms represents a significant threat to government at all levels.

Some experts estimate that the U.S. government loses nearly $150 billion to potential fraud each year, McKinsey & Company reports. Fraud against the government takes many forms, including identity theft, dubious procurement, redundant payments, and payments for services that were never rendered. Furthermore, the same tools that empower cybercrime can drive fraudulent use of public-sector data as well as fraudulent access to government systems.

Technology can help. In financial services, another highly regulated, data-intensive industry, some 80 percent of industry experts say artificial intelligence is helping to reduce fraud. The Association of Certified Fraud Examiners reports that the use of artificial intelligence and machine learning in anti-fraud programs is expected to almost triple in the next two years. To use such tools effectively, though, government organizations need a consolidated data platform: an infrastructure that enables the seamless ingestion and integration of widely varied data, across disparate systems, at speed and scale.

Machine learning algorithms enable fraud detection systems to distinguish between legitimate and fraudulent behaviors. Some of these algorithms are adaptive: they quickly update the model to account for new, previously unseen fraud tactics, which allows detection rules to be adjusted dynamically. This type of framework requires a streaming environment that provides continuous updates, coupled with model performance monitoring to ensure consistent performance.
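
To make the idea of a continuously updated model concrete, here is a minimal sketch in plain Python. It is not the method used by any particular product: it assumes a single numeric feature (a hypothetical payment amount), keeps a running mean and variance with Welford's method, and flags values that sit far from the baseline learned so far. The names `StreamingBaseline`, `score_stream`, and `z_threshold` are illustrative.

```python
import math


class StreamingBaseline:
    """Running mean and variance (Welford's method), updated one event at a time."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def zscore(self, x):
        # With fewer than two observations there is no variance estimate yet.
        if self.n < 2:
            return 0.0
        std = math.sqrt(self.m2 / (self.n - 1))
        return 0.0 if std == 0 else abs(x - self.mean) / std


def score_stream(amounts, z_threshold=3.0):
    """Yield (amount, flagged) pairs, scoring each value before learning from it.

    Flagged values are not folded into the baseline (an assumption of this
    sketch), so a burst of suspicious activity does not redefine "normal".
    """
    baseline = StreamingBaseline()
    for amount in amounts:
        flagged = baseline.zscore(amount) > z_threshold
        if not flagged:
            baseline.update(amount)
        yield amount, flagged
```

A production system would track many features and monitor model quality over time; the point here is only that the baseline adapts with every event rather than being retrained in batch.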

The public sector data challenge

Modernization has been a boon to government. Robust online systems have streamlined interactions and generated a wealth of new data to support mission success and enhanced citizen engagements.

However, this rapid scaling up of data across government agencies brings with it new challenges. Bad actors can potentially access and leverage this information, utilizing the tools and techniques of cybercrime to perpetrate various forms of fraudulent activity.

Robust detection-and-response depends on the ability to spot anomalous activity on the system. An algorithmic approach can reduce the human workload and help agencies to focus their anti-fraud efforts.

Too often, though, legacy systems cannot deliver the needed speed and scalability to make these analytic defenses usable across disparate sources and systems.

For many agencies, 80 percent of the work in support of anomaly detection and fraud prevention goes into routine tasks around data management. Inordinate time and effort are devoted to cleaning and preparing data, resulting in data bottlenecks that impede effective use of anomaly detection tools. A better approach is needed.

A solid foundation for fraud detection

A platform approach offers government entities a solid infrastructure upon which to build their fraud prevention and detection efforts. Cloudera Data Platform (CDP) is a solution that integrates open-source tools with security and cloud compatibility. It enables public sector agencies to tap the power of big data and machine learning by managing volume, velocity, and variety of data, bringing unified governance, data integration, and open-source data management capabilities across all cloud and hybrid-cloud environments.

  • Governance: With a unified data platform, government agencies can apply strict and consistent enterprise-level data security, governance, and control across all environments. As a purely open-source solution supporting CDP, Cloudera’s SDX (Shared Data Experience) provides a critical layer of context by applying centralized, consistent data access services across all platform deployments. Data being prepared for anomaly detection and other uses thus retains its integrity, with provenance and fidelity ensured throughout the data lifecycle.
  • Integration: With CDP, government can get its data out of silos and connect disparate workloads to develop critical apps on a single data management platform. Data can be integrated regardless of its source or format, giving agencies a solid foundation upon which to deploy the critical analytic tools that support fraud detection and prevention.
  • Open & accessible: To support effective anti-fraud measures, data must be not only clean and readable but also readily available. CDP works across private and hybrid cloud environments, and because it is built on open-source capabilities, it is interoperable with a broad range of current and emerging analytic and business intelligence applications. Open-source software likewise helps to future-proof the platform, helping government agencies stay on the cutting edge of innovation.

Fraudulent activity detection

Analyzing historical data is an important strategy for anomaly detection. By modeling normal behavior, we can exploit existing insights to identify deviations.

The modeling process begins with data collection. Here, Cloudera Data Flow is leveraged to build a streaming pipeline which enables the collection, movement, curation, and augmentation of raw data feeds. These feeds are then enriched using external data sources (e.g., telemetry events, asset information, and GeoIP) and cleansed, organized, and prepared for machine learning using Cloudera Data Engineering.
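
The collect, enrich, and cleanse steps can be sketched in a few lines of plain Python. This is only an illustration of the data flow, not Cloudera Data Flow or Cloudera Data Engineering code: the `GEOIP` and `ASSETS` tables, the field names, and the country codes are hypothetical stand-ins for external enrichment sources.

```python
# Hypothetical lookup tables standing in for external GeoIP and asset sources.
GEOIP = {"203.0.113.": "AU", "198.51.100.": "US"}
ASSETS = {"host-17": {"owner": "benefits-portal", "criticality": "high"}}


def geo_lookup(ip):
    """Return a country code for a known address prefix, else None."""
    for prefix, country in GEOIP.items():
        if ip.startswith(prefix):
            return country
    return None


def enrich(event):
    """Return a cleaned copy of a raw event with GeoIP and asset context added."""
    record = {
        "user": event.get("user", "").strip().lower(),
        "ip": event.get("ip", "").strip(),
        "host": event.get("host", "").strip(),
        "amount": float(event.get("amount") or 0),
    }
    record["country"] = geo_lookup(record["ip"])
    record["asset"] = ASSETS.get(record["host"], {})
    return record


def pipeline(raw_events):
    """Lazily enrich events, dropping any that lack a user identifier."""
    for event in raw_events:
        record = enrich(event)
        if record["user"]:
            yield record
```

Because `pipeline` is a generator, records are processed one at a time as they arrive, which mirrors the streaming style described above.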

Once the data has been collected, enriched, and normalized, we can begin building anomalous activity detection models using a wide range of machine learning techniques:

  • Clustering – Normal data points lie close to the centroid of a cluster, while anomalies are outliers that do not belong to any cluster.
  • Nearest neighbor – Normal data points occur in close proximity to one another, while anomalous data points are far from any neighbors.
  • Classification – A model learns from labeled data that includes known normal and abnormal data points.
  • Deep learning – A neural network learns a probabilistic model of normal behavior; normal data points fall in high-probability regions, while abnormal data points fall in low-probability regions.
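
As one illustration of the techniques listed above, the nearest-neighbor idea can be sketched with the standard library alone. The function name `knn_scores` and the choice of mean distance to the `k` nearest neighbors as the score are assumptions made for this example; real deployments typically rely on an optimized library and an indexed search.

```python
import math


def knn_scores(points, k=2):
    """Score each point by its mean Euclidean distance to its k nearest neighbors."""
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        distances = sorted(
            math.dist(p, q) for j, q in enumerate(points) if j != i
        )
        scores.append(sum(distances[:k]) / k)
    return scores
```

Points in a dense region receive small scores, while an isolated point receives a large one. The brute-force comparison here is quadratic in the number of points, so it suits small samples only.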

Each of these techniques leverages the processed data to learn a model of normal behavior, which is then used to assign anomaly scores. These scores enable threshold-based anomaly tagging: a threshold is defined, scores below it are tagged as normal, and scores above it are tagged as anomalous.
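
Threshold-based tagging is simple to express in code. The sketch below makes two assumptions that the text does not specify: a score exactly equal to the threshold is treated as normal, and the threshold itself may be chosen as a percentile of historical scores. The helper names are illustrative.

```python
import statistics


def percentile_threshold(historical_scores, pct=95):
    """Pick a threshold at the given percentile of past anomaly scores."""
    # quantiles(n=100) returns 99 cut points; index pct - 1 is the pct-th percentile.
    return statistics.quantiles(historical_scores, n=100)[pct - 1]


def tag_anomalies(scores, threshold):
    """Label each score 'anomalous' if it exceeds the threshold, else 'normal'."""
    return ["anomalous" if s > threshold else "normal" for s in scores]
```

Raising the threshold reduces false alarms at the cost of missed fraud, and lowering it does the opposite, which is why the choice is usually revisited alongside the evaluation metrics for the model.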

Reducing misclassifications is imperative for a fraudulent activity model, so the model's accuracy, precision, recall, and F1 score must be examined carefully. Once ready, the model can be operationalized through Cloudera Machine Learning, and its RESTful endpoints can be integrated into the streaming data ingestion pipeline, enabling real-time fraudulent activity detection.
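
For reference, the four metrics named above can be computed directly from true and predicted labels. This is a minimal sketch using the standard definitions, with 1 meaning fraudulent and 0 meaning legitimate; the function name is illustrative.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy, precision, recall, and F1 for binary labels (1 = fraud)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # fraud caught
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # fraud missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # correctly cleared

    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

Because fraud is usually rare, accuracy alone can look high even when most fraud is missed, so precision and recall deserve the closest attention.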

A cloud-native platform approach ensures data will be not only clean and accessible, but also readily scalable, enabling agencies to deliver system integrity at the scale and speed demanded by modern, technology-driven objectives. With a data platform in place to do the heavy lifting around data prep and integration, government agencies are empowered to invest more resources in developing and deploying their fraud detection algorithms. 

As the digitalization of the public sector continues, effective fraud prevention requires sophisticated analytics grounded in a holistic approach to data and advanced machine learning capabilities. Visit the Cloudera Fraud Prevention Resource Kit for more fraud prevention resources and for additional information about how Cloudera Data Platform provides an end-to-end data platform and machine learning models that facilitate fraud detection and prevention.

Nasheb Ismaily
Senior Solutions Engineer, Public Sector