Fraud Detection with Cloudera Stream Processing Part 1

In a previous blog of this series, Turning Streams Into Data Products, we talked about the increased need for reducing the latency between data generation/ingestion and producing analytical results and insights from this data. We discussed how Cloudera Stream Processing (CSP) with Apache Kafka and Apache Flink could be used to process this data in real time and at scale. In this blog we will show a real example of how that is done, looking at how we can use CSP to perform real-time fraud detection.

Building real-time streaming analytics data pipelines requires the ability to process data in the stream. A critical prerequisite for in-stream processing is having the capability to collect and move the data as it is being generated at the point of origin. This is what we call the first-mile problem. This blog will be published in two parts. In part one we will look into how Cloudera DataFlow powered by Apache NiFi solves the first-mile problem by making it easy and efficient to acquire, transform, and move data so that we can enable streaming analytics use cases with very little effort. We will also briefly discuss the advantages of running this flow in a cloud-native Kubernetes deployment of Cloudera DataFlow.

In part two we will explore how we can run real-time streaming analytics using Apache Flink, and we will use Cloudera SQL Stream Builder GUI to easily create streaming jobs using only SQL language (no Java/Scala coding required). We will also use the information produced by the streaming analytics jobs to feed different downstream systems and dashboards. 

The use case

Fraud detection is a great example of a time-critical use case for us to explore. We have all been through a situation where the details of our credit card, or the card of someone we know, have been compromised and illegitimate transactions were charged to the card. To minimize the damage in that situation, the credit card company must be able to identify potential fraud immediately so that it can block the card, contact the user to verify the transactions, and possibly issue a new card to replace the compromised one.

The card transaction data usually comes from event-driven sources, where new data arrives as card purchases happen in the real world. Besides the streaming data, though, we also have traditional data stores (databases, key-value stores, object stores, etc.) containing data that may have to be used to enrich the streaming data. In our use case, the streaming data doesn’t contain account and user details, so we must join the streams with the reference data to produce all the information we need to check each potentially fraudulent transaction.

Depending on the downstream uses of the information produced, we may need to store the data in different formats: produce the list of potentially fraudulent transactions to a Kafka topic so that notification systems can act on them without delay; save statistics in a relational or operational database for further analytics or to feed dashboards; or persist the stream of raw transactions to durable long-term storage for future reference and more analytics.

Our example in this blog will use the functionality within Cloudera DataFlow and CDP to implement the following:

  1. Apache NiFi in Cloudera DataFlow will read a stream of transactions sent over the network.
  2. For each transaction, NiFi makes a call to a production model in Cloudera Machine Learning (CML) to score the fraud potential of the transaction.
  3. If the fraud score is above a certain threshold, NiFi immediately routes the transaction to a Kafka topic that is consumed by notification systems, which trigger the appropriate actions.
  4. The scored transactions are written to the Kafka topic that will feed the real-time analytics process that runs on Apache Flink.
  5. The transaction data augmented with the score is also persisted to an Apache Kudu database for later querying and to feed the fraud dashboard.
  6. Using SQL Stream Builder (SSB), we use continuous streaming SQL to analyze the stream of transactions and detect potential fraud based on the geographical location of the purchases.
  7. The identified fraudulent transactions are written to another Kafka topic that feeds the system that will take the necessary actions.
  8. The streaming SQL job also saves the fraud detections to the Kudu database.
  9. A dashboard reads from the Kudu database to show fraud summary statistics.

Acquiring data with Cloudera DataFlow

Apache NiFi is a component of Cloudera DataFlow that makes it easy to acquire data for your use cases and implement the necessary pipelines to cleanse, transform, and feed your stream processing workflows. With more than 300 processors available out of the box, it can be used to perform universal data distribution, acquiring and processing any type of data, from and to virtually any type of source or sink.

In this use case we created a relatively simple NiFi flow that implements all the operations from steps one through five above, and we will describe these operations in more detail below.

In our use case, we are processing financial transaction data from an external agent. This agent is sending each transaction as it happens to a network address. Each transaction contains the following information:

  • The transaction time stamp
  • The ID of the associated account
  • A unique transaction ID
  • The transaction amount
  • The geographical coordinates of where the transaction happened (latitude and longitude)

The transaction message is in JSON format and looks like the example below:

{
  "ts": "2022-06-21 11:17:26",
  "account_id": "716",
  "transaction_id": "e933787c-f0ff-11ec-8cad-acde48001122",
  "amount": 1926,
  "lat": -35.40439536601375,
  "lon": 174.68080620053922
}

NiFi can create network listeners to receive data as it arrives. For this example we can simply drag and drop a ListenUDP processor onto the NiFi canvas and configure it with the desired port. It is possible to parameterize the configuration of processors to make flows reusable. In this case we defined a parameter called #{input.udp.port}, which we can later set to the exact port we need.

 

Describing the data with a schema

A schema is a document that describes the structure of the data. When sending and receiving data across multiple applications in your environment, or even between processors in a NiFi flow, it’s useful to have a repository where the schemas for all the different types of data are centrally managed and stored. This makes it easier for applications to talk to each other.

Cloudera Data Platform (CDP) comes with a Schema Registry service. For our sample use case, we have stored the schema for our transaction data in the Schema Registry service and have configured our NiFi flow to use the correct schema name. NiFi is integrated with Schema Registry and it will automatically connect to it to retrieve the schema definition whenever needed throughout the flow.
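For illustration only, a transaction schema registered in Schema Registry could look roughly like the Avro definition below. The field names follow the JSON example above and the record name matches the schema name we use later in the flow, but the types shown are assumptions for this sketch rather than the exact definition we stored.

{
  "type": "record",
  "name": "transaction",
  "fields": [
    { "name": "ts", "type": "string" },
    { "name": "account_id", "type": "string" },
    { "name": "transaction_id", "type": "string" },
    { "name": "amount", "type": "double" },
    { "name": "lat", "type": "double" },
    { "name": "lon", "type": "double" }
  ]
}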

The path that the data takes in a NiFi flow is determined by visual connections between the different processors. Here, for example, the data received previously by the ListenUDP processor is “tagged” with the name of the schema that we want to use: “transaction.”

Scoring and routing transactions

We trained and built a machine learning (ML) model using Cloudera Machine Learning (CML) to score each transaction according to its potential to be fraudulent. CML provides a service with a REST endpoint that we can use to perform scoring. As the data flows through the NiFi data flow, we want to call the ML model service for each data point to get its fraud score.

We use NiFi’s LookupRecord processor for this, which allows lookups against a REST service. The response from the CML model contains a fraud score, represented by a real number between zero and one.
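For illustration, assuming the model’s score is added to each record under a field called fraud_score (the field name and value below are hypothetical), an enriched transaction would look something like this:

{
  "ts": "2022-06-21 11:17:26",
  "account_id": "716",
  "transaction_id": "e933787c-f0ff-11ec-8cad-acde48001122",
  "amount": 1926,
  "lat": -35.40439536601375,
  "lon": 174.68080620053922,
  "fraud_score": 0.97
}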

The output of the LookupRecord processor, which contains the original transaction data merged with the response from the ML model, was then connected to a very useful processor in NiFi: the QueryRecord processor.

The QueryRecord processor allows you to define multiple outputs for the processor and associate a SQL query with each of them. It applies the SQL query to the data that is streaming through the processor and sends the results of each query to the associated output.

In this flow we defined three SQL queries to run concurrently in this processor.
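The exact queries live in the processor’s configuration, but as a rough sketch, assuming the hypothetical fraud_score field from the previous step and an arbitrary 0.9 threshold for the notification route, the three outputs could be expressed along these lines (QueryRecord exposes the incoming data as a table named FLOWFILE):

-- Potentially fraudulent transactions (step 3): records above the score threshold
SELECT * FROM FLOWFILE WHERE fraud_score > 0.9

-- All scored transactions, feeding the real-time analytics topic (step 4)
SELECT * FROM FLOWFILE

-- Scored transactions to be persisted to Kudu (step 5)
SELECT ts, account_id, transaction_id, amount, lat, lon, fraud_score FROM FLOWFILE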

 

Note that some processors also define additional outputs, like “failure,” “retry,” etc., so that you can define your own error-handling logic for your flows.

Feeding streams to other systems

At this point of the flow we have already enriched our stream with the ML model’s fraud score and transformed the streams according to what we need downstream. All that’s left to complete our data ingestion is to send the data to Kafka, which we will use to feed our real-time analytical process, and save the transactions to a Kudu table, which we’ll later use to feed our dashboard, as well as for other non-real-time analytical processes down the line.

Apache Kafka and Apache Kudu are also part of CDP, and it’s very simple to configure the Kafka- and Kudu-specific processors to complete the task for us.

Running the data flow natively on the cloud

Once the NiFi flow is built it can be executed in any NiFi deployment you might have. Cloudera DataFlow for the Public Cloud (CDF-PC) provides a cloud-native elastic flow runtime that can run flows efficiently.

Compared to fixed-size NiFi clusters, CDF’s cloud-native flow runtime has a number of advantages:

  • You don’t need to manage NiFi clusters. You can simply connect to the CDF console, upload the flow definition, and execute it. The necessary NiFi service is automatically instantiated as a Kubernetes service to execute the flow, transparently to the user.
  • It provides better resource isolation between flows.
  • Flow executions can auto-scale up and down to ensure the right amount of resources to handle the current amount of data being processed. This avoids resource starvation and also saves costs by deallocating unnecessary resources when they are no longer used.
  • Built-in monitoring with user-defined KPIs that can be tailored to each specific flow at different granularities (system, flow, processor, connection, etc.).

Secure inbound connections

In addition to the above, configuring secure network endpoints to act as ingress gateways is a notoriously difficult problem to solve in the cloud, and the steps vary with each cloud provider: it requires setting up load balancers, DNS records, certificates, and keystore management. CDF-PC abstracts away these complexities with the inbound connections feature, which allows the user to create an inbound connection endpoint by just providing the desired endpoint name and port number.

Parameterized and customizable deployments

When deploying the flow you can define parameters for the flow execution and also choose the size and auto-scaling characteristics of the flow.

 

Native monitoring and alerting

Custom KPIs can be defined to monitor the aspects of the flow that are important to you. Alerts can also be defined to generate notifications when the configured thresholds are crossed.

After the deployment, the metrics collected for the defined KPIs can be monitored on the CDF dashboard.

Cloudera DataFlow also provides direct access to the NiFi canvas for the flow so that you can check details of the execution or troubleshoot issues, if necessary. All the functionality from the GUI is also available programmatically, either through the CDP CLI or the CDF API. The process of creating and managing flows can be fully automated and integrated with CI/CD pipelines.

Conclusion

Collecting data at the point of origination as it gets generated, and quickly making it available on the analytical platform, is critical for the success of any project that requires data streams to be processed in real time. In this blog we showed how Cloudera DataFlow makes it easy to create, test, and deploy data pipelines in the cloud.

Apache NiFi’s graphical user interface and rich set of processors allow users to create simple and complex data flows without having to write code. The interactive experience makes it very easy to test and troubleshoot flows during the development process.

Cloudera DataFlow’s cloud-native flow runtime adds robustness and efficiency to the execution of flows in production, expanding and shrinking elastically to accommodate workload demand.

In part two of this blog we will look at how Cloudera Stream Processing (CSP) can be used to complete the implementation of our fraud detection use case, performing real-time streaming analytics on the data that we have just ingested.

What’s the fastest way to learn more about Cloudera DataFlow and take it for a spin? First, visit our new Cloudera DataFlow home page. Then, take our interactive product tour or sign up for a free trial.

André Araújo
Data in Motion Field Engineer