To design effective fraud-detection architecture, look no further than the human brain (with some help from Spark Streaming and Apache Kafka).
At its core, fraud detection is about detecting whether people are behaving “as they should,” otherwise known as catching anomalies in a stream of events. This goal is reflected in diverse applications such as detecting credit-card fraud, flagging patients who are doctor shopping to obtain a supply of prescription drugs, or identifying bullies in online gaming communities.
To understand how to design an effective fraud-detection architecture, one needs to examine how the human brain learns to detect anomalies and react to them. As it turns out, our brains have multiple systems for analyzing information. (If you are interested in how humans process information, we recommend the book Thinking, Fast and Slow, by Daniel Kahneman.)
Consider a game of tennis, where the players have to detect the arriving ball and react to it appropriately. During the game, players have to perform this detection-reaction loop incredibly fast; there is no time to think, and reactions are based on instinct and very fast pattern detection. Between sets, a player may have a moment or two to reflect upon the game’s progress, identify tactics or strategies the other player is using, and make adjustments accordingly. Between games, the players have much time for reflection: they may notice that a particular aspect of their game is consistently weak and work on improving it, so during the next game they can instinctively perform better. (Note that this reflection can be conscious or unconscious. We have all occasionally spent hours trying to solve a challenging problem, only for the solution to materialize during the morning shower while we are thinking of nothing in particular.)
Combination of Systems
In a similar fashion, effective fraud-detection architecture emulates the human brain by having three subsystems work together to detect anomalies in streams of events:
Near real-time system: the job of this system is to receive events and reply as fast as possible, usually in less than 100ms. This system typically does very little processing and depends mostly on pattern matching and applying predefined rules. Architectural design should focus on achieving very high throughput with very low latency and, to this end, should use patterns such as caching user profiles in local memory.
Stream-processing system: this system can take a little longer to process the incoming data, but should still process each event within a few seconds to a few minutes of its arrival. The goal of this system is to adjust the parameters of the fraud-detection models in near real-time, using data aggregated across all user activity (for example, flagging vendors or regions that are currently more suspicious).
Offline-processing system: this system can run with anywhere from hours to months of latency and focuses on improving the models themselves. This process includes training the models on new data, exploring new features in the data, and developing new models. It also requires that human data analysts explore the data using BI tools.
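To make the first subsystem concrete, here is a minimal sketch of what its hot path might look like: a lookup against a locally cached user profile followed by a pass over predefined rules, with no remote calls. The rule names, thresholds, and profile fields are illustrative assumptions, not details from the architecture described here.

```python
# Hypothetical sketch of the near real-time fast path. The profile cache,
# rule names, and thresholds are all illustrative assumptions.

# Local in-memory cache of user profiles (in a real deployment this would be
# warmed from an external context store and kept per service instance).
profile_cache = {
    "user-42": {"avg_amount": 50.0, "home_country": "US"},
}

# Predefined rules: each returns a reason string if the event looks anomalous.
def rule_amount(event, profile):
    if event["amount"] > 10 * profile["avg_amount"]:
        return "amount far above user average"

def rule_geo(event, profile):
    if event["country"] != profile["home_country"]:
        return "transaction outside home country"

RULES = [rule_amount, rule_geo]

def score_event(event):
    """Return the list of triggered rule reasons; an empty list means 'looks fine'."""
    profile = profile_cache.get(event["user"])
    if profile is None:
        return ["unknown user"]  # no cached profile: escalate rather than guess
    return [reason for rule in RULES if (reason := rule(event, profile))]

print(score_event({"user": "user-42", "amount": 900.0, "country": "US"}))
# → ['amount far above user average']
```

Because everything the rules need is already in local memory, each call is a handful of dictionary lookups, which is what makes a sub-100ms reply realistic.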
In a previous blog post, we explored different patterns of how the first type of system, near real-time, can be implemented and covered some reasons that Cloudera recommends Spark Streaming for the stream-processing system. To recap, the recommended architecture for the real-time reaction system is a service that subscribes as a consumer to an Apache Kafka topic with the events to which reactions are required. The service uses cached state and rules to react quickly to these events and uses Apache HBase as an external context. Kafka partitions can be used to distribute the service and ensure that each instance only needs to cache information for a subset of the users.
The beauty of this plan is that this distributed application is self-contained and can be managed in many ways – as an Apache Flume interceptor, as a YARN container, or with Mesos, Docker, Kubernetes, and many other distributed system container frameworks. You can pick whatever you prefer, since Kafka will do the data persistence and partitioning work.
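The reason partitioning lets each instance cache only a subset of users is that a keyed message always hashes to the same partition, so all of one user's events reach the same service instance. The following toy model (plain Python, not Kafka client code, with an illustrative hash and assignment scheme) shows the idea:

```python
# Toy illustration of partition-based cache locality; the hashing and
# round-robin assignment below are stand-ins, not Kafka's actual algorithms.

NUM_PARTITIONS = 6

def partition_for(user_id: str) -> int:
    # Kafka's default partitioner hashes the message key; this is a simple stand-in.
    return sum(user_id.encode()) % NUM_PARTITIONS

def owning_instance(partition: int, num_instances: int = 3) -> int:
    # With 3 instances and 6 partitions, each instance owns two partitions.
    return partition % num_instances

# Every event keyed by the same user lands on the same partition, hence the
# same instance, so only that instance needs this user's profile in memory.
assert partition_for("user-42") == partition_for("user-42")
```

The same property is what makes the local rule cache safe: no two instances ever need the same user's state at the same time.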
Now, let’s see how to integrate the real-time part of the system with the stream- and offline-processing portions.
Integrating Real-Time Detection and Processing
The key to the integration is the use of Kafka as a scalable, ordered event store. When registering consumers with Kafka, you can subscribe consumers either as part of the same consumer group or in separate consumer groups. If two consumers are subscribed to read events from a topic as part of the same group, they will each “own” a subset of the partitions in the group, and each one will only get events from the partitions it owns. If a consumer in the group crashes, its partitions will be distributed across the other consumers in the group. This approach provides a mechanism for both load balancing and high availability of consumers, and it allows each data-processing application to scale (by adding more consumers to the same group as load increases).
But we also want multiple applications reading the same data; both the real-time app and the streaming app need to read the same data from Kafka. In this case, each application will be its own consumer group and will be able to consume messages from Kafka independently at its own pace.
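The two subscription modes can be sketched with a small simulation (plain Python, not the Kafka client; the round-robin assignment is an illustrative stand-in for Kafka's rebalancing):

```python
# Toy model of consumer groups: consumers in the SAME group split a topic's
# partitions, while each SEPARATE group independently sees every partition.

def assign_partitions(partitions, consumers):
    """Round-robin partition ownership within one consumer group."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

partitions = [0, 1, 2, 3]

# Same group: the real-time app runs two instances, each owning a subset.
realtime_group = assign_partitions(partitions, ["rt-1", "rt-2"])
# → {'rt-1': [0, 2], 'rt-2': [1, 3]}

# If rt-2 crashes, its partitions are redistributed to the survivors.
after_crash = assign_partitions(partitions, ["rt-1"])
# → {'rt-1': [0, 1, 2, 3]}

# A different group (the stream-processing app) gets its own full view of the
# topic, and consumes at its own pace without affecting the real-time group.
streaming_group = assign_partitions(partitions, ["stream-1"])
# → {'stream-1': [0, 1, 2, 3]}
```

Within one group you get load balancing and failover; across groups you get independent readers of the same data, which is exactly the combination the fraud-detection architecture needs.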
To enable offline processing by batch jobs and analysts, the data needs to be stored in HDFS. You can easily do this using Flume; just define Kafka as the channel and HDFS as the sink. In this setup, Flume will read events from Kafka and write them to HDFS, HBase, or Apache Solr where they can be accessed by Apache Spark, Impala, Apache Hive, and other BI tools.
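A minimal Flume agent configuration for this pattern might look like the following. The agent, topic, and path names are illustrative, and some property names vary between Flume releases (newer releases accept `kafka.bootstrap.servers` on the Kafka channel; older ones used different keys), so treat this as a sketch rather than a drop-in config:

```properties
# Hypothetical agent "tier1": Kafka channel draining straight to an HDFS sink.
# No source is needed; the sink reads events directly from the Kafka channel.
tier1.channels = kafka-channel
tier1.sinks = hdfs-sink

tier1.channels.kafka-channel.type = org.apache.flume.channel.kafka.KafkaChannel
tier1.channels.kafka-channel.kafka.bootstrap.servers = broker1:9092
tier1.channels.kafka-channel.kafka.topic = fraud-events
tier1.channels.kafka-channel.parseAsFlumeEvent = false

tier1.sinks.hdfs-sink.type = hdfs
tier1.sinks.hdfs-sink.channel = kafka-channel
tier1.sinks.hdfs-sink.hdfs.path = /data/fraud/events/%Y-%m-%d
tier1.sinks.hdfs-sink.hdfs.fileType = DataStream
```

Because the Kafka channel is itself durable, this path needs no separate Flume source or memory channel: the sink simply drains the topic into HDFS.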
Note that since each system is subscribed to Kafka with its own consumer group, each can read events independently at its own rate. Thus, if the stream-processing system is taking longer to process events, it has no impact on the real-time system. Part of the beauty of Kafka is that it stores events for a set amount of time, regardless of how many consumers read them or what those consumers do, so Kafka will keep performing the same even as you add more processing systems.
The last part of the integration is the ability to send rule and model updates from the stream- and offline-processing systems back to the real-time system. This process is the equivalent of improving human instincts based on practice (changing the threshold of approved transactions sized for a particular vendor, for example).
One approach is to have these systems update the models in HBase, and have the real-time system occasionally (and asynchronously) check HBase for updates. A better option is to send the model updates to another Kafka topic: our real-time app will subscribe to that topic, and when updates show up, it will apply them to its own rule cache and modify its behavior accordingly.
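The second option amounts to folding each update message into the local rule cache as it arrives. Here is a minimal sketch; the topic semantics (key = rule name, value = new threshold, null value = retire the rule) and the vendor threshold example are illustrative assumptions:

```python
# Hypothetical handler for messages from a "model-updates" topic.
# Keys, values, and the retire-on-None convention are illustrative.

# The live rule cache used on the real-time hot path.
rule_cache = {"max_amount:vendor-7": 500.0}

def apply_update(update):
    """Apply one (key, value) update; a value of None means 'retire this rule'."""
    key, value = update
    if value is None:
        rule_cache.pop(key, None)
    else:
        rule_cache[key] = value

# Updates as they might be consumed from the topic, in order:
for update in [("max_amount:vendor-7", 250.0),   # tighten a threshold
               ("max_amount:vendor-9", 800.0),   # add a rule for a new vendor
               ("max_amount:vendor-7", None)]:   # retire vendor-7's rule
    apply_update(update)

print(rule_cache)  # → {'max_amount:vendor-9': 800.0}
```

Because the updates flow through Kafka like any other events, they are ordered per key and every instance of the real-time app converges on the same cache contents.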
An interesting design option here can be to store the models completely in Kafka, because its compaction feature can ensure that regardless of how long you choose to store “raw” data in Kafka, the latest numbers for each model will be stored forever and can always be retrieved and cached by the real-time application.
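The effect of compaction on such a topic can be illustrated with a few lines of Python. Note this only mimics the outcome; real compaction is performed by Kafka's log cleaner on the broker, not by application code, and the topic contents below are made up:

```python
# Toy illustration of Kafka log compaction: for a compacted topic, Kafka
# retains at least the latest value for each key, so the newest model
# parameters survive even after older records age out.

def compact(log):
    """Keep only the last value seen for each key."""
    latest = {}
    for key, value in log:
        latest[key] = value
    return latest

model_topic = [
    ("model:vendor-risk", "v1"),
    ("model:geo-risk",    "v1"),
    ("model:vendor-risk", "v2"),
    ("model:vendor-risk", "v3"),
]

print(compact(model_topic))
# → {'model:vendor-risk': 'v3', 'model:geo-risk': 'v1'}
```

A freshly started real-time instance can therefore rebuild its entire model cache just by reading the compacted topic from the beginning.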
We hope that this post shed some light on a very challenging topic. For hands-on instruction, sign up for our fraud-detection tutorial at Strata + Hadoop World NYC 2015.
Gwen Shapira is a Software Engineer at Cloudera, and a committer on Apache Sqoop and Apache Kafka. She has 15 years of experience working with customers to design scalable data architectures, and is a co-author of the O’Reilly book, Hadoop Application Architectures.
Ted Malaska is a Solutions Architect at Cloudera, a contributor to Spark, Flume, and HBase, and also a co-author of Hadoop Application Architectures.