One area where machine learning has made a large difference for enterprise business is fraud detection. Knowing that a transaction is fraudulent is critical for financial services companies, but knowing that a transaction flagged as fraudulent by a rules-based system is actually valid can be equally important. Intervening in a transaction that was incorrectly flagged as fraud has a cost, and it can erode customer trust: consumers may become concerned if they see too many false positives for fraud on their accounts.
An approach that I have seen our customers adopt is to add a machine learning model after the rules-based system to further categorize the transactions flagged as fraudulent and remove more of the false positives. A well-tuned, accurate model can predict which flagged transactions are false positives, reducing follow-up costs and improving customer confidence.
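As a rough illustration of this two-stage arrangement, the sketch below runs a rules check first and then passes only the flagged transactions to a scoring function. Everything here is an assumption made for illustration: the `rules_flag` rule, the `toy_score` function standing in for a trained model, the field names, and the 0.5 threshold are all hypothetical and are not taken from the prototype.

```python
from typing import Callable, Dict, List

Transaction = Dict[str, float]


def rules_flag(tx: Transaction) -> bool:
    """Hypothetical first-stage rule: large or foreign transactions are flagged."""
    return tx["amount"] > 1000.0 or tx["foreign"] == 1.0


def second_stage_filter(
    transactions: List[Transaction],
    score_fn: Callable[[Transaction], float],
    threshold: float = 0.5,
) -> List[Transaction]:
    """Keep only rule-flagged transactions that the model also scores as risky."""
    flagged = [tx for tx in transactions if rules_flag(tx)]
    return [tx for tx in flagged if score_fn(tx) >= threshold]


def toy_score(tx: Transaction) -> float:
    """Placeholder for a trained model's fraud probability (illustrative only)."""
    return min(1.0, tx["amount"] / 10000.0 + 0.3 * tx["foreign"])


transactions = [
    {"id": 1, "amount": 50.0, "foreign": 0.0},    # passes the rules untouched
    {"id": 2, "amount": 1200.0, "foreign": 0.0},  # flagged by rules, low model score
    {"id": 3, "amount": 8000.0, "foreign": 1.0},  # flagged by rules, high model score
]

kept_ids = [tx["id"] for tx in second_stage_filter(transactions, toy_score)]
```

In this toy run only transaction 3 stays in the follow-up queue; transaction 2 is treated as a likely false positive and dropped from it.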
Deep learning has brought marked improvements in performance across many machine learning domains, and it applies just as well to fraud detection. Fraud data is heavily imbalanced: valid transactions vastly outnumber fraudulent ones, which makes traditional supervised machine learning approaches less effective. An alternative is an anomaly detection based approach: find the pattern in the valid transactions and flag the transactions that don’t fit that pattern as potentially fraudulent.
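To make the anomaly detection idea concrete, here is a minimal sketch that uses a simple statistical baseline rather than deep learning: it learns the mean and standard deviation of each feature from valid transactions only, then flags any transaction whose largest per-feature z-score exceeds a cutoff. The sample values, the two features, and the cutoff of 3.0 are assumptions chosen for illustration.

```python
import statistics


def fit_profile(valid_rows):
    """Learn per-feature mean and standard deviation from valid transactions only."""
    columns = list(zip(*valid_rows))
    means = [statistics.fmean(col) for col in columns]
    # Fall back to 1.0 so a constant feature does not cause division by zero.
    stdevs = [statistics.pstdev(col) or 1.0 for col in columns]
    return means, stdevs


def anomaly_score(row, profile):
    """Largest absolute z-score across features: how far the row is from the pattern."""
    means, stdevs = profile
    return max(abs(v - m) / s for v, m, s in zip(row, means, stdevs))


# Hypothetical valid transactions: (amount, items in basket).
valid = [(20.0, 1.0), (35.0, 2.0), (50.0, 1.0), (28.0, 3.0), (42.0, 2.0)]
profile = fit_profile(valid)
threshold = 3.0  # assumed cutoff, would need tuning on real data

typical = (30.0, 2.0)
unusual = (900.0, 2.0)

is_typical_flagged = anomaly_score(typical, profile) > threshold
is_unusual_flagged = anomaly_score(unusual, profile) > threshold
```

Note that no fraudulent examples are needed to fit the profile, which is what makes this family of approaches attractive when fraud labels are scarce.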
The research team at Cloudera Fast Forward have written a report on using deep learning for anomaly detection. It covers many of the technical and practical requirements of this approach to anomaly detection and you can read more about it here: https://ff12.fastforwardlabs.com/
As part of Cloudera’s ongoing product enhancement efforts, we are creating Applied Machine Learning Prototypes that will deploy a complete sample machine learning project into your CML/CDSW instance. The second Applied Machine Learning Prototype that was made available is for building a fraud detection model.
These are prototypes that will help you build a fully working machine learning example in CML. The prototypes include source data and walk through various steps:
- Ingest data into a useful place in CDP (e.g. a Hive Table)
- Explore the data set
- Create a plan to build a model
- Train the model
- Deploy the model
- Build and deploy an application
Once you have deployed the prototype and all the CML artifacts that go with it, you can unpick it and work backward to map the process to your own data in your own environment.
These prototypes will follow a similar workflow which is illustrated in the picture below.
Once the prototype has been completely deployed, you will have an application that is able to make predictions to classify transactions as fraudulent or not:
The data for this is the widely used credit card fraud dataset. It’s not the raw transaction data but rather an anonymized feature set based on a principal component analysis (PCA) of the original data. The data and the techniques presented in this prototype are still applicable, as creating a PCA feature store is often part of the machine learning process.
The process followed in this prototype covers several steps that you should follow:
- Data ingest – move the raw data to a more suitable storage location
- Data analysis – create a plan to build the model
- Model training – train the model based on the plan
- Model deployment – put the model live and into production
- Application deployment – deploy an application that interacts with the model
The model that you will build when working through this prototype is an autoencoder. From the Cloudera Fast Forward report:
Autoencoders are neural networks designed to learn a low-dimensional representation, given some input data. They consist of two components: an encoder that learns to map input data to a low-dimensional representation (termed the bottleneck), and a decoder that learns to map this low-dimensional representation back to the original input data. By structuring the learning problem in this manner, the encoder network learns an efficient “compression” function that maps input data to a salient lower-dimensional representation, such that the decoder network is able to successfully reconstruct the original input data. The model is trained by minimizing the reconstruction error, which is the difference (mean squared error) between the original input and the reconstructed output produced by the decoder. In practice, autoencoders have been applied as a dimensionality reduction technique, as well as in other use cases such as noise removal from images, image colorization, unsupervised feature extraction, and data compression.
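The reconstruction-error idea described in the report can be sketched with a deliberately tiny linear autoencoder written in plain Python. This is not the prototype's model: the synthetic data, the single-unit bottleneck, the learning rate, and the choice of the maximum training error as the threshold are all assumptions made to keep the example small and self-contained.

```python
import random

random.seed(0)

# Synthetic "valid" data: two features that lie close to the line y = 2x.
train = []
for _ in range(200):
    x = random.uniform(-1.0, 1.0)
    train.append((x, 2.0 * x + random.gauss(0.0, 0.05)))

# Linear autoencoder: 2 inputs -> 1-unit bottleneck -> 2 outputs.
w_enc = [0.1, 0.1]  # encoder weights
w_dec = [0.1, 0.1]  # decoder weights


def reconstruct(row):
    """Encode the row to one number, then decode it back to two features."""
    z = w_enc[0] * row[0] + w_enc[1] * row[1]
    return [w_dec[0] * z, w_dec[1] * z], z


def reconstruction_error(row):
    """Mean squared error between the row and its reconstruction."""
    x_hat, _ = reconstruct(row)
    return sum((h - v) ** 2 for h, v in zip(x_hat, row)) / len(row)


# Train with per-sample gradient descent on the reconstruction error.
learning_rate = 0.05
for _ in range(30):
    for row in train:
        x_hat, z = reconstruct(row)
        residual = [h - v for h, v in zip(x_hat, row)]
        # Error signal propagated back through the decoder to the bottleneck.
        back = residual[0] * w_dec[0] + residual[1] * w_dec[1]
        for i in range(2):
            w_dec[i] -= learning_rate * residual[i] * z
            w_enc[i] -= learning_rate * back * row[i]

# Assumed rule: anything reconstructed worse than every training row is anomalous.
threshold = max(reconstruction_error(row) for row in train)

anomaly = (1.0, -2.0)  # breaks the y = 2x pattern seen in training
is_flagged = reconstruction_error(anomaly) > threshold
```

Because the bottleneck can only represent points along the pattern it learned from valid data, a point that breaks that pattern is reconstructed poorly and receives a high error. A real model would use more layers, nonlinear activations, and a threshold tuned on held-out data.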
Enabling Better Fraud Detection For Your Business
Whether you’re just starting your journey with machine learning or looking to ramp up your fraud detection operation, employing current approaches and technologies is key to staying competitive and better serving your customers. With Cloudera Fast Forward research and the Fraud Detection Applied Machine Learning Prototype in CML, your business can be ready for the challenge. To deploy the prototype into your CML/CDSW instance, you can follow the instructions here, or check out the open GitHub repository.