
Machine Learning in AML & Fraud Prevention

What is the reality when the sales pitch is over?

Detecting and preventing fraud or money laundering is a never-ending battle fought on a constantly shifting battleground. Criminals regularly change their behaviour to avoid automated detection, and the operators of monitoring systems must promptly discover and react to the new fraud patterns. In recent years, Artificial Intelligence (AI) has been taking the industry by storm. Machine Learning (ML) classifiers, which automatically learn patterns from existing data and then apply them to classify new data samples, have emerged as a viable alternative to fixed programming in the form of expert-developed rulesets and scorecards.

Their successful application to solving hard problems previously thought almost impossible for computers has prompted many vendors of machine learning solutions to seek entry into the market of fraud detection and prevention. Unfortunately, many of them underestimate the specifics of the various kinds of fraud and advertise almost magical qualities of their solutions. Thus, navigating the ecosystem of machine learning-based AML and fraud prevention systems can be hard.

No matter which provider one chooses, the data-driven nature of Machine Learning presents some unique challenges, especially for businesses that haven’t dealt with Machine Learning before. Here is some insight from Hristo Iliev, Chief Data Scientist at NOTO, on how to navigate those challenges.

Got data? Adequate training data?

Machine learning models can only be trained to recognise patterns that already exist in the data, which places strong quality requirements on the training and validation data sets. They must contain a balanced and unbiased mix of representative samples of both genuine and bad transactions. Balancing is hard to achieve with fraud and AML data, which by nature is highly skewed in favour of genuine transactions.
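
One common way to compensate for such skew is to give the rare class a larger weight during training. The following is a minimal sketch of that idea, not a recommendation for any particular product: the function name `balanced_class_weights` and the sample counts are made up for illustration, and the formula (number of samples divided by the number of classes times the class count) is the "balanced" heuristic also used by scikit-learn.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to how often it occurs in the labels."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical data set: 990 genuine (0) and 10 bad (1) transactions.
labels = [0] * 990 + [1] * 10
weights = balanced_class_weights(labels)

# Each bad transaction now counts roughly a hundred times more than a genuine one.
print(weights)
```

Weights like these are typically passed to the training routine so that misclassifying a bad transaction costs more than misclassifying a genuine one. They compensate for the imbalance but do not replace representative data.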

Another challenge, especially for small or new businesses that still do not have an extensive database of recorded transactions, is providing large enough data sets for training and validation. Models trained on insufficient data will overfit and will not generalise well to transactions outside the training set. For that reason, some machine learning solution providers have started offering pre-trained models.

While such models can be appealing from a business perspective, since they spare one the burden of gathering the initial data, they might contain hidden biases. That alone can make one liable under the terms of the EU’s General Data Protection Regulation (GDPR) and its upcoming sister regulation on high-risk systems employing AI. The latter places very strong requirements on the quality of the data used to train models employed in financial systems. More importantly, fraud patterns and AML risk exposure vary widely between business sectors and even between businesses in the same sector. Proper normalisation is needed in order to obtain neutral features, which again requires access to sufficient amounts of historical data.
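
To illustrate why normalisation depends on historical data, the sketch below expresses a transaction amount relative to a business's own history as a z-score. This is only one possible normalisation, chosen here for simplicity; the function name `zscore_amount` and both histories are hypothetical.

```python
from statistics import mean, stdev


def zscore_amount(amount, history):
    """Express an amount in standard deviations from the business's own mean."""
    if len(history) < 2:
        raise ValueError("need at least two historical amounts")
    sigma = stdev(history)
    if sigma == 0:
        raise ValueError("historical amounts show no variation")
    return (amount - mean(history)) / sigma


# Hypothetical transaction histories for two very different businesses.
grocery_history = [12.0, 18.0, 25.0, 9.0, 31.0, 15.0]
jeweller_history = [450.0, 520.0, 610.0, 480.0, 700.0, 540.0]

# The same raw amount is extreme for one business and ordinary for the other.
grocery_score = zscore_amount(500.0, grocery_history)
jeweller_score = zscore_amount(500.0, jeweller_history)
print(grocery_score, jeweller_score)
```

A model fed the raw amount would have to learn what is normal for each business separately, whereas the normalised feature means roughly the same thing everywhere. Without enough history, however, the mean and the standard deviation themselves are unreliable.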

The black box problem

Learning from data presents another challenge – a lack of clear explainability of how the models actually work. Unlike explicitly programmed systems that follow predefined logic steps, machine learning solutions learn to approximate existing patterns in the data by adjusting various opaque internal model parameters during the training phase. That makes it hard to reason about why they produce a certain output in any given case. Although explainable models such as decision trees exist, they are rarely employed directly due to their poor ability to generalise. Instead, ensembles of hundreds or thousands of decision trees are used, in which each tree corrects the mistakes of the previous ones. The popular XGBoost is one example. Such ensembles are not nearly as transparent as a single decision tree.
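
The idea of each tree correcting the mistakes of the previous ones can be sketched with one-split trees (stumps) fitted to the remaining errors. This is a toy illustration of boosting with squared error on a single made-up feature. It is not XGBoost's actual algorithm, which adds regularisation and other refinements, and all function names and data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Fit a one-split tree (stump) to the residuals by least squares."""
    thresholds = sorted(set(xs))[:-1]  # a split above the largest x separates nothing
    if not thresholds:
        raise ValueError("need at least two distinct feature values")
    best = None
    for threshold in thresholds:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]  # (threshold, left_mean, right_mean)


def stump_predict(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_boosted(xs, ys, n_trees=50, learning_rate=0.3):
    """Start from the mean, then let each stump fit what is still unexplained."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(xs)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump_predict(stump, x)
                       for p, x in zip(predictions, xs)]
    return base, stumps, learning_rate


def boosted_predict(model, x):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(stump_predict(s, x) for s in stumps)


def mean_squared_error(model, xs, ys):
    return sum((boosted_predict(model, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Made-up feature values and labels that no single split can separate.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 1, 1, 0, 0, 1, 1]

mse_no_trees = mean_squared_error(fit_boosted(xs, ys, n_trees=0), xs, ys)
mse_one_tree = mean_squared_error(fit_boosted(xs, ys, n_trees=1), xs, ys)
mse_many_trees = mean_squared_error(fit_boosted(xs, ys, n_trees=50), xs, ys)
print(mse_no_trees, mse_one_tree, mse_many_trees)
```

The training error shrinks as stumps are added, but the final prediction is a sum over all fifty of them. Each stump is trivial to read on its own; the combined model is not, which is the transparency cost described above.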

Explainability is crucial for businesses operating from within the EU or serving EU customers. GDPR and the AI regulation mandate that one should be able to explain decisions made by automated systems. Under such regimes, reliance on black box machine learning solutions could result in long-running inquiries by the respective regulatory bodies, especially during the initial implementation phases of the new regulation when the reviewers are still gaining experience.

Feedback and human oversight

ML-based anti-fraud and AML solutions gain knowledge from the training data and apply it to predict whether new, previously unseen transactions are fraudulent, but they do not ultimately know whether those predictions turn out to be true. It is up to a human user to determine the truth and to provide it back to the machine learning system in the form of a feedback loop.

With that information available, model performance can be evaluated and compared against preset benchmarks such as the acceptable ratio of false positives. In case of significant underperformance, the model is no longer adequate for the current pattern of fraudulent activity and needs to be retrained or replaced by an entirely different model. Feedback is a crucial element in any machine learning system. If a product advertises ML capabilities but offers no clear way to provide feedback, that likely means either that the model is fixed or that some other opaque mechanism is used to update the model's knowledge.
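
A comparison against such a benchmark could look like the sketch below. It assumes that the ratio of false positives is defined as the share of analyst-confirmed genuine transactions that the model flagged; other definitions are in use, and the function names, the sample data and the 10% limit are all hypothetical.

```python
def false_positive_rate(flagged, confirmed_fraud):
    """Share of confirmed genuine transactions that the model flagged anyway."""
    false_positives = sum(1 for f, c in zip(flagged, confirmed_fraud) if f and not c)
    genuine = sum(1 for c in confirmed_fraud if not c)
    if genuine == 0:
        raise ValueError("no genuine transactions in the feedback sample")
    return false_positives / genuine


def needs_retraining(flagged, confirmed_fraud, max_fpr):
    """True when the observed rate exceeds the agreed benchmark."""
    return false_positive_rate(flagged, confirmed_fraud) > max_fpr


# Model output and analyst feedback for ten hypothetical transactions.
flagged = [True, True, False, False, True, False, False, False, False, False]
confirmed_fraud = [True, False, False, False, False, False, False, False, False, False]

fpr = false_positive_rate(flagged, confirmed_fraud)
retrain = needs_retraining(flagged, confirmed_fraud, max_fpr=0.10)
print(fpr, retrain)
```

Here two of the nine genuine transactions were flagged, which exceeds the assumed limit. None of this can be computed without the analyst-confirmed labels, which is why the feedback channel matters.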

The quality of the feedback is very important. If one simply feeds back all the reported fraud, chargebacks or other negative data without any human oversight, that will lead both to wrong estimates of the model performance and to models with inflated false positive rates. As with any other data-driven solution, ML is governed by the GIGO principle: “garbage in, garbage out”. Therefore, one should be prepared to invest in human fraud and AML analysts, even when employing an ML solution. This is also mandated by the upcoming EU regulation on high-risk AI systems.

The way forward – hybrid solutions

Even when it is possible to train and retrain ML solutions with custom data, it can still be hard to teach them to incorporate fixed logic, which may be needed for regulatory compliance or for addressing simple, well-known fraud patterns. In such cases, one can augment the ML model with a set of rules or embed it in an existing rule-based system, creating a hybrid solution.

Hybrid solutions have a number of practical advantages over pure ML-based ones:

  • It is easy to start using them right away with simple rules, even if there is not enough data available, then gradually add ML on top of them as data accumulates.
  • ML model decisions can be easily overridden and special cases can be quickly added without retraining the model.
  • It is possible to separate features containing sensitive information and process them with clear hand-written rules.
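
A hybrid arrangement of this kind can be sketched as a set of fixed rules that run first and an ML score that is consulted only when no rule applies. Everything in the sketch is hypothetical: the `Transaction` fields, the rules, the thresholds, and `toy_model_score`, which merely stands in for a trained model.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Transaction:
    amount: float
    on_sanctions_list: bool  # sensitive feature handled by rules only


def rule_decision(tx: Transaction) -> Optional[str]:
    """Fixed, hand-written logic; returns None when no rule applies."""
    if tx.on_sanctions_list:
        return "block"
    if tx.amount < 1.0:
        return "allow"
    return None


def hybrid_decision(tx: Transaction,
                    model_score: Callable[[Transaction], float],
                    threshold: float = 0.8) -> str:
    """Rules take precedence; the model decides only the remaining cases."""
    decision = rule_decision(tx)
    if decision is not None:
        return decision
    return "review" if model_score(tx) >= threshold else "allow"


def toy_model_score(tx: Transaction) -> float:
    """Stand-in for a trained model: larger amounts look riskier."""
    return min(tx.amount / 10_000.0, 1.0)


print(hybrid_decision(Transaction(50.0, True), toy_model_score))
print(hybrid_decision(Transaction(9_000.0, False), toy_model_score))
print(hybrid_decision(Transaction(100.0, False), toy_model_score))
```

With `rule_decision` alone the system is usable before any model exists. A new special case is one more `if` statement rather than a retraining run, and the sensitive field never reaches the model.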

Instead of an epilogue

Certainly, there are many considerations to weigh before choosing one solution over another, especially if you are looking for the best fit for your organisation. However, here are our top three:

  • Human expertise in fraud detection and AML is not yet a thing of the past. If anyone knows your business inside and out, it is you.
  • Machine learning and AI solutions are certainly the next step but they are not as autonomous as we would like them to be. You will not find a silver bullet in that niche yet.
  • Be realistic about your current state of operations – do you have enough volume to effectively use machine learning and AI solutions? How much suspicious activity do you have, and how well is your data structured? All these seemingly minor items do matter. A lot!


Author: Hristo Iliev

Hristo is the Chief Data Scientist at NOTO. He is a high-performance computing specialist, software developer, systems administrator and data science enthusiast with a doctoral degree in atomic and molecular physics from the University of Sofia. He specialises in data analytics and predictive modelling at scale and in the development and performance tuning of parallel applications, and is currently the only Stack Overflow user to hold gold badges for both major parallel programming paradigms, OpenMP and the Message Passing Interface.