Detecting Financial Fraud Using Machine Learning


For years, criminals simply copied the numbers printed on debit or credit cards and used them in physical stores. But since 2015, both Mastercard and Visa have required merchants and banks to adopt EMV, a chip technology for cards that allows merchants to request a PIN for every transaction.

Even so, experts predicted that credit card fraud would reach $32 billion in 2020.

To put that in perspective, this sum is higher than the profits reported in 2017 by some traditional global companies, for example Coca-Cola ($2 billion), Warren Buffett’s Berkshire Hathaway ($24 billion) and JPMorgan Chase ($23.5 billion).

In addition to implementing chip technology on cards, companies have also invested massively in machine learning technologies for fraud detection.

Can Machine Learning and Artificial Intelligence become valuable partners in this fight?


Classification Problems

In machine learning, problems such as fraud detection are generally treated as classification problems: predicting the class to which a given observation belongs. Other examples of such problems are spam detection, recommendation systems and default prediction.

For credit card fraud detection, the classification problem requires building models capable of correctly classifying transactions as legitimate or fraudulent, based on details such as value, establishment, location, date and time, among others.
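As a rough illustration of what classifying a transaction means, the sketch below fits a tiny logistic regression by gradient descent on hand-made data. Everything in it is an assumption made for illustration: the single scaled-value feature, the eight synthetic transactions, the choice of logistic regression, and the function names. A real model would use many more features and an established library.

```python
import math

# Synthetic, hand-made data (illustration only): each transaction is
# described by a single feature, its value scaled to the range [0, 1].
values = [0.10, 0.20, 0.30, 0.40, 0.70, 0.80, 0.90, 1.00]
labels = [0, 0, 0, 0, 1, 1, 1, 1]  # 0 = legitimate, 1 = fraudulent


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(xs, ys, learning_rate=0.5, steps=5000):
    """Fit a weight and a bias by plain gradient descent on the log loss."""
    weight, bias = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Difference between predicted probability and true label, per sample.
        errors = [sigmoid(weight * x + bias) - y for x, y in zip(xs, ys)]
        weight -= learning_rate * sum(e * x for e, x in zip(errors, xs)) / n
        bias -= learning_rate * sum(errors) / n
    return weight, bias


def predict(weight, bias, x, threshold=0.5):
    """Return 1 (fraudulent) if the estimated probability reaches the threshold."""
    return int(sigmoid(weight * x + bias) >= threshold)


weight, bias = train_logistic_regression(values, labels)
print(predict(weight, bias, 0.15), predict(weight, bias, 0.95))
```

The threshold parameter is where the trade-off between missed frauds and false alarms would be tuned in practice.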

A great deal of money is still lost to financial fraud. Hackers and swindlers are always on the lookout for new techniques and scams. Relying exclusively on conventional fraud detection systems, which are based on previously defined rules, is not an adequate response to the dynamics of the problem. It is precisely here that our data science services stand out as a distinct solution for these types of problems.


The key challenge we face when framing fraud detection as a classification problem is that the vast majority of transactions are not fraudulent. This should come as no surprise, given that investment in new fraud detection technologies has grown over the years, but it does pose a problem: unbalanced data.
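To see why this imbalance is a problem, consider a model that labels every transaction as legitimate. The counts below are invented for illustration (20 frauds in 10,000 transactions); the point is only that accuracy looks excellent while no fraud is caught.

```python
# Illustrative counts only (assumed for this sketch, not from a real data set).
n_legitimate = 9_980
n_fraudulent = 20

y_true = [0] * n_legitimate + [1] * n_fraudulent
y_pred = [0] * len(y_true)  # naive model: always predicts "legitimate"

# Share of transactions labelled correctly.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Share of actual frauds that the model flagged.
caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
recall = caught / n_fraudulent

print(f"accuracy: {accuracy:.3f}")        # accuracy: 0.998
print(f"recall on fraud: {recall:.3f}")   # recall on fraud: 0.000
```

This is why metrics computed on the fraud class, such as recall and precision, are usually more informative than accuracy for this kind of problem.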

A popular method for handling unbalanced data is over-sampling. It means artificially creating new observations of the class that is under-represented in the data set.
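The simplest form of over-sampling duplicates randomly chosen observations from the minority class until the classes are balanced. The sketch below assumes a binary problem; the function name random_oversample and the toy data are hypothetical, and more elaborate variants that synthesize new points exist as well.

```python
import random


def random_oversample(rows, labels, minority_label=1, seed=0):
    """Duplicate randomly chosen minority rows until both classes have equal size."""
    rng = random.Random(seed)
    minority = [r for r, y in zip(rows, labels) if y == minority_label]
    if not minority:
        raise ValueError("no observations with the minority label")
    majority_count = len(rows) - len(minority)
    n_extra = max(0, majority_count - len(minority))
    # Sample with replacement from the existing minority observations.
    extra = [rng.choice(minority) for _ in range(n_extra)]
    new_rows = list(rows) + extra
    new_labels = list(labels) + [minority_label] * n_extra
    return new_rows, new_labels


# Toy data: 8 legitimate transactions (label 0) and 2 fraudulent ones (label 1).
rows = [[float(i)] for i in range(10)]
labels = [0] * 8 + [1] * 2

balanced_rows, balanced_labels = random_oversample(rows, labels)
print(balanced_labels.count(0), balanced_labels.count(1))  # 8 8
```

Over-sampling is normally applied to the training split only, so that duplicated frauds do not leak into the data used for evaluation.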

What do you do?

Now let’s work through a practical example of fraud detection and see how to overcome the limitation imposed by unbalanced data.

The data set contains transactions carried out by European credit card holders in September 2013. It covers transactions that took place over two days, with 492 frauds out of 284,807 transactions. The data is highly unbalanced, with fraud cases representing only 0.172% of all transactions.

Note that, for reasons of confidentiality, the data has been anonymized: the variables have been renamed V1, V2, V3 through V28. In addition, most were rescaled, with the exception of the Value (transaction amount) and Class variables, the latter being the binary response variable.

It is always a good idea to carry out an Exploratory Data Analysis (EDA) before you start working on predictive models. In this particular case, however, most variables do not add much context, since they have been anonymized.
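Even with anonymized variables, a few basic checks remain useful, such as the class balance and how the transaction value differs between classes. The sketch below runs them on synthetic records; the field names Value and Class follow the description above, while the generated numbers are invented and say nothing about the real data. In practice a data-frame library would typically be used for this.

```python
import random
from collections import Counter
from statistics import mean, median

rng = random.Random(42)

# Synthetic stand-in records (invented values, not the real data set).
transactions = [{"Value": round(rng.uniform(1, 500), 2), "Class": 0} for _ in range(995)]
transactions += [{"Value": round(rng.uniform(1, 2000), 2), "Class": 1} for _ in range(5)]

# How unbalanced are the classes?
class_counts = Counter(t["Class"] for t in transactions)

# Summary of the transaction value within each class.
for label, count in sorted(class_counts.items()):
    values = [t["Value"] for t in transactions if t["Class"] == label]
    share = count / len(transactions)
    print(f"class {label}: n={count} ({share:.2%}), "
          f"mean={mean(values):.2f}, median={median(values):.2f}")
```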