Data analysis techniques for fraud detection

Fraud is a billion-dollar business, and it is increasing every year. The PwC Global Economic Crime and Fraud Survey of 2018[1] found that about half (49%) of the 7,200 companies surveyed had experienced fraud of some kind. This is an increase from the 2016 PwC study, in which slightly more than a third (36%) of the organizations surveyed had experienced economic crime.

Opportunities for fraud co-evolve with technology, especially information technology.[2] Business reengineering, reorganization or downsizing may weaken or eliminate controls, while new information systems may present additional opportunities to commit fraud.

Traditional methods of data analysis have long been used to detect fraud. They require complex and time-consuming investigations that draw on different domains of knowledge, such as finance, economics, business practices and law. Fraud often consists of many instances or incidents involving repeated transgressions using the same method. Fraud instances can be similar in content and appearance but are usually not identical.[3]

The first industries to use data analysis techniques to prevent fraud were the telephone companies, the insurance companies and the banks (Decker 1998). One early example of successful implementation of data analysis techniques in the banking industry is the FICO Falcon fraud assessment system, which is based on a neural network shell.

Retailers also suffer from fraud at the point of sale (POS). Some supermarkets have started to use digitized closed-circuit television (CCTV) together with POS data for the transactions most susceptible to fraud.

Fraud involving cell phones, insurance claims, tax return claims, credit card transactions and the like represents a significant problem for governments and businesses, yet detecting and preventing it is not a simple task. Fraud is an adaptive crime, so detecting and preventing it requires special methods of intelligent data analysis. These methods exist in the areas of Knowledge Discovery in Databases (KDD), data mining, machine learning and statistics. They offer applicable and successful solutions in different areas of fraud crimes.

In general, the primary reason to use data analytics techniques to tackle fraud is that many internal control systems have serious weaknesses. To effectively test, detect, validate, correct errors in and monitor control systems against fraudulent activities, business entities and organizations rely on specialized data analytics techniques such as data mining, data matching, the sounds-like function, regression analysis, clustering analysis and gap analysis.[4] Techniques used for fraud detection fall into two primary classes: statistical techniques and artificial intelligence.[3] Examples of statistical data analysis techniques are:

  • Data preprocessing techniques for detecting, validating and correcting errors, and for filling in missing or incorrect data.
  • Calculation of various statistical parameters such as averages, quantiles, performance metrics, probability distributions, and so on. For example, the averages may include average length of call, average number of calls per month and average delays in bill payment.
  • Models of various business activities, expressed either in terms of various parameters or as probability distributions.
  • Computing user profiles.
  • Time-series analysis of time-dependent data.
  • Clustering and classification to find patterns and associations among groups of data.
  • Data matching, which compares two sets of collected data. The process can be based on algorithms or programmed loops that try to match sets of data against each other or compare complex data types. Data matching is used to remove duplicate records and to identify links between two data sets for marketing, security or other uses.[4]
  • The sounds-like function, which finds values that sound similar. Phonetic similarity is one way to locate possible duplicate values or inconsistent spelling in manually entered data. The ‘sounds like’ function converts the comparison strings to four-character American Soundex codes, which consist of the first letter followed by three digits encoding the consonants that come after it in each string.[4]
  • Regression analysis, which examines the relationship between two or more variables of interest. It estimates relationships between independent variables and a dependent variable, and can be used to help understand and identify relationships among variables and to predict actual results.[4]
  • Gap analysis, which determines whether business requirements are being met and, if not, what steps should be taken to meet them. In the article “Fraud analysis techniques using ACL”,[6] a gap refers to the space between "where we are" (the present state) and "where we want to be" (the target state).[4]
  • Matching algorithms to detect anomalies in the behavior of transactions or users as compared with previously known models and profiles. Techniques are also needed to eliminate false alarms, estimate risks, and predict the future behavior of current transactions or users.
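
As an illustration of the ‘sounds like’ function described in the list above, the following is a minimal Python sketch of American Soundex. The function name `soundex` and the sample names are assumptions made for this example; they are not taken from ACL or any other specific fraud tool, whose implementations may differ in edge cases.

```python
# Consonant groups used by American Soundex; vowels, Y, H and W carry no digit.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name):
    """Return the four-character Soundex code of a name ('' if it has no letters)."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0]                    # the first letter is kept as-is
    prev = CODES.get(letters[0], "")     # its digit still suppresses repeats
    for ch in letters[1:]:
        if len(code) == 4:
            break
        if ch in "HW":
            continue                     # H and W do not separate equal digits
        digit = CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        prev = digit                     # a vowel resets prev, allowing a repeat
    return code.ljust(4, "0")            # pad short codes with zeros


# Differently spelled names that sound alike receive the same code.
print(soundex("Robert"), soundex("Rupert"))   # both "R163"
print(soundex("Smith"), soundex("Smyth"))     # both "S530"
```

Records whose codes are equal can then be grouped as candidate duplicates for manual review; a matching code is only a hint, not proof that two entries refer to the same person.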

Some forensic accountants specialize in forensic analytics, which is the procurement and analysis of electronic data to reconstruct, detect, or otherwise support a claim of financial fraud. The main steps in forensic analytics are (a) data collection, (b) data preparation, (c) data analysis, and (d) reporting. For example, forensic analytics may be used to review an employee's purchasing card activity to assess whether any of the purchases were diverted or divertible for personal use. It might also be used to review the invoicing activity for a vendor to identify fictitious vendors, or by a franchisor to detect fraudulent or erroneous sales reports by a franchisee.[5]

Fraud detection is a knowledge-intensive activity. The main AI techniques used for fraud detection include:

  • Data mining to classify, cluster, and segment the data and automatically find associations and rules in the data that may signify interesting patterns, including those related to fraud.
  • Expert systems to encode expertise for detecting fraud in the form of rules.
  • Pattern recognition to detect approximate classes, clusters, or patterns of suspicious behavior either automatically (unsupervised) or to match given inputs.
  • Machine learning techniques to automatically identify characteristics of fraud.
  • Neural networks that can learn suspicious patterns from samples and can later be used to detect them.
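
The expert-system item in the list above can be pictured as a small set of hand-written rules applied to each transaction. The sketch below is only illustrative: the rule names, thresholds and transaction fields (`amount`, `country`, `home_country`, `hour`, `seconds_since_last`) are hypothetical and would in practice come from domain experts.

```python
# Each rule pairs a name with a test on a transaction record (a dict).
# All thresholds and field names here are made up for illustration.
RULES = [
    ("large_amount", lambda t: t["amount"] > 5000),
    ("foreign_night", lambda t: t["country"] != t["home_country"] and t["hour"] < 6),
    ("rapid_repeat", lambda t: t["seconds_since_last"] < 60),
]


def fired_rules(transaction):
    """Return the names of all rules that the transaction triggers."""
    return [name for name, test in RULES if test(transaction)]


transaction = {"amount": 7200, "country": "FR", "home_country": "US",
               "hour": 3, "seconds_since_last": 900}
print(fired_rules(transaction))   # ['large_amount', 'foreign_night']
```

A transaction that triggers one or more rules would typically be queued for review rather than rejected outright, since rules of this kind produce false alarms.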

Other techniques such as link analysis, Bayesian networks, decision theory, and sequence matching are also used for fraud detection.[3] A newer technique called the system properties approach has also been employed wherever rank data is available.[6]

Statistical analysis of research data has been described as the most comprehensive method for determining whether data fraud exists. Data fraud, as defined by the Office of Research Integrity (ORI), includes fabrication, falsification and plagiarism. A peer-reviewed paper on data fraud was under review at the time of writing, with the stated expectation that it would become a standard for scientific and legal analysis. The statistical work was performed by Drs. Mark S. Kaiser and Alicia L. Carriquiry of Iowa State University and Dr. Gordon M. Harrington of the University of Northern Iowa. They showed that data thought to be fabricated (the HI data) was in fact real, while another data set (the Hansen data), reported to the statisticians as fabricated, was in fact falsified and plagiarized from the HI data set (Fleming RM, Fleming MR, Chaudhuri TK. Establishing Data Validity: Statistically Determining if Data is Fabricated, Falsified or Plagiarized. 2019. ACTA Sci Med Sci).


The younger companies in the fraud prevention space tend to rely on systems built around machine learning from the start, rather than incorporating machine learning into an existing system later. These companies include Featurespace,[7] Zensed,[8] Feedzai, Cybersource, Stripe,[9] SecurionPay,[10] Forter, Sift Science,[11] Signifyd,[12] Riskified,[13] Experian and ThirdWatch.[14] However, multiple security concerns have been raised about how such solutions collect signals for fraud detection and how they are deployed.[15]

Machine learning and data mining

Early data analysis techniques were oriented toward extracting quantitative and statistical data characteristics. These techniques facilitate useful data interpretations and can help give better insight into the processes behind the data. Although traditional data analysis techniques can indirectly lead us to knowledge, that knowledge is still created by human analysts.[16]

To go beyond this, a data analysis system has to be equipped with a substantial amount of background knowledge and be able to perform reasoning tasks involving that knowledge and the data provided.[16] In an effort to meet this goal, researchers have turned to ideas from the machine learning field. This is a natural source of ideas, since the machine learning task can be described as turning background knowledge and examples (input) into knowledge (output).

If data mining results in discovering meaningful patterns, data turns into information. Information or patterns that are novel, valid and potentially useful are not merely information, but knowledge. One speaks of discovering knowledge that was previously hidden in the huge amount of data but is now revealed.

Machine learning and artificial intelligence solutions may be classified into two categories: 'supervised' and 'unsupervised' learning. These methods search for accounts, customers, suppliers, etc. that behave 'unusually' in order to output suspicion scores, rules or visual anomalies, depending on the method.[2]

Whether supervised or unsupervised methods are used, note that the output gives only an indication of fraud likelihood. No stand-alone statistical analysis can guarantee that a particular object is fraudulent, but such analyses can identify fraudulent objects with a very high degree of accuracy.

Supervised learning

In supervised learning, a random sub-sample of all records is taken and manually classified as either 'fraudulent' or 'non-fraudulent'. Relatively rare events such as fraud may need to be oversampled to obtain a large enough sample.[17] These manually classified records are then used to train a supervised machine learning algorithm. After a model has been built from this training data, the algorithm should be able to classify new records as either fraudulent or non-fraudulent.
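
The workflow described above (label a sample, oversample the rare fraud class, train, then score new records) can be sketched in plain Python. Everything specific in this sketch is an assumption made for illustration: the toy records and their two features, random duplication as the oversampling method, and logistic regression as the classifier. The source does not prescribe any of these choices.

```python
import math
import random


def oversample_fraud(records, labels, seed=0):
    """Randomly duplicate fraud records until both classes are the same size."""
    rng = random.Random(seed)
    fraud = [r for r, y in zip(records, labels) if y == 1]
    legit = [r for r, y in zip(records, labels) if y == 0]
    if not fraud:
        raise ValueError("need at least one labelled fraud record")
    extra = [rng.choice(fraud) for _ in range(max(0, len(legit) - len(fraud)))]
    balanced = legit + fraud + extra
    targets = [0] * len(legit) + [1] * (len(fraud) + len(extra))
    return balanced, targets


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic(records, targets, rate=0.5, epochs=1000):
    """Fit a logistic regression model by batch gradient descent."""
    weights, bias = [0.0] * len(records[0]), 0.0
    n = len(records)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for x, y in zip(records, targets):
            error = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) - y
            grad_w = [g + error * v for g, v in zip(grad_w, x)]
            grad_b += error
        weights = [w - rate * g / n for w, g in zip(weights, grad_w)]
        bias -= rate * grad_b / n
    return weights, bias


def fraud_score(model, record):
    """Return a suspicion score between 0 and 1 for one record."""
    weights, bias = model
    return sigmoid(sum(w * v for w, v in zip(weights, record)) + bias)


# Toy, made-up records: [amount in thousands, 1 if made at night else 0].
records = [[0.2, 0], [0.3, 0], [0.1, 1], [0.4, 0], [0.25, 0],
           [0.35, 1], [0.15, 0], [0.3, 0], [2.5, 1], [3.0, 1]]
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]   # the manual classification

balanced, targets = oversample_fraud(records, labels)
model = train_logistic(balanced, targets)
print(fraud_score(model, [2.8, 1]))   # score for a new, fraud-like record
print(fraud_score(model, [0.2, 0]))   # score for a new, ordinary record
```

Because oversampling changes the class proportions seen during training, the resulting scores are best read as rankings of suspicion rather than as calibrated fraud probabilities.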

Supervised neural networks, fuzzy neural nets, and combinations of neural nets and rules, have been extensively explored and used for detecting fraud in mobile phone networks and financial statement fraud.[18][19]

Bayesian learning neural networks have been implemented for credit card fraud detection, telecommunications fraud, auto claim fraud detection, and medical insurance fraud.[20]

Hybrid knowledge/statistical-based systems, where expert knowledge is integrated with statistical power, use a series of data mining techniques for the purpose of detecting cellular clone fraud. Specifically, a rule-learning program to uncover indicators of fraudulent behaviour from a large database of customer transactions is implemented.[21]

Cahill et al. (2000) design a fraud signature, based on data from fraudulent calls, to detect telecommunications fraud. To score a call for fraud, its probability under the account signature is compared with its probability under a fraud signature. The fraud signature is updated sequentially, enabling event-driven fraud detection.
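
The comparison of a call's probability under two signatures can be sketched as a log-likelihood ratio. The following is a simplified illustration, not the method of Cahill et al. as published: it assumes that a signature is a set of independent categorical distributions over call features, and that sequential updating is done by exponential weighting. The feature names, probabilities and the `floor` and `weight` parameters are all hypothetical.

```python
import math


def score_call(call, account_sig, fraud_sig, floor=1e-6):
    """Log-likelihood ratio of a call; positive means it looks more like fraud."""
    score = 0.0
    for feature, value in call.items():
        p_account = account_sig[feature].get(value, floor)   # floor for unseen values
        p_fraud = fraud_sig[feature].get(value, floor)
        score += math.log(p_fraud / p_account)
    return score


def update_signature(signature, call, weight=0.1):
    """Shift a signature toward a new call by exponential weighting (in place)."""
    for feature, value in call.items():
        dist = signature[feature]
        for key in dist:
            dist[key] *= 1.0 - weight
        dist[value] = dist.get(value, 0.0) + weight


account_sig = {"destination": {"domestic": 0.9, "international": 0.1},
               "period": {"day": 0.8, "night": 0.2}}
fraud_sig = {"destination": {"domestic": 0.2, "international": 0.8},
             "period": {"day": 0.3, "night": 0.7}}

suspicious = {"destination": "international", "period": "night"}
ordinary = {"destination": "domestic", "period": "day"}
print(score_call(suspicious, account_sig, fraud_sig))   # positive score
print(score_call(ordinary, account_sig, fraud_sig))     # negative score

# A call accepted as legitimate updates the account signature as it arrives.
update_signature(account_sig, ordinary)
```

Updating after every call is what makes the approach event-driven: each new call is scored against signatures that already reflect all earlier calls.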

Link analysis takes a different approach. It relates known fraudsters to other individuals, using record linkage and social network methods.[22][23]
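
A minimal sketch of this idea links accounts that share an attribute value and then walks outward from known fraudsters. The account identifiers, attribute names and the `max_hops` cut-off are assumptions made for illustration; real record linkage usually also has to cope with inexact matches.

```python
from collections import defaultdict, deque


def build_links(records):
    """Link any two accounts that share an attribute value (e.g. a phone number)."""
    by_value = defaultdict(set)
    for account, attributes in records.items():
        for key, value in attributes.items():
            by_value[(key, value)].add(account)
    graph = defaultdict(set)
    for accounts in by_value.values():
        for account in accounts:
            graph[account] |= accounts - {account}
    return graph


def accounts_near_fraud(graph, known_fraudsters, max_hops=2):
    """Breadth-first search: map each linked account to its distance in hops."""
    distance = {f: 0 for f in known_fraudsters}
    queue = deque(known_fraudsters)
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue                      # do not expand beyond the cut-off
        for neighbour in graph.get(node, ()):
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return {a: d for a, d in distance.items() if d > 0}


# Hypothetical account records.
records = {
    "A": {"phone": "555-0101", "address": "1 Elm St"},
    "B": {"phone": "555-0101", "address": "9 Oak Ave"},
    "C": {"phone": "555-0199", "address": "9 Oak Ave"},
    "D": {"phone": "555-0123", "address": "4 Pine Rd"},
}
graph = build_links(records)
print(accounts_near_fraud(graph, ["A"]))   # {'B': 1, 'C': 2}
```

Accounts closer to a known fraudster can then be given priority for investigation; a link on its own does not show that an account is fraudulent.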

This type of detection is only able to detect frauds similar to those that have occurred previously and been classified by a human. Detecting a novel type of fraud may require the use of an unsupervised machine learning algorithm.

Unsupervised learning

In contrast, unsupervised methods do not make use of labelled records.

Some important studies of unsupervised learning for fraud detection should be mentioned. For example, Bolton and Hand[24] apply Peer Group Analysis and Break Point Analysis to spending behaviour in credit card accounts. Peer Group Analysis detects individual objects that begin to behave in a way different from objects to which they had previously been similar. Break Point Analysis, unlike Peer Group Analysis, operates on the account level: a break point is an observation at which anomalous behaviour for a particular account is detected.
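
The idea behind a break point can be illustrated by comparing an account's most recent transactions with its own earlier history. The sketch below is a simplification and not Bolton and Hand's exact procedure: the window size, the use of a standardized difference of means, and the sample amounts are all assumptions.

```python
import math
from statistics import mean, stdev


def break_point_score(amounts, recent=4):
    """Standardized shift of the last `recent` amounts against earlier ones."""
    older, newer = amounts[:-recent], amounts[-recent:]
    if len(older) < 2:
        raise ValueError("need at least two earlier transactions")
    shift = mean(newer) - mean(older)
    spread = stdev(older)
    if spread == 0:
        # Earlier history is constant: any change at all is infinitely unusual.
        return 0.0 if shift == 0 else math.copysign(math.inf, shift)
    return shift / spread


# Made-up spending histories for a single account, oldest first.
steady = [20, 22, 19, 21, 20, 23, 18, 21, 20, 22, 19, 21]
shifted = [20, 22, 19, 21, 20, 23, 18, 21, 250, 300, 280, 310]

print(break_point_score(steady))    # 0.0: recent spending matches the history
print(break_point_score(shifted))   # large positive value: candidate break point
```

No labelled fraud cases are needed, since each account is compared only with its own past; a high score marks the account for closer inspection.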

Murad and Pinkas[25] also focus on behavioural changes for the purpose of fraud detection and present three-level profiling. The three-level profiling method operates at the account level and flags any significant deviation from an account's normal behaviour as potential fraud. To do this, 'normal' profiles are created from data without fraudulent records (a semi-supervised approach). In the same field, Burge and Shawe-Taylor[26] also use behaviour profiling for fraud detection; however, they apply unsupervised learning, using a recurrent neural network to prototype calling behaviour.

Cox et al.[27] combine human pattern recognition skills with automated data algorithms. In their work, information is presented visually through domain-specific interfaces (Jans et al.).

References

  1. ^ Lavion, Didier; et al. "PwC's Global Economic Crime and Fraud Survey 2018" (PDF). Retrieved 28 August 2018.
  2. ^ a b Bolton, R. & Hand, D. (2002). Statistical Fraud Detection: A Review (With Discussion). Statistical Science 17(3): 235–255.
  3. ^ a b c Palshikar, G. K. (2002). The Hidden Truth – Frauds and Their Control: A Critical Application for Business Intelligence. Intelligent Enterprise 5(9), 28 May 2002: 46–51.
  4. ^ a b c d e Reference named "English302gmu"; no citation details are given in the source.
  5. ^ Nigrini, Mark (June 2011). "Forensic Analytics: Methods and Techniques for Forensic Accounting Investigations". Hoboken, NJ: John Wiley & Sons Inc. ISBN 978-0-470-89046-2.
  6. ^ Vani, G. K. (February 2018). "How to detect data collection fraud using System properties approach". Multilogic in Science. ISSN 2277-7601. Retrieved February 2, 2019.
  7. ^ "Adaptive behavioural analytics". 2017-10-13. Retrieved 2017-10-13.
  8. ^ "Artificial intelligence fraud prevention for eCommerce". 2017-08-06. Retrieved 2017-06-06.
  9. ^ "Fraud Prevention for Ecommerce, Travel and Financial Enterprises". 2015-04-17. Retrieved 2017-06-06.
  10. ^ "Online Payments with built-in machine learning capabilities". Retrieved 2017-10-12.
  11. ^ "Fraud Prevention Software & Chargeback Protection". Sift Science. Retrieved 2017-06-06.
  12. ^ "Fraud Protection & Chargeback Prevention for eCommerce". 2017-02-22. Retrieved 2017-06-06.
  13. ^ "Archived copy". Archived from the original on 2017-06-03. Retrieved 2017-06-01.
  14. ^ "Reduce RTO, AI Based Fraud Prevention & Chargeback Protection". ThirdWatch. Retrieved 2017-11-08.
  15. ^ Dhiman, Mayank. Breaking Fraud & Bot Detection Solutions. OWASP AppSec Cali' 2018. Retrieved February 10, 2018.
  16. ^ a b Michalski, R. S., Bratko, I. & Kubat, M. (1998). Machine Learning and Data Mining – Methods and Applications. John Wiley & Sons Ltd.
  17. ^ Dal Pozzolo, A., Caelen, O., Le Borgne, Y., Waterschoot, S. & Bontempi, G. (2014). Learned lessons in credit card fraud detection from a practitioner perspective. Expert Systems with Applications 41(10): 4915–4928.
  18. ^ Green, B. & Choi, J. (1997). Assessing the Risk of Management Fraud through Neural Network Technology. Auditing 16(1): 14–28.
  19. ^ Estevez, P., Held, C. & Perez, C. (2006). Subscription fraud prevention in telecommunications using fuzzy rules and neural networks. Expert Systems with Applications 31: 337–344.
  20. ^ Bhowmik, Rekha. "35 Data Mining Techniques in Fraud Detection". Journal of Digital Forensics, Security and Law. University of Texas at Dallas.
  21. ^ Fawcett, T. (1997). AI Approaches to Fraud Detection and Risk Management: Papers from the 1997 AAAI Workshop. Technical Report WS-97-07. AAAI Press.
  22. ^ Phua, C., Lee, V., Smith-Miles, K. & Gayler, R. (2005). "A Comprehensive Survey of Data Mining-based Fraud Detection Research" (PDF). Clayton School of Information Technology, Monash University.
  23. ^ Cortes, C. & Pregibon, D. (2001). Signature-Based Methods for Data Streams. Data Mining and Knowledge Discovery 5: 167–182.
  24. ^ a b Bolton, R. & Hand, D. (2001). Unsupervised Profiling Methods for Fraud Detection. Credit Scoring and Credit Control VII.
  25. ^ Murad, U. & Pinkas, G. (1999). Unsupervised Profiling for Identifying Superimposed Fraud. Proceedings of PKDD'99.
  26. ^ Burge, P. & Shawe-Taylor, J. (2001). "An Unsupervised Neural Network Approach to Profiling the Behaviour of Mobile Phone Users for Use in Fraud Detection". Journal of Parallel and Distributed Computing. 61: 915–925. doi:10.1006/jpdc.2000.1720.
  27. ^ Cox, K., Eick, S. & Wills, G. (1997). "Visual Data Mining: Recognising Telephone Calling Fraud". Data Mining and Knowledge Discovery. 1: 225–231. doi:10.1023/A:1009740009307.