
Comparative study of deep learning explainability and causal ai for fraud detection

Open Access | Aug 2024


I.
Introduction

Fraudulent activities such as identity theft, credit card fraud, insurance fraud, and cybercrime have become increasingly common across industries [1]. These illicit activities not only cost businesses huge amounts of money but also damage their reputation and erode customer confidence. The economic impact of fraud can be staggering, with businesses of all sizes collectively losing billions of dollars annually [1]. Traditional fraud detection methods are mainly based on patterns or abnormalities found in datasets. These approaches involve establishing rules and thresholds to define deviations from expected behavior [2]. While effective to a certain extent, these methods have inherent limitations. They often struggle to distinguish between a mere correlation and an actual causal relationship [3]. In other words, some activities may be flagged as fraudulent on the basis of statistical anomalies, but identifying the root cause of those anomalies remains challenging.

The use of deep learning models in fraud detection has greatly increased accuracy and has made it possible to identify intricate patterns indicative of fraudulent activity [4]. However, the intrinsic opacity of these models makes it difficult to comprehend how decisions are made, which raises questions about how reliable and interpretable they are [5]. This study begins a thorough investigation of the interpretability of deep learning models for fraud detection. The main goal is to evaluate and compare different interpretability methods used on well-known deep learning architectures, in order to decipher the complex decision logic behind them.

To validate and establish trust in the results of these models, stakeholders, regulators, and end users must comprehend the elements that influence fraud predictions [5]. By evaluating the efficacy of strategies that include feature importance analysis, layer-wise relevance propagation, and model-agnostic methods across several deep learning architectures, this work seeks to close the interpretability gap.

In response to these issues, the development of advanced artificial intelligence (AI) approaches has provided promising solutions. Notably, incorporating causal artificial intelligence (Causal AI) into fraud detection marks a substantial advancement in the field. Causal AI goes beyond typical machine learning (ML) methodologies by identifying causal relationships within data, resulting in a better understanding of the underlying mechanisms that drive fraudulent behavior. Causal AI has the ability to revolutionize fraud detection by not only recognizing abnormalities but also determining the core reasons for fraudulent activity.

The theoretical discussion centers on the necessity for a balance between accuracy and interpretability in fraud detection. While deep learning algorithms have remarkable accuracy in detecting fraudulent actions, their opacity makes it difficult to explain and validate their conclusions. This paper seeks to close this gap by critically assessing existing approaches and providing strategies for improving the transparency and efficacy of fraud detection systems.

This study contributes to the continuing topic of enhancing the transparency and reliability of fraud detection systems by assessing the current state of the art in fraud detection methodologies. The study intends to advance the field of fraud detection by identifying areas for improvement and providing actionable insights that will lead to the creation of more interpretable and effective algorithms. Specifically, the study compares deep learning explainability (DLE) to Causal AI in the setting of fraud detection, an area that has received little attention.

II.
Background

Fraudulent activities represent a serious threat to a number of industries, including e-commerce, healthcare, and banking. The intricacy and refinement of fraudulent schemes have made it increasingly difficult for conventional rule-based and statistical models to precisely recognize new trends [6]. By utilizing complex associations found in large datasets, deep learning models—neural networks and recurrent neural networks (RNNs), in particular—have completely changed the field of fraud detection [7]. However, issues with the interpretability of these models have emerged as a result of their extensive use. Because deep learning architectures are opaque and are sometimes referred to as “black boxes,” it is challenging for stakeholders to validate and trust the results because it is difficult to understand how decisions are made [8].

Over the years, fraudulent activities have become more sophisticated and widespread, affecting areas such as finance, healthcare, insurance, and e-commerce. The economic impact of fraud runs into millions of dollars and grows each year. Traditional fraud detection methods have long relied on rule-based systems, statistical analysis, and anomaly detection. While these methods help to some extent, they often struggle to keep pace with changing fraud techniques, resulting in false positives (FPs) and missed fraudulent activity [9].

In order to promote confidence and responsibility in automated decision-making systems, model interpretability is essential. Understanding the logic underlying complicated ML model predictions becomes increasingly important, as these models—especially deep learning architectures—continue to advance [10]. Transparency is ensured via interpretability, which helps stakeholders understand how and why particular decisions are made. Interpretability is critical for ethical concerns, user acceptance, and regulatory compliance in areas like banking, healthcare, and criminal justice, where results have far-reaching effects. By eliminating potential biases and promoting responsible and informed decision-making, transparent models enable users, regulators, and organizations to validate, question, and enhance decision-making processes [11].

a.
Explainable AI

The term “explainable artificial intelligence” (XAI) describes the creation of AI systems that offer visible and intelligible explanations of their decision-making procedures. Since AI models—especially sophisticated ones like deep neural networks (DNNs)—often function as “black boxes,” XAI aims to close the gap between model complexity and human interpretability [12]. Enhancing the accessibility, accountability, and trustworthiness of AI systems is the aim. Feature importance analysis, model-agnostic approaches, and producing human-interpretable representations of model internals are examples of XAI methodologies. Explainability is essential in applications like healthcare, banking, and legal systems, where transparency is critical. By providing comprehensible justifications for AI judgments, XAI advances user confidence, streamlines regulatory compliance, and aids in the detection and correction of biases, eventually encouraging the responsible and ethical application of AI across a range of areas [13].

b.
Causal AI

The goal of the developing discipline of Causal AI is to introduce causality into ML models so that they can comprehend and reason about cause-and-effect linkages. While traditional ML focuses on correlations between variables, Causal AI goes beyond it by determining the causal factors underlying observed results [14]. These models, particularly in complex systems, can provide better-informed forecasts and actions by identifying cause-and-effect relationships. In domains like healthcare, economics, and decision support systems, where comprehending the consequences of actions and treatments is critical, Causal AI plays a vital role. By revealing the real causes of observed patterns, lowering biases, and improving the interpretability of AI systems, it helps make better decisions and eventually leads to more dependable and responsible applications of AI [15].

c.
Causal AI versus causal inference

Causal inference is the process of finding and understanding the relationships between variables or events in data [16]. It goes beyond identifying simple associations to determine what affects what, and how. This branch of science uses statistical and mathematical tools to establish such relationships through observational or experimental data [16]. It helps uncover the underlying processes that drive certain outcomes, rather than merely describing associations. Causal inference has important implications in many fields, including epidemiology, the social sciences, and economics, because it provides insights that guide decision-making, inform policy, and offer a deeper understanding of complex processes and situations.

In contrast, Causal AI is a specialized field of AI. It aims to embed causal reasoning into AI systems so that they can better understand and reason about cause-and-effect relationships [17]. Causal AI applies ideas from causal reasoning to AI applications such as healthcare, recommendation systems, and self-driving cars, improving decision-making and problem-solving. It bridges the gap between cause-and-effect understanding and machine intelligence, enabling AI systems to make smarter, more informed decisions.

d.
Causal AI versus XAI

XAI aims to make AI models and their outputs understandable to humans. It seeks a deeper understanding of why AI makes certain decisions and explains the model’s reasoning. To help people understand AI predictions and recommendations, XAI techniques include feature importance analysis, surrogate models, and visualization tools. The goal is to increase transparency, build user trust, and make it easier to identify biases or conflicts in AI systems [18]. Causal AI, on the other hand, focuses on understanding and modeling causal relationships in data. It combines AI with a causal approach, allowing AI to reason about cause–effect relationships and actions. Causal AI goes beyond simple associations, enabling AI to understand and predict the impacts of activities. It is particularly useful in fields such as medicine, business, and autonomous systems, where understanding the consequences of actions and making informed decisions based on causal relationships are important [18].

In summary, while XAI deals with interpreting a model’s judgments, Causal AI is concerned with building knowledge about causes and effects and using it in the reasoning process, making systems more knowledgeable and contextually aware in decisions based on causes and consequences. Both disciplines contribute to the transparency and trustworthiness of AI systems, but in different ways and for different reasons.

The motivation behind this research arises from multiple compelling factors. First, in fraud detection, the consequences of FPs or false negatives (FNs) can be severe [19]. To ensure the widespread adoption of automated fraud detection technologies, decision-making processes must be trustworthy, necessitating proficiency in data interpretation to comprehend the logic of flagged transactions. Second, stringent data protection laws and ethical considerations demand transparency in AI applications, especially concerning sensitive data like financial transactions. Compliance with regulatory frameworks and ethical standards relies on the ability of organizations to justify algorithmic decisions [20]. Finally, the comparative design of the research aims to analyze various interpretability strategies utilized in deep learning models for fraud detection. Through systematic examination and comparison of each strategy’s strengths and weaknesses, the research seeks to facilitate the development of more robust and dependable solutions.

This research advances the subject of fraud detection by conducting a comparative examination of DLE and Causal AI approaches. The study identifies the merits and disadvantages of each technique through empirical examination with real-world datasets. It emphasizes the trade-offs between model interpretability and performance, particularly in DLE models’ opaque decision-making processes, as well as Causal AI’s balanced but relatively poor performance. By giving insights into these trade-offs, the article helps decision-makers choose the best technique for fraud detection applications, particularly in finance. Furthermore, it emphasizes the crucial relevance of transparency and dependability in AI-driven systems, highlighting the need for future research efforts to prioritize interpretability alongside performance measures. These contributions help comprehend the intricacies of fraud detection approaches, paving the door for more transparent and effective AI systems in crucial sectors.

This study contributes significantly to the field of fraud detection by providing a thorough examination of DLE methodologies and Causal AI techniques. By carefully evaluating various strategies and focusing on domain knowledge integration via CausalNex, this study throws light on the trade-offs between model complexity and interpretability in fraud detection systems. The findings emphasize not just the necessity of transparency and interpretability, but also the power of causal reasoning in understanding fraudulent behavior. By addressing the constraints of simulated data and emphasizing the importance of real-world experimentation, the paper lays the groundwork for future research efforts aimed at improving the efficacy and application of Causal AI algorithms in practical fraud detection scenarios.

III.
Literature review

Recognizing how important transparency is for maintaining financial stability, the paper explores the complexities involved in interpreting deep learning models and points out issues such as nonlinearity, intrinsic black-box nature, and the intricacy of high-dimensional data. A range of interpretability techniques are investigated, such as model-agnostic strategies like local interpretable model-agnostic explanations (LIME) [21] and SHapley Additive ExPlanations (SHAP) [21].

A recurring topic in the literature is the careful balancing act between interpretability and model accuracy, which is necessary to achieve successful fraud detection. Interpretability methods must be in line with industry standards in order to comply with regulations and uphold ethical standards. Through case studies and real-world applications, the paper also highlights practical consequences and offers insights into how interpretability might be used in real-world fraud detection settings. In addition, the literature evaluation indicates new trends and suggests avenues for future research, acting as a roadmap for the ensuing comparison analysis. It establishes the groundwork for assessing and developing interpretable deep learning models that are especially designed to tackle the difficulties presented by fraud detection jobs.

Causal AI is a rapidly evolving field that has attracted considerable attention from researchers and practitioners in recent years. This area of research focuses on developing algorithms and methodologies for understanding cause-and-effect relationships in data. By integrating Causal AI principles into fraud detection systems, one can develop more advanced and accurate methods for identifying, preventing, and mitigating fraud. This innovative approach has the potential to revolutionize the field of fraud detection and significantly enhance its effectiveness.

The article [22] emphasized the importance of investing in business and academic efforts to address emerging challenges in ML, deep learning, big data, and computational intelligence. XGBoost emerges as the top performer, exhibiting the fastest computation time and the highest area under the curve (AUC) on SMOTE data. The study compares SMOTE and XGBoost against other credit risk assessment methods such as SVM, KNN, Naïve Bayes, multi-layer perceptron (MLP), Logistic Regression, and Ensemble, using AUC and Runtime as metrics. The findings highlight XGBoost’s effectiveness in producing interpretable model assessments and identifying critical features. However, challenges in recognizing new behavioral patterns are noted due to the simplicity of the SMOTE approach and the lack of newer methods addressing class imbalance.

Traditional auditing procedures sometimes require labor-intensive operations owing to the substantial collection of financial indicators and transaction data. Given the constraints of traditional ML models with unbalanced data, Yi et al. [23] proposed a fresh method for fraud detection. They presented a fraud detection system that combined ML with the Egret Swarm Optimization Algorithm (ESOA), which includes a cost-sensitive objective function and a loss function. Using the AAER benchmark dataset from UCB’s Center for Financial Reporting and Management, the technique outperformed cutting-edge algorithms in terms of accuracy (ACC), sensitivity (SEN), precision (PREC), and AUC. Notably, ESOA achieved a fraud detection accuracy of 96.27%, the highest recorded in the literature. Potential limitations included low participant numbers, potential participant selection bias, a lack of blinding resulting in information bias, subjective pain reporting, potential weather metric limitations, applicability only to the UK climate, the assumption of a uniform weather–pain relationship across participants, and consideration of within-subject dependence.

Fanai et al. [24] addressed fraud detection using a binary classification model that distinguishes fraudulent transactions. It provided a two-stage system that combined a deep autoencoder for representation learning with supervised deep learning approaches. Experimental results showed that using this strategy improved the performance of deep learning-based classifiers. Specifically, classifiers trained on deep autoencoder-transformed data beat baseline classifiers based on original data in all performance metrics. The findings demonstrated the effectiveness of the suggested strategy in increasing the performance of deep learning-based fraud detection systems.

Financial statements are important analytical reports that financial organizations produce on a regular basis, providing insights into their performance from numerous perspectives. Given their critical significance in decision-making for stakeholders like creditors, investors, and auditors, certain institutions may alter these reports in order to deceive and commit fraud. Fraud detection in financial statements aims to identify abnormalities caused by such distortions and to distinguish between fraudulent and non-fraudulent reports. Binary classification is a common data mining method in this sector, but its success is dependent on access to standardized labeled datasets, which are generally limited due to the scarcity of fraudulent cases in real-world data. To address this issue, Aftabi et al. [25] provided a unique technique that combined generative adversarial networks with ensemble models, providing a solution to both the scarcity of non-fraudulent samples and the complexity of high-dimensional feature spaces.

Jiang et al. [26] discussed current credit card fraud detection algorithms and offered a new credit card fraud detection network based on the unsupervised attentional anomaly detection paradigm. The Fraud Detection Dataset showed that their strategy outperformed existing ML and deep learning-based fraud detection approaches. Furthermore, they validated the proposed UAAD-FDNet’s effectiveness using Kaggle’s credit card fraud detection dataset. According to a comparative examination, their methodology outperformed long short-term memory (LSTM), convolutional neural network (CNN), and MLP approaches in terms of fraud detection. Furthermore, the findings from the detection dataset showed that their suggested strategy efficiently addressed data imbalance concerns, resulting in better fraud detection performance.

In recent years, significant economic growth has resulted in the steady proliferation of financial fraud (FF). A systematic analysis is provided by including extensive definitions of mathematical expressions for CNNs, support vector regression, the suggested approach, and performance assessment measures. Huang et al. [27] used root mean square error to evaluate fraud detection performance for each transfer record, comparing CNN to the suggested technique. To validate the efficacy of the suggested approach, it is compared with two prominent AI methods, CNN and SVR, which used the identical transfer record dataset. The proposed fraud detection system combined deep learning techniques such as Attentive Transformer, Feature Transformer, and auxiliary operations. Experimental results show a considerable boost in detection performance over various major AI systems.

Valavan and Rita [28] examined the performance of several ML methods, such as decision tree (DT), random forest (RF), linear regression (LR), and gradient boosting (GB), in identifying and forecasting fraud situations based on loan fraud manifestations. The Gini score was used to set the DT construction and data split criteria, with RF models starting with 100 trees but overfitting at 60 trees. The goal was to investigate, assess, and compare the effectiveness of several ML algorithms in identifying persons with a high risk of loan default. The RF classifier obtained 80% accuracy, whereas LR achieved 70%. Notably, the RF model outperformed LR in predicting loan defaulters, with an overall accuracy of 91.53%, precision of 77.22%, recall of 78.16%, and f1-score of 71.22%. Traditional ML approaches were chosen due to the dataset’s tabular nature and continuous values.

Credit cards are essential in today’s digital economy, but their expanded use has resulted in a boom in credit card fraud. ML methods have proved useful in detecting fraud, however, dynamic shopping patterns and class imbalance provide barriers to optimal classifier performance. Mienye et al. [29] tackled these challenges by presenting a strong deep-learning technique that uses LSTM and gated recurrent unit neural networks as base learners in a stacking ensemble architecture, with an MLP acting as the meta-learner. To address class imbalance, the hybrid synthetic minority oversampling methodology and the modified closest neighbor method are used. Experimental findings showed that their methodology outperformed other frequently used ML classifiers and approaches in the literature, with a sensitivity and specificity of 1.000 and 0.997, respectively.

Fakiha [30] investigated forensic credit card fraud detection using DNN approaches. Given the importance of high true positive (TP) rates and low FPs in fraud detection systems, there is an urgent need for improved detection methods. The study emphasized constructing a deep learning method, specifically an LSTM model combined with different ML approaches, for forensic credit card fraud transaction detection. The model’s validation, which used two credit card transaction datasets, resulted in much higher sensitivity in detecting credit card fraud incidents. When compared with other methodologies, the model outperformed them all in terms of forensic credit card fraud identification.

Zhou et al. [31] focused on generating user-centered explanations for FF detection models using XAI methods. By integrating an ensemble predictive model with an explainable framework based on Shapley values, the authors devised an accurate and interpretable approach for FF detection. Their results demonstrated that the explainable framework satisfies the needs of diverse external stakeholders by offering both local and global explanations. Local explanations aided in understanding why particular predictions are flagged as fraud, while global explanations unveiled the overarching logic of the ensemble model.

The research study by Raval et al. [32] offered RaKShA, a credit card (CC) fraud detection technique that combined XAI with LSTM as X-LSTM, verified via smart contracts (SCs). Using the interplanetary file system (IPFS) on a public Blockchain (BC), the approach improved CC financial fraud (CC-FF) detection significantly. X-LSTM considerably increased LSTM’s power, ensuring scalability and adaptation to FF situations. After 500 epochs, the XAI-enhanced model achieved 99.8% accuracy, outperforming LSTM by 17.41%. RaKShA is a useful, cost-effective CC-FF auditing tool since SCs and the public BC enable universal access and verifiability of fraud detection data. Their paper acknowledged possible limits and suggested future advancements, emphasizing the scheme’s continuing progress and refinement.

Hamza et al. [33] focused on money laundering difficulties in financial institutions, contextualizing transactions inside the SWIFT network. It provided a multifraud categorization approach based on ML techniques for detecting inactive accounts, smurfing, and large-scale fraud. Their study intends to improve fraud detection efficacy while taking into account the financial institution’s operating costs and risk management techniques.

Zhang et al. [34] created a financial risk early warning model called the D-S Evidence Theory-XGBoost (DS-XGBoost) model, and then used SHAP to assess the model’s explainability. Using data from China’s listed manufacturing businesses from 2012 to 2021, the study combined financial indicators, corporate governance, and management perception to create a complete financial risk early warning system. Their results showed that the model improved financial risk prediction performance, quantified feature contributions and correlations, increased explainability and dependability, and provided information consumers with a more precise foundation for decision-making.

Sahoh et al. [35] unveiled a causal graph aligned with cognitive understanding, measuring causal odds ratios to assess compatibility with real-world data. Causal effect relationships are rigorously verified via causal P-values and confidence intervals, demonstrating less than a 1% chance of random occurrence. These findings underscored the model’s proficiency in encoding precise, robust relationships, facilitating the emulation of human intelligence by software agents. Such advancements enabled their deployment in critical decision-making applications, enhancing their capacity to navigate complex scenarios effectively.

Kumar et al. [36] emphasized the importance of causal inference in behavioral science research, rather than statistical approaches. They provided a detailed assessment of causal inference approaches in the Banking, Financial Services, and Insurance (BFSI) industry, highlighting a recent spike in their growth. Despite this, causal inference remains largely unexplored in banking and insurance, necessitating more study to determine its usefulness. The paper outlined outstanding research questions, with a special emphasis on defining counterfactual optimization as a multi-objective optimization (MOO) issue. To overcome these problems, proposed methods include using MOO algorithms such as Non-dominated Sorting Genetic Algorithm II (NSGA-II), Multi-Objective Evolutionary Algorithm Based on Decomposition (MOEA/D), and Non-dominated Sorting Particle Swarm Optimization (NSPSO).

a.
Summary of the literature review

The preceding literature assessment highlighted major research gaps in the field of fraud detection, specifically the limits of established methodologies for revealing causal links and responding to changing fraudulent strategies. Furthermore, the ramifications of false negatives (FNs) in fraud detection were discussed, underlining the necessity for stronger security measures. The research article distinguishes itself by addressing these shortcomings using an innovative approach: methodically comparing and assessing two approaches, DLE with XAI and Causal AI, in the context of fraud detection. This study addresses a critical gap in the literature by not only increasing the transparency of deep learning models, but also investigating the potential of causal reasoning to improve comprehension and decision-making processes. By rigorously examining and comparing the benefits and limits of each approach, the work gives vital insights for developing more reliable and transparent fraud detection systems.

The study is unique in that it is the first to integrate Causal AI into fraud detection, as opposed to XAI, indicating a significant advancement in the field. This unique comparative examination not only broadens present understandings, but also lays the groundwork for future research and real-world applications of explainable and causal reasoning methodologies. Crucially, the research is significant because it takes a creative approach to fraud detection, bringing a new lens that promises increased accuracy and interpretability. The research has far-reaching implications for reducing financial losses and strengthening security measures in a variety of industries. This work not only enhances theoretical understanding, but also has practical implications, potentially altering fraud detection tactics and increasing trust in automated systems.

IV.
Methodology
a.
Dataset

The Bank Account Fraud (BAF) suite of datasets has been published at NeurIPS 2022, and it comprises six different synthetic BAF tabular datasets. BAF is a realistic, complete, and robust test bed to evaluate novel and existing methods in ML and fair ML, and is the first of its kind [37].

This suite of datasets is as follows:

  • Realistic: It is based on a present-day real-world dataset for fraud detection;

  • Biased: Each dataset has distinct controlled types of bias;

  • Imbalanced: This setting presents an extremely low prevalence of positive class;

  • Dynamic: It has temporal data with observed distribution shifts;

  • Privacy preserving: It protects the identity of potential applicants through differential privacy techniques (noise addition), feature encoding, and the training of a generative model (CTGAN).

The six different synthetic BAF tabular datasets are shown in Table 1.

Table 1:

Different synthetic BAF tabular datasets

Variant       Description
Base          Sampled to best represent original dataset
Variant I     Has higher group size disparity than base
Variant II    Has higher prevalence disparity than base
Variant III   Has better separability for one of the groups
Variant IV    Has higher prevalence disparity in train
Variant V     Has better separability in train for one of the groups

BAF, Bank Account Fraud.

Each dataset is composed of the following:

  • 1 million samples: Each dataset contains 1 million samples and is designed to support analysis at scale. This size allows a detailed examination of the underlying patterns in the data.

  • 30 real-world features for fraud detection: Each dataset is supplemented with 30 real-world features carefully selected to reflect the complexity of fraud detection. These features capture a variety of variables relevant to fraud, allowing fraudulent behavior to be identified more accurately.

  • Time data column (month): The data contain a dedicated temporal column, “month.” This temporal dimension allows investigation of temporal patterns and shifts in the data, providing important information about the dynamics of fraudulent activity over time.

  • Protected characteristics: The data include important protected characteristics such as age group, occupation, and income percentage. These features play an important role in maintaining the integrity and fairness of the analysis by helping to detect and prevent bias and unfairness. By highlighting the protected features, the dataset allows researchers to examine their impact on fraud outcomes, thereby increasing transparency and accountability in the investigation process.
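
For illustration, one of the BAF variants can be loaded and inspected with pandas. The following is a minimal sketch; the file name and the column names fraud_bool, month, customer_age, income, and employment_status are assumed to match the publicly released BAF files and should be adjusted to the local copy.

```python
import pandas as pd

# Load one BAF variant (file name assumed; adjust to the local copy)
df = pd.read_csv("Base.csv")

# Basic checks: size, fraud prevalence, and temporal coverage
print(df.shape)                                   # ~1,000,000 rows, ~31 columns incl. target
print(df["fraud_bool"].mean())                    # prevalence of the positive (fraud) class
print(df["month"].value_counts().sort_index())    # samples per month (temporal column)

# Protected attributes can be inspected to audit fairness of downstream models
for col in ["customer_age", "income", "employment_status"]:
    print(df.groupby(col)["fraud_bool"].mean().head())
```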

b.
Proposed methodology

The methodology (Figure 1) encompasses several key steps. First, data collection involves obtaining pertinent datasets for fraud detection, ensuring diversity in fraud types, transaction volumes, and contextual information. Data preprocessing and cleaning are conducted to address missing values, outliers, and maintain consistency across datasets. Second, model selection entails choosing representative deep learning architectures such as CNNs, RNNs, and transformer models. Baseline models are trained on selected datasets to optimize fraud detection performance. Third, interpretability techniques, including SHAP values, LIME, feature importance analysis, and layer-wise relevance propagation, are implemented on trained models. Integration with chosen deep learning architectures is ensured. To enhance interpretability, Causal AI methods are incorporated into deep learning models to explore the role of causality in understanding fraud detection decisions. Fourth, quantitative performance metrics are defined to evaluate both model accuracy and interpretability in fraud detection. Metrics such as precision, recall, F1-score, and area under the receiver operating characteristic curve (AUC-ROC) are considered, along with appropriate interpretability metrics. Finally, trade-off analysis involves systematically adjusting interpretability technique parameters to investigate the balance between model performance and interpretability. Different interpretability thresholds are considered to quantify the impact of interpretability on model accuracy and vice versa.

Figure 1:

Research methodology flowchart.

c.
Computational environment

The deep learning model was developed using Google Colab, a cloud-based platform that provides access to graphics processing unit (GPU) resources for accelerated model training. The runtime environment was set to Python3 to leverage the latest features and libraries. Specifically, the model benefited from the hardware accelerator T4 GPU, which significantly expedited the computation-intensive tasks involved in training DNNs. Additionally, explanations of the model’s predictions were generated using LIME and KernelSHAP, aiding in understanding the model’s decision-making process by providing insights into feature importance and contribution to individual predictions. Training the deep learning models and then explaining them with LIME and SHAP took around 45–60 min.

The computational environment utilized in this research study comprises an Intel Core i5-10300H CPU operating at 2.50 GHz, 8 GB of RAM, and a 64-bit Windows 10 Home Single Language operating system (version 22H2). The research employed Anaconda version 3 as the Python environment, with Python version 3.11.5 installed. This configuration facilitated implementation of the CausalNex library for creation of Bayesian networks (BNs), enabling the analysis of causal relationships within the dataset. Each creation of the BN took around 60–120 min.

d.
Data preprocessing

Because raw datasets frequently contain noise, missing values, and inconsistencies that can impede the performance of ML algorithms, data preprocessing is essential. Preprocessing ensures that the information fed into models is accurate, complete, and appropriately formatted by carefully cleaning, transforming, and organizing the data, enhancing the models’ ability to learn meaningful patterns and make accurate predictions. Following are the steps undertaken in preprocessing:

  • Handling null values: If the amount of missing data is small and random, one can delete rows or columns with missing values. Otherwise, one can use imputation, which entails filling in the blanks with estimated or calculated values.

  • Converting categorical columns to numerical: Since ML models are designed to handle numerical data, categorical data needs to be transformed before being used in ML models. Label encoding, one-hot encoding, and binary encoding are all methods of converting categorical data into numerical data.

  • Standardization of data: Data standardization is the process of putting data into a common format to ensure consistency and comparability across different sources or systems. This often involves scaling numbers to have a mean of 0 and a standard deviation of 1, increasing the interpretability and performance of ML algorithms. Standardization of data will reduce the difference in scales and units, allowing for a clear analysis and decision-making process.

  • Balancing the dataset: Balancing the dataset is altering the distribution of classes or categories in the data to reduce biases and improve model performance. This can be accomplished by either oversampling minority classes, undersampling majority classes, or using more complex techniques such as synthetic data. Balancing the dataset guarantees that the model is not biased toward predicting the majority class and can learn from all available data, resulting in more accurate and trustworthy predictions.
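
The four steps above can be sketched as follows. This is a minimal example assuming a pandas DataFrame df with a binary target column fraud_bool; the median imputation, label encoding, and SMOTE oversampling shown here are illustrative choices rather than the exact pipeline used in the study.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler
from imblearn.over_sampling import SMOTE  # one possible balancing technique

# 1. Handle null values: simple median imputation for numeric columns
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

# 2. Convert categorical columns to numerical (label encoding shown here)
for col in df.select_dtypes(include="object").columns:
    df[col] = LabelEncoder().fit_transform(df[col].astype(str))

# 3. Standardize features to zero mean and unit variance
X = df.drop(columns=["fraud_bool"])
y = df["fraud_bool"]
X_scaled = StandardScaler().fit_transform(X)

# 4. Balance the dataset by oversampling the minority (fraud) class
X_bal, y_bal = SMOTE(random_state=42).fit_resample(X_scaled, y)
```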

e.
Deep learning model

A deep learning model is an artificial neural network with several layers that interpret input data in progressively abstract ways. These models excel in extracting intricate patterns from big datasets, allowing for applications such as image recognition, natural language processing, and speech synthesis [38]. Deep learning models may extract features and produce high-accuracy predictions across multiple domains by iteratively modifying internal parameters using a method known as backpropagation [39].

The first trained model is a sequential model, meaning the layers are stacked linearly. It starts with a convolutional layer [40] (Conv1D) followed by a max pooling layer [40] (MaxPooling1D). The Conv1D layer extracts features from the input data, whereas the MaxPooling1D layer reduces the spatial size of the feature map. After the pooling layer, the data are flattened into a one-dimensional array. After the flattening layer, there are two dense layers. These details are summarized in Table 2. The first dense layer learns complex patterns in the extracted features, and the final dense layer makes the prediction. The model summary lists the parameters, which are the values learned during training, and indicates whether those parameters are trainable. This model is suited to tasks involving sequential data, such as time series, and the single output unit of the final layer indicates binary classification.

Table 2:

Layers of the first deep learning model

Model: “sequential”

Layer (type)                  Output shape      Param #
conv1d (Conv1D)               (None, 29, 64)    256
max_pooling1d (MaxPooling1D)  (None, 14, 64)    0
flatten (Flatten)             (None, 896)       0
dense (Dense)                 (None, 64)        57,408
dense_1 (Dense)               (None, 1)         65

Total params: 57,729
Trainable params: 57,729
Non-trainable params: 0
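
A sketch of this architecture in Keras is shown below. The kernel size, input shape (a sequence length of 31 with one channel), and activation functions are inferred or assumed from the output shapes and parameter counts in Table 2, and may differ from the authors' exact configuration.

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Conv1D, MaxPooling1D, Flatten, Dense

# First model: Conv1D -> MaxPooling1D -> Flatten -> Dense -> Dense (sigmoid)
model_1 = Sequential([
    Conv1D(64, kernel_size=3, activation="relu", input_shape=(31, 1)),  # -> (29, 64), 256 params
    MaxPooling1D(pool_size=2),                                          # -> (14, 64)
    Flatten(),                                                          # -> (896,)
    Dense(64, activation="relu"),                                       # -> (64,), 57,408 params
    Dense(1, activation="sigmoid"),                                     # single unit for fraud / non-fraud
])
model_1.summary()
```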

The second model is a sequential neural network architecture designed to handle sequential data. It starts with a Conv1D layer, which uses one-dimensional convolution with 128 filters and a kernel size of three to extract local patterns from the input sequence. Following that, a MaxPooling1D layer is used to downsample the data by selecting the maximum value within a fixed-sized window. The flattened layer then reshapes the output into a one-dimensional tensor, preparing it for the next fully connected layers. Two dense layers follow: the first with 128 neurons, and the second as the output layer with a single neuron, which is commonly used for binary classification tasks. These details are summarized in Table 3.

Table 3:

Layers of the second deep learning model

Model: “sequential”

Layer (type)                  Output shape       Param #
conv1d (Conv1D)               (None, 27, 128)    768
max_pooling1d (MaxPooling1D)  (None, 13, 128)    0
flatten (Flatten)             (None, 1664)       0
dense (Dense)                 (None, 128)        213,120
dense_1 (Dense)               (None, 1)          129

Total params: 214,017
Trainable params: 214,017
Non-trainable params: 0
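
Either model can then be compiled and trained for binary classification. The optimizer, loss, epoch count, and batch size below are illustrative assumptions, with X_train and y_train denoting the preprocessed features and labels.

```python
import numpy as np
import tensorflow as tf

# Conv1D layers expect a 3-D input: (samples, sequence_length, channels)
X_train_seq = np.expand_dims(X_train, axis=-1)

model_1.compile(optimizer="adam",
                loss="binary_crossentropy",
                metrics=["accuracy",
                         tf.keras.metrics.Precision(),
                         tf.keras.metrics.Recall()])

history = model_1.fit(X_train_seq, y_train,
                      validation_split=0.2,
                      epochs=10,
                      batch_size=256)
```
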
f.
Accuracy assessment

These are common evaluation metrics used in classification tasks to assess the performance of an ML model.

  • Accuracy: It evaluates the model’s overall correctness in predicting the classes. It is the number of accurately predicted instances (TPs and true negatives [TNs]) divided by the total number of instances [41]:

    (1) \( Accuracy = \frac{True\;Negative + True\;Positive}{True\;Negative + False\;Negative + True\;Positive + False\;Positive} \)

  • Precision: It reflects how many of the projected positive cases actually occur. It is the ratio of genuine positives to the sum of TPs plus FPs [41]:

    (2) \( Precision = \frac{True\;Positive}{True\;Positive + False\;Positive} \)

  • Recall: It assesses how many genuine positive examples are captured by the model. It is the ratio of TPs to the sum of TPs plus FNs [41]:

    (3) \( Recall = \frac{True\;Positive}{False\;Negative + True\;Positive} \)

  • F1 Score: It is the harmonic mean of precision and recall, yielding a single score that balances the two criteria. It is useful when you want to strike a balance between precision and recall [41]:

    (4) \( F1\;Score = 2 \times \frac{Precision \times Recall}{Precision + Recall} \)

  • Confusion matrix: It is a table that is frequently used to describe a classification model’s performance on a collection of test data that contains known true values. It is a matrix, with each row representing examples in an actual class and each column representing instances in a forecast class [41].

    In the confusion matrix:

    • The top-left cell represents TNs (instances correctly predicted as negative);

    • The top-right cell represents FPs (instances incorrectly predicted as positive);

    • The bottom-left cell represents FNs (instances incorrectly predicted as negative);

    • The bottom-right cell represents TPs (instances correctly predicted as positive).
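
For reference, all of these quantities can be computed with scikit-learn. The sketch below assumes y_test holds the true labels and y_prob the predicted fraud probabilities, with 0.5 as an illustrative decision threshold.

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix)

y_pred = (y_prob >= 0.5).astype(int)  # threshold the predicted probabilities

print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1 score :", f1_score(y_test, y_pred))
print("AUC-ROC  :", roc_auc_score(y_test, y_prob))

# Rows are actual classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_test, y_pred))
```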

V.
Results

In this section, the empirical findings of a study comparing DLE with Causal AI for fraud detection are presented. Both methodologies’ performance, explainability, robustness, and computational efficiency are examined through comprehensive testing and analysis. The purpose is to provide insights that help comprehend the trade-offs and benefits of each technique in the context of fraud prevention.

a.
Deep learning model

Table 4 presents the evaluation metrics of the deep learning models, showcasing their performance across key indicators.

Table 4:

Evaluation matrix of the deep learning models

Model      Accuracy (%)    Precision (%)    Recall (%)    F1 score (%)
Model 1    94.64           97.42            97.70         94.47
Model 2    96.34           96.06            96.64         96.35

In the confusion matrix shown in Figure 2A, a classification scenario is presented where the model makes predictions on the dataset with the following counts: 289,505 TNs; 7,190 FPs; 24,613 FNs; and 272,078 TPs. A TN is a case where the model correctly identifies a negative outcome, whereas an FP is when the model incorrectly predicts a positive outcome. An FN is a case where the model incorrectly predicts a negative outcome, whereas a TP is when the model correctly detects a positive outcome. These figures show the model’s performance across prediction scenarios and reveal its strengths and weaknesses in classifying the data.

Figure 2:

Confusion matrix of the (A) first and (B) second deep learning models.

In the confusion matrix shown in Figure 2B, a classification scenario is presented where the model makes predictions on the dataset with the following counts: 274,280 TNs; 22,412 FPs; 3,395 FNs; and 293,296 TPs. As before, these figures show the model’s performance across prediction scenarios and reveal its strengths and weaknesses in classifying the data.

While the reported accuracy, precision, recall, and F1 score reflect the model’s overall performance, further investigation of the classification findings is required, particularly in terms of FNs. The high number of FNs suggests that the classification algorithm failed to detect fraud. This omission could result in financial losses, reputational damage, and regulatory non-compliance for firms that rely on fraud detection technologies. Furthermore, FNs might enable fraudulent operations to go unnoticed, emboldening perpetrators and compounding the impact of fraudulent conduct over time.

b.
LIME

LIME is an ML technique that provides explanations for the predictions of complex models, regardless of model type [42]. It tries to improve model transparency and interpretability by producing locally faithful explanations that help explain why a model produced a specific prediction for a given instance.

LIME accomplishes this by approximating the behavior of the black-box model around the instance of interest using local surrogate models, such as linear models, with more accessible explanations [42]. By perturbing the input data and analyzing the changes in predictions, LIME detects relevant features and weights them, allowing users to obtain insights into the underlying model’s decision-making process at the local level.
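
A minimal sketch of producing such an explanation with the lime library is shown below; the feature names, class names, and the probability wrapper around the trained Keras model are illustrative assumptions.

```python
import numpy as np
from lime.lime_tabular import LimeTabularExplainer

# LIME expects a function returning probabilities for each class, so the single
# sigmoid output is expanded to [P(non-fraud), P(fraud)].
def predict_proba(x):
    p = model_1.predict(np.expand_dims(x, axis=-1)).ravel()
    return np.column_stack([1 - p, p])

explainer = LimeTabularExplainer(
    training_data=X_train,            # 2-D numpy array of training features
    feature_names=feature_names,      # list of the 30 feature names
    class_names=["Non-Fraud", "Fraud"],
    mode="classification",
)

# Explain a single test instance with its top 10 contributing features
exp = explainer.explain_instance(X_test[0], predict_proba, num_features=10)
print(exp.as_list())
```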

Results (Figure 3A) from LIME provide insight into the key features that help classify instances as “Fraud” or “Non-Fraud.” Notably, variables such as “user age,” “monthly_session_length,” and “employment_status” are more important in predicting cases flagged as “Fraud.” By contrast, features such as “device_distinct_emails_8w,” “has_other_cards,” “keep_alive_session,” “name_email_similarity,” “housing_status,” “proposed_credit_limit,” and “bank_branch_count_8w” have a significant impact on the “Non-Fraud” class. The final classification is determined by the model’s decision thresholds applied to these features, with each feature contributing to the classification outcome. This explains how the model uses particular feature values and thresholds to generate predictions, providing insight into the underlying mechanisms that drive the classification process.

Figure 3:

LIME explanation for single instance for (A) first and (B) second deep learning model. LIME, local interpretable model-agnostic explanations.

The insights generated from the LIME output shown in Figure 3B show that features such as “session_length_in_minutes” and “e-mail_is_free” play a significant influence in identifying occurrences categorized as “Fraud,” emphasizing their importance in differentiating fraudulent activity. By contrast, attributes such as “source,” “device_distinct_emails_8w,” “has_other_cards,” “foreign_request,” “keep_alive_session,” “bank_months_count,” “days_since_request,” and “prev_address_months_count” emerge as key contributors to the “Non-Fraud” class, highlighting their importance in identifying legitimate transactions. The final classification decision is governed by the deep learning model’s thresholds, which use the distinct patterns captured by these features to effectively differentiate between fraudulent and non-fraudulent instances, improving the model’s predictive accuracy and reliability.

c.
KernelSHAP

KernelSHAP is a variation of SHAP, a well-known technique for explaining ML model output. It is specifically used to explain individual model predictions by approximating the Shapley values, which quantify each feature’s contribution to the prediction [43].

KernelSHAP approximates Shapley values by selecting subsets of features and estimating the model’s output for each one. These estimations are then blended to form a weighted average, with the kernel function determining the weights [43]. This method enables KernelSHAP to efficiently provide local explanations for individual predictions, making it ideal for big and complex models.
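
The sketch below shows one way to apply KernelSHAP with the shap library, reusing the predict_proba wrapper from the LIME sketch above; the background-sample size and the number of explained instances are illustrative assumptions.

```python
import shap

# A small background sample keeps KernelSHAP's runtime manageable
background = shap.sample(X_train, 100)

explainer = shap.KernelExplainer(predict_proba, background)

# Approximate Shapley values for a handful of test instances;
# with a two-class probability function, shap_values is typically a list with one array per class
shap_values = explainer.shap_values(X_test[:50])

# Force plot for one prediction and a global summary plot for the fraud class
shap.force_plot(explainer.expected_value[1], shap_values[1][0],
                X_test[0], feature_names=feature_names, matplotlib=True)
shap.summary_plot(shap_values[1], X_test[:50], feature_names=feature_names)
```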

The force plot, commonly used in SHAP, offers an intuitive visualization of feature contributions to individual predictions made by ML models. Each vertical line in the plot represents a feature, with colored segments indicating the magnitude and direction of their impact on the model’s prediction. The length and direction of these segments signify the extent to which each feature pushes the model’s output towards higher or lower values compared with the base value. By visually depicting how individual features influence a specific prediction relative to the model’s overall average prediction, force plots enable users to grasp the factors driving the model’s decisions at the instance level, facilitating model interpretation and understanding.

This force plot (Figure 4A), created with KernelSHAP, illustrates the link between features and the target variable. The x-axis shows the features, while the y-axis shows the target variable. Line color signifies strength (red for positive, blue for negative), whereas line width represents uncertainty. Key features such as date_of_birth, distinct_emails_4w, credit_score, and payment_type have significant positive associations with the target variable, indicating greater values. Less essential features, such as phone_home_valid and quantity, display weak negative associations and have lower values. The plot’s fundamental insight is that it uses line width to illustrate uncertainty, with broader lines indicating higher uncertainty in the relationship between features and the target variable.

Figure 4:

Force plot of the (A) first deep learning model and (B) second deep learning model.

This force plot (Figure 4B) shows the correlations between credit_risk_score, phone_home_valid, and ba_risk_score. Credit_risk_score ranges from 0 to 1, showing credit default risk; phone_home_valid varies from 0 to 1, suggesting the chance of answering the phone; and ba_risk_score ranges from 0 to 1, indicating bankruptcy risk. The plot shows a substantial positive association between credit_risk_score and ba_risk_score, showing that higher credit risk is associated with higher bankruptcy risk. By contrast, there is a large negative link between phone_home_valid and ba_risk_score, indicating that consumers who answer calls are less likely to file for bankruptcy. There is also a modest positive association between credit_risk_score and phone_home_valid, showing that consumers with higher credit risk are somewhat more likely to answer calls. This visualization demonstrates how these factors interact, providing critical insights for risk assessment and consumer engagement strategies.

The KernelSHAP summary plot (Figure 5) depicts a model for estimating loan default probabilities. Each feature’s mean SHAP value shows the average influence on predictions, with the most important factors at the top. “intended_balcon_amount,” “payment_type,” and “phone_home_valid” are particularly important, since they reflect financial condition and repayment capability. Other useful features include “proposed_credit_limit” and “credit_risk_score,” which probe into credit history and bank relationships. Furthermore, the graphic illustrates the SHAP value distributions for each attribute, demonstrating their various consequences. For example, “intended_balcon_amount” has a variety of effects on default likelihood. In essence, the SHAP summary plot provides a complete overview of prediction impacts, highlighting important variables that are critical to understanding and interpreting the model’s behavior in assessing default risk.

Figure 5:

Summary plot of the first deep learning model.

The SHAP summary graphic depicts feature effects on model output. The x-axis represents features, whereas the y-axis represents SHAP values. The color of the bar reflects the direction of impact (red for positive, blue for negative), and the height of the bar represents the amount of the impact. Key characteristics, such as intended_balcon_amount and velocity_4w, have a large influence at the plot’s top. By contrast, characteristics having little significance, such as device_os and bank_months_count, are near the bottom of Figure 6. This graphic helps to identify important model characteristics and understand their impact on predictions. It is a useful tool for model interpretation and decision-making, providing insights into which factors influence model outputs and to what extent.

Figure 6:

Summary plot of the second deep learning model.

d.
Causal AI–CausalNEX Library

CausalNex is a Python module that employs BNs to integrate ML with domain knowledge for causal reasoning. It can be used to discover structural correlations in the data, learn complex distributions, and assess the effectiveness of prospective interventions [44].

The library supports causal reasoning and BNs. It includes tools for creating, assessing, and interpreting causal models from data. CausalNex allows users to perform causal inference tasks such as detecting causal links between variables, evaluating intervention effects, and forecasting unobserved variable outcomes [44]. Here is a summary of some main features and functionalities of CausalNex:

  • BNs: CausalNex enables you to design BNs, which are graphical models that depict probabilistic correlations between variables. BNs are effective for modeling complicated systems that contain uncertainty and dependency.

  • Structural causal models: CausalNex facilitates the creation of Structural Causal Models (SCMs), which express causal relationships between variables using structural equations. SCMs are useful for analyzing and simulating the consequences of interventions.

  • Causal discovery techniques: The collection includes techniques for determining the structure of BNs using observational data. These algorithms assist in identifying causal linkages and determining the underlying causal mechanisms from data.

  • Integration with Pandas: The library works perfectly with the Pandas library, making it simple to work with tabular data and conduct causal analyses on real-world datasets.

In the context of CausalNex, the first step is to create a structured causal model (SCM), which serves as the core framework for building a BN. Three unique SCMs were designed and then rigorously tested. The findings of these tests are offered below for further study and interpretation. The initial SCM, depicted in Figure 7A, was developed with a naive design wherein every attribute has an outward edge directed toward the target variable.

Figure 7:

(A) First SCM; (B) Second SCM; and (C) Third SCM. SCM, structured causal model.

The second SCM, depicted in Figure 7B, was created using core logical concepts and basic financial domain knowledge. Unlike its predecessor, this model does not have direct linkages to the target variable. The third SCM, depicted in Figure 7C, was built using the CausalNex library’s from_pandas method. This function allows an SCM to be created directly from a pandas DataFrame, a popular data manipulation tool in the Python community. The from_pandas function examines the DataFrame’s structure and infers causal linkages between variables based on observed data patterns. Using this method, one can speed up the SCM generation process and draw causal insights from tabular data without having to manually specify causal linkages.
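
A minimal sketch of this workflow with CausalNex is shown below; the edge-pruning threshold, the discretised DataFrames, and the target column name fraud_bool are illustrative assumptions rather than the study's exact settings.

```python
from causalnex.structure.notears import from_pandas
from causalnex.network import BayesianNetwork
from causalnex.evaluation import classification_report

# Learn the graph structure from a fully numeric DataFrame (NOTEARS algorithm)
sm = from_pandas(train_df)
sm.remove_edges_below_threshold(0.8)   # prune weak edges (threshold is illustrative)
sm = sm.get_largest_subgraph()         # keep a single connected DAG

# Fit a Bayesian network on discretised data using the learned structure
bn = BayesianNetwork(sm)
bn = bn.fit_node_states(discretised_df)
bn = bn.fit_cpds(discretised_train_df, method="BayesianEstimator", bayes_prior="K2")

# Evaluate how well the network predicts the fraud label on held-out data
print(classification_report(bn, discretised_test_df, "fraud_bool"))
```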

Three BNs were methodically created to reflect the intricate causal links inside a given system or area, building on the previously established SCMs. These BNs underwent extensive training in which they were given relevant data to learn and adjust their probabilistic parameters. Subsequently, rigorous testing protocols were used to measure the BNs’ predictive power and their performance in inferring unknown variables or outcomes from observed data. The outcomes of these training and testing processes are presented below, providing insights into the created BNs’ effectiveness and dependability in capturing underlying causal structures and producing informed predictions within the given environment.

The poor performance of the CausalNex BNs, shown in Table 5, can be linked to a number of significant design and operational factors. To begin with, the model’s reliance on the quality and quantity of training data is critical, and when confronted with noisy, incomplete, or biased data, the network’s performance suffers. Furthermore, BNs, such as those built with CausalNex, rely on fundamental assumptions about variable independence and causal linkages. When these assumptions are violated within the dataset, as is common in real-world circumstances, the model struggles to reflect the underlying dynamics accurately, resulting in poor performance. Furthermore, the inherent complexity of BNs exacerbates these issues, especially when dealing with a large number of variables or complex causal linkages.

Table 5:

Evaluation matrix of the BNs

        Accuracy (%)    Precision (%)    Recall (%)    F1 score (%)
SCM 1   49.32           55.66            50.79         36.59
SCM 2   59.27           60.50            59.02         57.67
SCM 3   54.94           63.00            55.65         47.95

BNs, Bayesian networks.

In such cases, estimating the model’s parameters precisely becomes extremely challenging, further limiting its predictive potential. Taken together, the restrictions arising from data quality, violated assumptions, and model complexity explain the underperformance of the CausalNex BNs and underline the need for ongoing refinement to address these inherent issues.

d.i
Validation of Causal AI model

To validate the findings and ensure the robustness of this work, a series of experiments was performed in which the CausalNex model was trained on additional datasets. To keep the data preparation consistent, each new dataset was processed with the same procedures as the original dataset, as outlined in the sketch that follows. This approach examines whether the findings on the interplay between data quality and model performance generalize across data sources. The performance of the trained CausalNex model on the new data was then assessed and compared with the earlier experiments, providing an assessment of the robustness of the findings within the CausalNex framework.
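One way to keep the preparation identical across datasets is to wrap it in a single routine and apply it to each source, as in the hedged sketch below. Quantile binning via pandas stands in for whatever discretisation the study actually used, and the file paths and target column names are hypothetical.

```python
import pandas as pd
from causalnex.network import BayesianNetwork
from causalnex.structure.notears import from_pandas

def prepare(df: pd.DataFrame, target: str, bins: int = 4) -> pd.DataFrame:
    """Apply the same preparation to every dataset: drop incomplete rows and
    bin numeric features into equal-frequency buckets (stand-in discretisation)."""
    df = df.dropna()
    for col in df.columns:
        if col != target and pd.api.types.is_numeric_dtype(df[col]):
            df[col] = pd.qcut(df[col], q=bins, labels=False, duplicates="drop")
    return df

# Hypothetical file names and target columns for the three additional datasets.
datasets = {
    "new_data_1": ("paysim.csv", "isFraud"),
    "new_data_2": ("card_transactions.csv", "fraud"),
    "new_data_3": ("merchant_transactions.csv", "isFraudulent"),
}

models = {}
for name, (path, target) in datasets.items():
    data = prepare(pd.read_csv(path), target)              # identical preprocessing step
    sm = from_pandas(data).get_largest_subgraph()           # identical structure learning
    models[name] = BayesianNetwork(sm).fit_node_states(data).fit_cpds(data)
```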

The first new dataset [45] used in the study is a simple fraud dataset with nine features, including attributes such as the time step, the transaction rate, and the transaction type. It contains a total of 6,362,620 samples, providing a large and diverse collection for training and evaluation. Despite its simple structure, the dataset offers valuable information for testing the performance of the CausalNex models.

The second dataset [46] consists of transactional records with several attributes. Each record represents a single transaction and includes the distance from the transaction location to the individual’s residence, the distance from the current location to the previous transaction location, and the ratio of the transaction amount to the median purchase price. It also includes binary indicators of whether the transaction involved a repeat retailer, whether a chip or PIN was used for authentication, and whether the purchase was completed online. In addition, a binary flag marks potential fraud. These features provide insight into transactional behavior, authentication procedures, and potential fraudulent actions.

The third dataset [47] contains transactional data with multiple attributes per transaction. Each record includes details such as the merchant ID, the transaction amount, and whether the transaction was declined or fraudulent. It also indicates whether the transaction occurred in a foreign or high-risk country, the average daily transaction amount, the number of declines per day, and chargeback-related metrics such as the daily average chargeback amount and the 6-month average chargeback amount. These attributes offer a comprehensive picture of transactional behavior, fraud indicators, and chargeback patterns.

The CausalNex library’s from_pandas method was used to construct an SCM for each dataset, producing the structures shown in Figure 8. These visual representations illustrate the relationships present in the data. Each SCM is essential for training the corresponding BN, since it defines the relationships over which inference is performed.

Figure 8:

(A) SCM of the first new dataset; (B) SCM of the second new dataset; and (C) SCM of the third new dataset. SCM, structural causal model.

Following training and testing, each model was analyzed to assess its efficacy and efficiency.
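The per-model figures reported in Table 6 can be produced, for example, with CausalNex’s built-in classification report combined with scikit-learn’s accuracy score, as in the sketch below; the target node name and variables are placeholders rather than the study’s actual identifiers.

```python
from causalnex.evaluation import classification_report
from sklearn.metrics import accuracy_score

# Per-class precision, recall, and F1 score for the (hypothetical) target node.
report = classification_report(bn, test, "fraud_bool")
print(report)

# Overall accuracy: compare the single predicted column against the ground truth.
y_pred = bn.predict(test, "fraud_bool").iloc[:, 0]
print("accuracy:", accuracy_score(test["fraud_bool"], y_pred))
```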

The findings in Table 6 show that the CausalNex model performs well when trained on the new data. By contrast, the initial models tested on the NeurIPS dataset performed poorly because substantial relationships between the features could not be established. The ability of the CausalNex model to discover and integrate such critical links explains the performance gap between the two sets of models. This highlights the importance of capturing all relevant causal links within the data to ensure the robustness and effectiveness of predictive models, an insight that can guide future modeling efforts toward more precise and reliable results.

Table 6:

Evaluation matrix of the BN for new data

              No. of columns    Accuracy (%)    Precision (%)    Recall (%)    F1 score (%)
New Data 1    10                90.36           90.69            91.67         90.32
New Data 2    8                 78.4            78.4             78.4          78.4
New Data 3    11                77.94           77.94            77.94         77.94

BNs, Bayesian networks.

e.
Summary of results

The findings of the comparison study shed light on the efficacy and interpretability of DLE approaches and Causal AI for fraud detection. The findings show that, while deep learning models are highly accurate in detecting fraud, their lack of interpretability makes it difficult to grasp the decision-making process. However, approaches like LIME and KernelSHAP provide useful insights into feature relevance and model predictions, increasing transparency and assisting with model interpretation. Furthermore, CausalNex uses domain knowledge to create causal linkages, resulting in a better understanding of the elements that influence fraudulent conduct. Overall, the findings indicate that combining deep learning models with explainability approaches and Causal AI can improve the performance and interpretability of fraud detection systems.

VI.
Discussion
a.
Main findings of the present study

The study examined the efficacy and interpretability of DLE approaches and Causal AI in fraud detection. The findings indicate that, while deep learning models excel at detecting fraud, their lack of interpretability challenges understanding of the decision-making process. However, approaches like LIME and KernelSHAP provide useful insights into feature significance and model predictions, improving transparency and facilitating model understanding. Furthermore, the study revealed the efficacy of Causal AI techniques, specifically CausalNex, in utilizing domain knowledge to construct causal linkages, resulting in a better understanding of the factors impacting fraudulent conduct. Overall, the findings indicate that combining deep learning models with explainability approaches and Causal AI can improve the performance and interpretability of fraud detection systems.

b.
Comparison with other studies

The findings are consistent with previous research [48] showing a trade-off between model complexity and interpretability in fraud detection. In line with that work, the present results highlight the importance of explainability techniques in clarifying the decision-making mechanisms of deep learning models. However, this study offers fresh insights by focusing on Causal AI approaches, particularly CausalNex, which go beyond traditional methods by incorporating domain knowledge for causal inference in fraud detection.

c.
Implication and explanation of findings

The results have important consequences for fraud detection and its stakeholders. By emphasizing the importance of model interpretability and causal insights, the work supports the development of transparent and reliable fraud detection systems. The incorporation of Causal AI approaches such as CausalNex shows potential for improving fraud detection accuracy while also allowing stakeholders to make better-informed decisions. These implications extend beyond fraud detection to other fields where interpretability and causal reasoning are critical for ensuring model transparency and dependability.

d.
Strengths and limitations

A major strength of the paper is its thorough examination of various explainability methodologies and Causal AI algorithms in the context of fraud detection. The emphasis on integrating domain knowledge through CausalNex is particularly significant, as it represents a fresh and innovative way to improve interpretability and comprehension in fraud detection systems. The juxtaposition of diverse explainability approaches with Causal AI solutions provides useful insights into their respective contributions and efficacy.

However, it is important to recognize several limitations. First, the use of simulated data may limit the applicability of the findings to real-world settings, as the complexities and nuances of actual fraud scenarios may not be adequately captured. Second, the evaluation of Causal AI algorithms may be constrained by the availability and quality of domain knowledge, making it difficult to establish causal relationships in real-world data. Finally, the performance of BNs trained with CausalNex may be influenced by factors such as data quality, data quantity, and the causal model’s assumptions.

Future research should address these constraints by conducting experiments on a variety of real-world datasets, thereby increasing the external validity of the findings. Efforts should also be aimed at deepening the incorporation of domain knowledge into Causal AI algorithms and rigorously evaluating their performance in real-world fraud detection scenarios. Despite these limitations, the study represents a significant step forward in the use of Causal AI for fraud detection and adds to the continuing discussion about integrating explainable and Causal AI approaches in ML applications.

VII.
Conclusion

The study’s conclusion provides valuable insights into the landscape of classification algorithms, notably in the field of fraud detection. A comprehensive study of several models, including deep learning architectures, LIME, KernelSHAP, and the CausalNex BN, reveals nuanced strengths and limitations in each technique. Despite the deep learning models’ strong overall classification accuracy, the prevalence of FNs, particularly in fraud detection scenarios, highlights the critical need for improved interpretability and transparency. Techniques like LIME and KernelSHAP provide vital interpretability by explaining individual predictions and feature relevance, yielding actionable insights for decision makers. However, the limitations encountered with the CausalNex BN underscore the continued need to refine causal reasoning approaches in order to address data quality issues and model complexity effectively.

The scientific relevance of this study stems from its comprehensive comparison of several classification techniques, which sheds light on their respective capabilities and limitations in real-world applications, particularly in fraud detection. By outlining the trade-offs between model performance and interpretability, the findings provide practical guidance for both practitioners and researchers. Furthermore, identifying problems and opportunities for improvement, such as improving model interpretability and addressing data quality issues, paves the way for future research to advance the state of the art in AI-driven fraud detection systems.

Furthermore, this study sheds light on Causal AI in the context of fraud detection, which has previously received little attention. By incorporating and evaluating the CausalNex BN, the study demonstrates the potential of Causal AI to provide deeper understanding and explanations of fraud patterns than conventional deep learning models. This investigation into Causal AI emphasizes the importance of developing strong causal reasoning techniques to improve the interpretability and reliability of fraud detection algorithms.

These findings are applicable beyond fraud detection, offering vital insights into the larger landscape of classification approaches and their implications for transparent and reliable AI systems in a variety of fields. Furthermore, the study emphasizes the value of interdisciplinary collaboration among AI academics, domain experts, and industry stakeholders in addressing difficult challenges efficiently.

In conclusion, this study contributes to the ongoing discourse on the intersection of AI and fraud detection by presenting empirical evidence, critical analysis, and practical ideas. It aims to catalyze advancements in the development of transparent, reliable, and effective AI systems by highlighting the strengths, limitations, and future research directions of various classification algorithms, with profound implications for sectors reliant on fraud detection technologies.
