Optimizing Banking Sector Loan Fraud Detection through Machine Learning Methods

doi:10.21203/rs.3.rs-7358678/v1

2025 · doi:10.21203/rs.3.rs-7358678/v1

preprint OA: closed

Research Square preprint

Saddam Bekhet, Fahd Sabry Esmail, Amal Elsayed Aboutabl, Amr M. Abdelaziz, and 1 more

This is a preprint; it has not been peer reviewed by a journal. https://doi.org/10.21203/rs.3.rs-7358678/v1

This work is licensed under a CC BY 4.0 License. Status: Posted. Version 1.

Abstract

Loan fraud is a persistent challenge for the financial sector, and detecting it is crucial to economic stability and customer trust. Predicting loan fraud is important for reducing the risk of crises like the 2008 subprime mortgage crisis. Moreover, the immense number of loan applicants makes it impossible for employees to perform this task manually, especially given the number of parameters that need to be investigated.
This paper proposes automated artificial intelligence models to detect loan fraud through predictive analysis techniques, with an emphasis on the integration of neural networks and deep learning techniques. The approach combines an autoencoder-based architecture with gradient boosting to detect fraudulent activity. The model was applied to an online dataset from Kaggle, which contains 100,000 credit loan transactions. The combined models achieve the highest accuracy among the models compared. This highlights the effectiveness of combining deep learning techniques with the autoencoder-based architecture for fraud detection. Additionally, the promising results show that reducing the dimensionality of the feature space enhances the accuracy of the proposed models.

Subject areas: Physical sciences/Engineering; Physical sciences/Mathematics and computing

Keywords: Loan Fraud Detection, Neural Network, Deep Learning, Autoencoder, Gradient Boosting, Data Mining Techniques

1. Introduction

In recent times, instances of fraudulent activity have increased in both corporate entities and the banking sector. The proliferation of information technology has played a significant role in this evolution, creating disruptions across various industries. Corporations and organizations face financial crime and fraud as part of their routine operations. The landscape of fraudulent activities is expanding continuously, leading to escalating costs and heightened expectations from clients. The consequences of fraud include financial losses, increased expenses related to investigations and court cases, erosion of customer confidence, and damage to brand reputation. Fraud has become a formidable challenge in the corporate world. Before the advent of information technology, manual methods were the only means to identify fraudsters.
However, these methods were limited in their effectiveness, being both simplistic and time-consuming [1–2]. Contemporary society witnesses various types of fraud in financial operations, such as fraudulent practices in bank loan administration and credit processes [3–4]. To address this issue promptly, sophisticated modern technology is indispensable. While existing bank credit management and fraud detection procedures aim to minimize false alarms, they often fall short of the required precision. Factors like undefined fraud parameters, insufficient data, and the presence of fraudulent duplicates negatively impact prediction accuracy [5–6]. Fraud, as distinct from a mere mistake, involves individuals engaging in illegal activities for personal gain at the expense of legitimate entities or individuals. The consequences of financial sabotage extend to the broader economy, encompassing activities such as money laundering, tax evasion, counterfeit fraud, credit card fraud, fraud in cooperative societies, and bank credit fraud [7–9]. These fraudulent activities adversely affect the country's economy, and though various measures have been attempted, their success has been mixed [10–11].

The current increase in loan applications is remarkable, as many people take out loans for various reasons [12–14]. Research shows that borrowers who do not repay impair others' access to bank and other loans [15–16]. Inefficient bank debt management causes financial crises when risky and unsuitable lending persists, causing losses for banks and lending organizations [14]. Since credit risk is a major issue in banking, loan defaulters and the dangers they pose have attracted attention [17–18]. Alarmingly, the rate of loan defaults causes financial losses for banks [19], resulting in the closure of several banks and subsequent job losses in the banking sector and related industries.
When a borrower fails to fulfill the terms of a loan agreement, it constitutes a breach of the law. A loan on which a defaulter has stopped paying the agreed installments is referred to as a non-performing debt, depending on the type of loan and the time elapsed since nonpayment (usually 90 or 180 days). The classification of a loan as non-performing is contingent on the provisions outlined in existing agreements and promissory notes.

This study applies data mining techniques to past loan data to anticipate fraud in the management of bank loans and to prevent loan defaults, a task that is difficult for credit officers to perform manually. Data mining reveals hidden patterns that are not easily discerned through traditional or statistical approaches, whose accuracy is limited by the vast and diverse nature of the data. By employing data mining techniques, it becomes easier to differentiate between borrowers who promptly repay loans and those who do not. Assessment of a borrower's purchasing habits and reliability aids in predicting defaults and determining creditworthiness. Banks can use data mining methods to profile and assess the risk of loan fraud across their branches, leveraging critical loan details such as lender and borrower names, loan location, and modifications to the loan. This data serves as the initial checkpoint for identifying irregularities in the loan application process. Banks are obligated to follow precise instructions, including verifying compliance with credit requirements and confirming that the payback account genuinely belongs to the credit owner. Accurate prediction of the factors influencing repayment is vital. A comprehensive validation of personal information, including income, employment, identity, and credit situation, is crucial in determining whether to post this information on the borrowing homepage.
Predicting future repayment behavior through borrower verification is imperative, as lending institutions assign scores to borrowers based on an assessment procedure. This study emphasizes the significance of credit status in predicting defaults, focusing on borrower certifications such as credit score, job, income, and housing.

Section 2 reviews the literature on loan fraud and data mining-based prediction models. Section 3 describes the dataset and the data mining algorithms used to create the predictive model, Section 4 presents the results and the proposed model, and Section 5 discusses the statistical measures used for model selection. Section 6 summarizes the study's contributions and provides insights into addressing loan fraud in the banking sector. In summary, this work addresses the detection of loan fraud, employing various data mining methods to create a model that helps the banking sector understand the underlying causes of loan fraud. The proposed model serves as a warning to financial decision-makers, helping prevent loan fraud. The article presents novel findings that contribute to the existing literature in this field.

2. Related work

In this section we examine sequential fraud detection models and data mining methods, reviewing studies based on credit applications and transaction records. The central challenge is binary classification, in which credit transactions are either legitimate or fraudulent. In the realm of predictive analytics, data mining plays a pivotal role. Defined as the process of extracting information from large datasets, data mining is an integral part of knowledge discovery in databases (KDD). Although closely associated with KDD, data mining functions independently, aiming to identify patterns across various data mining tasks. Data mining techniques use datasets as the foundation for creating models, with algorithms learning from these datasets to anticipate the outcomes of specific inputs [20].
This knowledge acquisition process does not affect how data is stored on workstations but does influence their operations to facilitate future improvements [21]. The methodology in [22] analyses bank loan applicants' default risk with high accuracy and precision. The work in [23] uses decision trees to predict key credibility traits in order to identify reputable bank loan applicants. The study in [24] uses banking sector data to create a loan status prediction model with classification algorithms such as J48, Bayes Net, and Naive Bayes, with J48 being the most accurate. The authors of [25] use an upgraded multi-dimensional risk prediction clustering algorithm and association rules to detect unsuitable primary and secondary loan applicants. In [26], a decision tree model performs classification and a genetic technique selects features, with results confirmed in Weka. The work in [27] introduces two data mining models for credit scoring that aid Jordanian banks in credit decisions, demonstrating the regression model's superior accuracy over the radial function model. The study in [28] develops multilayer-based credit scoring models that outperform logistic regression methods, with the neural network model proving the most effective. The work in [29] compares credit scoring models using support vector machines, revealing that broad-definition models outperform restricted-definition models. The study in [30] conducts financial data analysis using methods such as Bayes classification, decision trees, random forests, and boosting, and also incorporates multilayer perceptrons, logistic regression, neural networks, and support vector machines. The analysis reports exceptional performance. The work in [31] presents a discrete survival model for the Italian banking system to investigate failure risk and provides experimental validation.
The study in [32] uses client information from the business sector of a retail bank, together with rural and urban Bangladeshi records, to predict retail banking business locations. The study uses Weka for decision tree data mining. Another study uses decision trees (DT), artificial neural networks (ANN), and feature selection approaches, including the genetic algorithm (GA) and principal component analysis (PCA), to create a credit rating model for Sudanese banks. Using German and Sudanese credit datasets, ANN outperforms DT in most circumstances and GA outperforms PCA in feature selection. ANN surpasses DT, GA-DT, and PCA-DT on the German dataset with 80.67 percent accuracy. The authors of [33] proposed an approach for classifying credit risk in the banking sector based on data mining. The data used in this model originated from the banking sector, with the specific aim of forecasting loan conditions. The models were created using three distinct algorithms: J48, BayesNet, and Naive Bayes. Implementation and testing were carried out in Weka. The study's outcomes revealed that, in terms of accuracy, J48 performed the best. This investigation examined the prediction accuracy of five credit risk classifiers under various types of interference, exploring how classifier ensembles could enhance precision.

However, a bigger difficulty is dealing with shifts in client behavior and fraudsters' ability to adapt and generate new patterns. Models designed for fraud identification can assist in such cases by detecting anomalies through unsupervised learning approaches. Carcillo et al. introduced a hybrid strategy in 2019 [34] that combines supervised and unsupervised methods to enhance fraud identification accuracy.
The approach involves examining and evaluating unsupervised anomaly scores generated from a real labeled credit card fraud dataset at different levels of granularity. Experimental results support the effectiveness of this combination, which also improves identification precision. Economic fraud has become a significant concern for the financial system, particularly in internet transactions. Data mining is one technique employed to identify credit card fraud. Identifying credit card theft is challenging because the features of fraudulent and legitimate behavior change over time and the datasets are heavily skewed. In 2020, Bagga et al. [35] established a framework to analyze the effectiveness of various techniques on credit card fraud data. Techniques included ensemble learning, multilayer perceptrons, random forests, Naive Bayes, KNN, AdaBoost, and logistic regression. The efficacy of fraud detection is influenced by the factors and techniques employed. The study aimed to assist the banking sector in leveraging loan data for credit analysis, providing valuable information to the credit decision-making process. The research supports the banking industry by reducing the time and financial resources spent on loan reviews. Additionally, it reduces the exposure of loan authorities by furnishing them with insights gleaned from previous loan data through data mining methods.

3. Methodology

A. Data sets

The dataset comes from Kaggle, a platform for machine learning practitioners and data scientists, and contains data from credit loan transactions. Five basic data mining approaches are used to compare accuracy. The collection contains 100,000 transaction-related customer records. All dataset attributes are precise, integer-based, and multivariate. Principal component analysis and the autoencoder take continuous (numerical) variables as input. Table 1 lists the 10 input parameters on which the model is trained and tested.
A hybrid oversampling and undersampling approach creates two distribution groups to handle dataset imbalances. The experimental credit loan fraud detection setup uses SAS Enterprise Miner.

Table 1 Descriptions of features for detecting loan fraud

| Feature | Description |
|---|---|
| Financial Report | Creditworthiness |
| Salary Per Year | Annual salary of the debtor |
| Term | Words used to describe different types of loans, such as short-term and long-term |
| Loan Amount at Present Time | The present loan's amount |
| Purpose | The goals of obtaining a loan: consolidating debt, purchasing a vehicle, funding higher education, paying for expenses like weddings and vacations, starting a small business, making a large purchase, and more |
| Maximum Open Credit | Maximum allowable credit |
| Extensive Credit Record | Duration of the credit report |
| The Total Amount Due on Credit Currently | The sum total of all outstanding credit card balances |
| Buying a House | Form of home ownership: mortgage, outright purchase, or rental |
| There Are Several Credit Issues | Several credit problems |

B. Related Methods

Data mining is the process of discovering relationships, trends, and outliers within large databases in order to forecast outcomes. Various approaches can be employed to identify instances of loan fraud and reduce the associated risks. We have applied and analyzed five data mining algorithms.

Decision trees are a popular method for data mining classification problems. Using supervised learning, this technique trains the model to classify objects according to the types of data given to it. Picking splits strategically is crucial to a tree's accuracy. The goal of using criteria such as Gini to split nodes into sub-nodes is to increase homogeneity: with each split, a node's purity with respect to the target variable grows [25].
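The Gini-based split selection described above for decision trees can be illustrated with a minimal sketch. The function names, the toy credit scores, and the labels below are hypothetical and are not taken from the paper's dataset or from SAS Enterprise Miner; the sketch only shows how a weighted Gini impurity can rank candidate thresholds on a single numeric feature.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def best_threshold(values, labels):
    """Return (threshold, weighted_gini) for the best split `value <= threshold`.

    If no split improves on the unsplit node, the threshold is None.
    """
    n = len(values)
    best = (None, gini_impurity(labels))
    for t in sorted(set(values))[:-1]:  # the largest value cannot split the data
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        weighted = (len(left) * gini_impurity(left)
                    + len(right) * gini_impurity(right)) / n
        if weighted < best[1]:
            best = (t, weighted)
    return best


# Hypothetical toy data (assumption): credit scores and fraud labels.
scores = [580, 600, 640, 700, 720, 750]
labels = ["fraud", "fraud", "fraud", "ok", "ok", "ok"]
threshold, impurity = best_threshold(scores, labels)
print(threshold, impurity)
```

On this toy data the split at 640 separates the two classes completely, so both child nodes are pure and the weighted impurity drops from 0.5 to 0.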
The goal is to pick the split that produces the most homogeneous sub-nodes while taking all available features into account (see Fig. 1).

The random forest (RF) is an ensemble classifier that combines several decision trees. The goal of using multiple trees is to train them effectively, with each tree adding structure to the model (see Fig. 2). After the trees are built, their results are combined. Random forests can be applied to both classification and regression problems [36].

Multilayer perceptron (MLP) classifiers are feedforward artificial neural networks that map input data to a set of outputs. The network is organized into three kinds of layers: input, hidden, and output. The input layer receives the signal to be processed [37]. MLPs are trained with the backpropagation algorithm. Figure 3 shows an example of a neural network with an MLP architecture. The hidden layer is essential for classifying intricate datasets.

A Bayesian network is a probabilistic model that applies Bayes' theorem to analyze the conditional dependency structure of a collection of random variables: P(A|B) = P(B|A) P(A) / P(B), where A and B are two events, P(A|B) is the probability of A given that B is true, P(B|A) is the probability of B given that A is true, and P(A) and P(B) are the independent (marginal) probabilities of A and B, respectively [38]. Training Bayesian classifiers is generally faster than training other methods, although learning the network structure can take more time.

Like earlier boosting methods, gradient boosting builds a statistical prediction model in stages, and it generalizes them by allowing arbitrary differentiable loss functions to be optimized. Gradient boosting is a common component of algorithms used for regression and classification [39].
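The staged nature of gradient boosting can be shown with a small sketch. This is not the SAS Enterprise Miner implementation used in the study. It assumes squared-error loss (for which the negative gradient is simply the residual), one-split regression stumps as weak learners, a single numeric feature, and hypothetical toy data; `fit_stump`, `gradient_boost`, `rounds`, and `learning_rate` are illustrative names.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression stump that minimizes squared error."""
    best = None
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, left_mean, right_mean)
    _, t, left_mean, right_mean = best
    return lambda x: left_mean if x <= t else right_mean


def gradient_boost(xs, ys, rounds=20, learning_rate=0.5):
    """Additive model: a constant plus a weighted sum of stumps."""
    base = sum(ys) / len(ys)
    stumps = []
    preds = [base] * len(xs)
    for _ in range(rounds):
        # For the loss 0.5 * (y - F(x))^2 the negative gradient is y - F(x).
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


# Hypothetical toy data (assumption): one numeric feature, 0/1 target.
xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]
model = gradient_boost(xs, ys)
print(round(model(1), 4), round(model(6), 4))
```

Each round fits a weak learner to what the current model still gets wrong, and the learning rate controls how much of that correction is added. For classification, a log-loss gradient would replace the residual.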
Given a training dataset D = {(x_i, y_i)}, i = 1, ..., N, the aim of gradient boosting is to construct an approximation of the function F*(x) that maps instances x to their corresponding output values y. The approximation is chosen to minimize the expected value of a specified loss function L(y, F(x)). Gradient boosting builds an additive approximation of F*(x) as a weighted sum of functions, as illustrated in Fig. 4.

It was also essential to examine when to use PCA and when to use an autoencoder, two widely employed dimensionality reduction methods in machine learning research. PCA is restricted to linear projections, whereas an autoencoder can also model non-linear surfaces, which marks a significant distinction between the two methods.

An autoencoder is an unsupervised deep learning method. It is a straightforward feed-forward network in which the number of inputs equals the number of outputs. The lower-dimensional code, obtained from the higher-dimensional input, can faithfully reproduce the input. This code is often referred to as a latent space representation and serves as a condensed representation of knowledge. Since the network is used primarily for input reconstruction, training must aim for maximum reconstruction precision [40]. The architecture includes both an encoder and a decoder, as illustrated in Fig. 5, which shows a typical autoencoder. The encoder and decoder are similar, essentially being counterparts of each other. The encoder, comprising convolutional blocks and associated components, takes the input to the autoencoder. Its node count is identical to the number of features in the dataset. The output of the encoder is passed to the compressed layer, commonly referred to as the latent space.

One way to reduce dimensionality is principal component analysis (PCA), which projects the data onto a reasonably compact set of lower-order dimensions.
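As a rough illustration of principal component analysis, the sketch below centers a small dataset, forms its covariance matrix, and finds the first principal component by power iteration. The data and helper names are hypothetical, and power iteration is a stand-in chosen to keep the example dependency-free; it is not the procedure used by SAS Enterprise Miner, and production code would normally use a linear algebra library.

```python
import math


def center(rows):
    """Subtract the per-column mean from every row."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def first_component(cov, iterations=200):
    """Leading eigenvector of the covariance matrix via power iteration.

    Assumes the start vector is not orthogonal to the leading eigenvector.
    """
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def project(centered, v):
    """One-dimensional scores: each centered row projected onto v."""
    return [sum(a * b for a, b in zip(r, v)) for r in centered]


# Hypothetical toy data (assumption): two perfectly correlated features.
rows = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
centered = center(rows)
v = first_component(covariance(centered))
scores = project(centered, v)
print([round(x, 4) for x in v], [round(s, 4) for s in scores])
```

Because the second feature is exactly twice the first, a single component along the direction (1, 2) captures all of the variance, so the two-dimensional data can be replaced by the one-dimensional scores without loss.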
Reducing the number of features or dimensions helps mitigate overfitting and makes the relationships between variables easier to see. When minimizing the number of features, it is crucial to lose as little information as possible. This is why PCA was developed; it is very good at reducing the number of dimensions while keeping the information in the data largely intact [26]. The principal component analysis (PCA) algorithm is shown in Fig. 6.

Ensemble model. To improve overall performance, ensemble strategies combine different models. In majority voting ensemble models, the final prediction for each test case is the output of the model class with the most votes. To create a new model, the ensemble applies a function to the posterior probabilities or predicted values of the different models. The majority voting algorithm is shown in Fig. 7.

4. Results and the suggested model

A number of data mining approaches have been used, as shown in Fig. 8 of the proposed model. These techniques include ensemble models, decision trees, gradient boosting, autoencoders, Bayesian networks, and random forests. The best data mining algorithms were identified using a variety of performance metrics, including F1 score, ROC separation, accuracy, cumulative lift, and lift. Tackling the problem of malicious entity identification involves several steps: preprocessing, modification, modelling, and evaluation. In the preprocessing stage, the data was cleaned to ensure it met quality standards. In the modification step, we used a variety of strategies to reduce the dimensionality of the initial feature space. We then built the models and evaluated their quality, which allowed us to choose the best model and establish some baseline characteristics for fraud detection.
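The majority-voting rule described for ensemble models can be sketched as follows. The model outputs are hypothetical and `majority_vote` is an illustrative name. Ties are broken in favour of the label encountered first, which is an arbitrary choice made here for simplicity.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine per-model label lists into one list of majority labels.

    `predictions_per_model` holds one list of labels per model, all of the
    same length (one label per test case).
    """
    n_cases = len(predictions_per_model[0])
    final = []
    for i in range(n_cases):
        votes = Counter(preds[i] for preds in predictions_per_model)
        # most_common(1) returns the label with the highest vote count;
        # on a tie it returns the label that was counted first.
        final.append(votes.most_common(1)[0][0])
    return final


# Hypothetical predictions (assumption) from three models on three cases.
tree_preds = ["fraud", "ok", "ok"]
network_preds = ["fraud", "fraud", "ok"]
boosting_preds = ["ok", "fraud", "ok"]
combined = majority_vote([tree_preds, network_preds, boosting_preds])
print(combined)
```

An alternative mentioned in the text is to combine posterior probabilities rather than labels, for example by averaging them before applying a cut-off.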
We used the k-nearest neighbors method to fill gaps in the data, the support vector machine method to remove anomalies, and statistical analysis to remove outliers. Decision trees, artificial neural networks, gradient boosting, random forests, Bayesian networks, and ensemble methods combining decision trees and neural networks were used to solve the classification problem. SAS Enterprise Miner was used to implement the suggested classification approach. To create new features, SAS Enterprise Miner provides two methods via its "Feature Extraction" node: principal component analysis (PCA) and autoencoder. Both procedures were applied and their results compared. After the models were trained on the training data, hyperparameters were adjusted using the validation set. Table 2 displays performance metrics for the different approaches.

Table 2 Several algorithms' performance metrics

| Model Name | Accuracy | False Positive Rate | F1 Score | Area Under ROC |
|---|---|---|---|---|
| Neural Network | 0.8202 | 0 | 0 | 0.52 |
| Gradient Boosting | 0.8606 | 0.032198 | 0.504996 | 0.89707 |
| Gradient Boosting and PCA | 1 | 0 | 1 | 1 |
| Autoencoder and Gradient Boosting | 1 | 0 | 1 | 1 |
| Forest | 0.8270 | 0.005393 | 0.10426 | 0.827042 |
| Decision Tree | 0.8240 | 0.013544 | 0.130786 | 0.752008 |
| Ensemble | 0.822 | 0 | 0 | 0.759661 |
| Bayesian Network | 0.8033 | 0.119992 | 0.44835 | 0.759582 |

The performance measures are depicted visually in Fig. 9. For each classification model in our binary classification study, we report the accuracy, false positive rate, F1 score, and area under the ROC curve.

A. Cumulative lift

To calculate the cumulative lift, we sort the observations in each partition from most likely to least likely according to the predicted probability of the target event. For the target Credit Status, this is the predicted probability of the event "Fraud", written P(Credit Status = fraud).
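A minimal sketch of how lift and cumulative lift are commonly computed from ranked scores is given below. It assumes the usual definitions (event rate within, or up to, a quantile divided by the overall event rate), 20 equal-sized quantiles, and hypothetical scores and labels; `lift_table` is an illustrative name, not a SAS Enterprise Miner function.

```python
def lift_table(scores, labels, n_bins=20):
    """Return (depth_percent, lift, cumulative_lift) for each quantile.

    `labels` are 1 for the event (fraud) and 0 otherwise; at least one
    event is assumed so that the baseline rate is non-zero.
    """
    ranked = [y for _, y in sorted(zip(scores, labels),
                                   key=lambda pair: pair[0], reverse=True)]
    n = len(ranked)
    base_rate = sum(ranked) / n
    rows, seen, events = [], 0, 0
    for b in range(n_bins):
        chunk = ranked[b * n // n_bins:(b + 1) * n // n_bins]
        if not chunk:
            continue
        seen += len(chunk)
        events += sum(chunk)
        lift = (sum(chunk) / len(chunk)) / base_rate
        cumulative_lift = (events / seen) / base_rate
        rows.append((100 * (b + 1) // n_bins, lift, cumulative_lift))
    return rows


# Hypothetical toy data (assumption): 20 scored cases, 4 of them fraud,
# and the model ranks all fraud cases at the top.
scores = [1 - i / 20 for i in range(20)]
labels = [1] * 4 + [0] * 16
table = lift_table(scores, labels)
print(table[1])   # depth 10
print(table[4])   # depth 25
print(table[-1])  # depth 100
```

With a 20% base rate, a quantile made up entirely of fraud cases has a lift of 5, and the cumulative lift always falls back to 1 at depth 100 because the whole dataset has, by definition, the baseline rate.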
The data is split into 20 quantiles (demideciles), each representing 5% of the data, and the total number of events is computed in each quantile. Cumulative lift values for several algorithms in the train and validation partitions are shown in Fig. 10. The cumulative lift for a quantile is the ratio of the number of events captured up to and including that quantile to the number of events that would be expected if events were distributed randomly (uniformly). Equivalently, it is the ratio of the cumulative response rate to the baseline response rate. The cumulative lift at depth 10 covers the first two quantiles and compares the top 10% of the ranked data with a randomly selected 10%. Cumulative lift therefore indicates how much more likely an event is in the top quantiles than among randomly selected observations.

B. Lift measure

To generate the lift measure, each partition is sorted in decreasing order of the predicted probability of the target event, P(Credit Status = fraud), the predicted probability that the fraud event occurs. With each quantile representing 5% of the data, we calculated the number of events in each of the 20 quantiles (demideciles). Lift is the ratio of the response percentage within a quantile to the baseline response percentage, that is, the percentage of events that would occur if events were distributed uniformly at random. Lift thus compares the event rate within each 5% quantile with the rate obtained by selecting observations at random, in contrast to cumulative lift, which aggregates over all quantiles up to the current depth. As can be seen in Fig. 11, different algorithms have different lift values in the train and validation partitions.

C. Sensitivity assessment

Using a variety of cut-off values, the ROC curve graphically displays the relationship between sensitivity and specificity; both quantities are obtained from the confusion matrix.
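To make the ROC construction concrete, the sketch below computes sensitivity and 1 - specificity over a grid of cut-offs and reports the largest gap between them. The scores and labels are hypothetical, the 0.05 grid mirrors the cut-off steps used elsewhere in the paper, and the function names are illustrative.

```python
def roc_points(scores, labels, cutoffs):
    """Return (cutoff, sensitivity, one_minus_specificity) per cut-off.

    A case is predicted as an event when its score is >= the cut-off.
    Both classes are assumed to be present in `labels`.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for c in cutoffs:
        tp = sum(1 for s, y in zip(scores, labels) if s >= c and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= c and y == 0)
        points.append((c, tp / positives, fp / negatives))
    return points


def max_separation(points):
    """Largest gap between sensitivity and 1 - specificity, and its cut-off."""
    best = max(points, key=lambda p: p[1] - p[2])
    return best[1] - best[2], best[0]


# Hypothetical toy data (assumption): predicted fraud probabilities and labels.
cutoffs = [i / 20 for i in range(21)]  # 0.00, 0.05, ..., 1.00
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 0, 1, 0, 0, 0]
points = roc_points(scores, labels, cutoffs)
separation, chosen_cutoff = max_separation(points)
print(separation, chosen_cutoff)
```

The maximum gap is 0 for a model that cannot separate the classes and 1 for a model that separates them perfectly. Several cut-offs can tie for the maximum, in which case this sketch returns the smallest one.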
The Kolmogorov-Smirnov (KS) reference line is drawn at the 1-specificity value where the validation partition shows the greatest difference between sensitivity and 1-specificity, which helps in choosing an appropriate cut-off for data assessment. The sensitivity measures for the different methods in the training and validation sets are displayed in Fig. 12, which shows that the values differ between algorithms and between partitions. In statistical terms, the Kolmogorov-Smirnov test compares the empirical distribution function of a sample with the cumulative distribution function of a reference distribution, or the empirical distribution functions of two samples with each other. The gap is considered significant when the p-value of the KS test is less than 0.05.

D. Accuracy

The accuracy metric, assessed across multiple threshold levels, is the percentage of observations that are correctly classified as events or nonevents. Cut-off values range from 0 to 1 in steps of 0.05. At every cut-off, the predicted target category is derived from P(Credit Status = fraud), the predicted probability of the fraud event. An observation is classified as an event if P(Credit Status = fraud) is greater than or equal to the cut-off value; otherwise it is classified as a nonevent. Correctly classified observations are the true positives (TP) and true negatives (TN); an observation whose predicted class differs from its actual class, a false positive (FP) or false negative (FN), is misclassified. Accuracy is computed as:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

The accuracy values for the training and validation partitions vary between methods, as Fig. 13 illustrates.

E. F1 score

The F1 score is a classification metric that is evaluated at different cut-off points and derived from the confusion matrix. It takes both precision and recall (sensitivity) into account: F1 = 2 × precision × recall / (precision + recall). Cut-off values range from 0 to 1 in increments of 0.05. At every cut-off, the predicted target category is derived from P(Credit Status = fraud): a prediction is positive when the predicted probability of the fraud event is greater than or equal to the cut-off value, and negative otherwise. The F1 score values for the training and validation partitions of the various techniques are displayed in Fig. 14.

5. Models Discussion

Tables 3–8 show the statistical criteria on which the optimal model for production deployment is selected. These measures are the Gini coefficient, misclassification rate, average squared error, root average squared error, multi-class log loss, and Kolmogorov-Smirnov (KS).

A. Gini coefficient

The Gini coefficient is a metric used to assess the level of discrimination within a population.

Table 3 Gini coefficient

| Name of Model | Train | Validate |
|---|---|---|
| Neural Network | 0 | 0 |
| Forest | 0.69341 | 0.65489 |
| Ensemble | 0.52656 | 0.51932 |
| Bayesian Network | 0.52201 | 0.51917 |
| Gradient Boosting | 0.83017 | 0.794035 |
| Autoencoder and Gradient Boosting | 1 | 1 |
| PCA and Gradient Boosting | 1 | 1 |
| Decision Tree | 0.51265 | 0.5041 |

The Gini coefficient ranges from 0 to 1, where 0 signifies perfect equality (no discrimination between classes) and 1 represents perfect discrimination [23]. For a classifier, a higher Gini is therefore better: in both partitions, the neural network's Gini of 0 indicates no discriminatory power, while the two hybrid gradient boosting models reach 1.

B. Misclassification rate

The misclassification rate is a performance metric that gives the proportion of incorrect predictions, without distinguishing between false positives and false negatives [24].
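The cut-off based metrics above can be tied together in a short sketch. The scores and labels are hypothetical and the function names are illustrative. The `gini_from_auc` helper assumes the common relation Gini = 2 × AUC - 1, which the paper does not state; the Gradient Boosting rows of Tables 2 and 3 (0.89707 and 0.794035) are approximately consistent with it.

```python
def confusion_counts(scores, labels, cutoff):
    """Count TP, FP, TN, FN when score >= cutoff means 'predict fraud'."""
    tp = fp = tn = fn = 0
    for s, y in zip(scores, labels):
        predicted = 1 if s >= cutoff else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def metrics(tp, fp, tn, fn):
    """Accuracy, misclassification rate, false positive rate, and F1."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": accuracy,
        "misclassification_rate": 1 - accuracy,
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
        "f1": f1,
    }


def gini_from_auc(auc):
    """Assumed relation between the Gini coefficient and the area under ROC."""
    return 2 * auc - 1


# Hypothetical toy data (assumption): predicted fraud probabilities and labels.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 0, 1, 0, 0, 0]
result = metrics(*confusion_counts(scores, labels, cutoff=0.5))
never_fraud = metrics(tp=0, fp=0, tn=80, fn=20)
print(result)
print(never_fraud)
print(round(gini_from_auc(0.89707), 5))
```

The `never_fraud` case shows how a model that never predicts the event can combine an accuracy equal to the non-event share with a false positive rate and F1 score of 0, which is the pattern of the Neural Network and Ensemble rows in Table 2.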
Table 4 Misclassification rate

| Name of Model | Train | Validate |
|---|---|---|
| Neural Network | 0.47041 | 0.47040 |
| Forest | 0.22146 | 0.22757 |
| Ensemble | 0.23166 | 0.23303 |
| Bayesian Network | 0.46404 | 0.47103 |
| Gradient Boosting | 0.16720 | 0.18270 |
| Autoencoder and Gradient Boosting | 0.00273 | 0.00287 |
| PCA and Gradient Boosting | 0.00264 | 0.00290 |
| Decision Tree | 0.23036 | 0.23147 |

In the validation partition, the Autoencoder and Gradient Boosting model and the PCA and Gradient Boosting model exhibit far lower misclassification rates than the other models.

C. Average Square Error (ASE)

A model is considered perfect when its Average Square Error (ASE) is zero; the lower the value, the better the model's performance [25, 26].

Table 5 Average Square Error

| Name of Model | Train | Validate |
|---|---|---|
| Neural Network | 0.11569 | 0.11569 |
| Forest | 0.03885 | 0.04218 |
| Ensemble | 0.01399 | 0.01405 |
| Bayesian Network | 0.00213 | 0.00222 |
| Gradient Boosting | 0.05503 | 0.05535 |
| Autoencoder and Gradient Boosting | 0.04985 | 0.05193 |
| PCA and Gradient Boosting | 0.07020 | 0.07035 |
| Decision Tree | 0.10308 | 0.10435 |

Root average squared error (RASE), the square root of ASE, assigns additional weight to large errors.

Table 6 Root average squared error

| Name of Model | Train | Validate |
|---|---|---|
| Neural Network | 0.34014 | 0.34013 |
| Forest | 0.19711 | 0.20538 |
| Ensemble | 0.11827 | 0.11855 |
| Bayesian Network | 0.04611 | 0.04707 |
| Gradient Boosting | 0.23458 | 0.23526 |
| Autoencoder and Gradient Boosting | 0.22326 | 0.22788 |
| PCA and Gradient Boosting | 0.26495 | 0.26524 |
| Decision Tree | 0.32107 | 0.32303 |

Judged by the values in Tables 5 and 6, the lowest validation ASE and RASE are recorded by the Bayesian Network (0.00222 and 0.04707), followed by the Ensemble; the PCA and Gradient Boosting model records 0.07035 and 0.26524.

D. Multi-Class Log Loss

Multi-class log loss is the multi-class extension of log loss: the log loss values of the individual class predictions are summed.
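The error-based measures in this section can be sketched as follows. The predicted probabilities and labels are hypothetical, and the log loss shown is the binary form averaged over observations, which may differ in normalization from the multi-class value reported by SAS Enterprise Miner. The function names are illustrative.

```python
import math


def average_squared_error(probs, labels):
    """Mean of (predicted probability - actual 0/1 label) squared."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)


def root_average_squared_error(probs, labels):
    """Square root of the average squared error."""
    return math.sqrt(average_squared_error(probs, labels))


def log_loss(probs, labels, eps=1e-15):
    """Binary log loss averaged over observations.

    Probabilities are clipped to [eps, 1 - eps] to avoid log(0).
    """
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)


# Hypothetical toy data (assumption): predicted fraud probabilities and labels.
probs = [0.9, 0.1, 0.8, 0.3]
labels = [1, 0, 1, 0]
ase = average_squared_error(probs, labels)
rase = root_average_squared_error(probs, labels)
loss = log_loss(probs, labels)
print(round(ase, 4), round(rase, 4), round(loss, 4))
```

The square-root relation can be checked against the tables: the Neural Network's validation ASE of 0.11569 in Table 5 has a square root of about 0.34013, the value listed in Table 6.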
Table 7. Multi-class log loss

| Name of Model | Train | Validate |
|---|---|---|
| Neural Network | 1.37291 | 1.37281 |
| Forest | 0.38279 | 0.41653 |
| Ensemble | 0.30525 | 0.30596 |
| Bayesian Network | 0.09472 | 0.09574 |
| Gradient Boosting | 0.53605 | 0.54760 |
| Autoencoder and Gradient Boosting | 0.48040 | 0.50637 |
| PCA and Gradient Boosting | 0.83034 | 0.83315 |
| Decision Tree | 0.93514 | 0.95596 |

Lower Multi-Class Log Loss (MCLL) values indicate a better fit, and values below 0.1 indicate a good fit. In Table 7, only the Bayesian Network falls below this threshold (0.09574 on the validation partition).

E. Kolmogorov-Smirnov (KS)

The Kolmogorov-Smirnov test is generally used to compare distributions, for example to check whether a sample is normally distributed. As a model-assessment measure, reported here as KS (Youden), it quantifies how well the model separates events from non-events.

Table 8. KS (Youden)

| Name of Model | Train | Validate |
|---|---|---|
| Neural Network | 0.0000 | 0.0000 |
| Gradient Boosting | 0.6784 | 0.6496 |
| Autoencoder and Gradient Boosting | 1.0000 | 1.0000 |
| PCA and Gradient Boosting | 1.0000 | 1.0000 |
| Decision Tree | 0.4877 | 0.4821 |
| Forest | 0.5498 | 0.5196 |
| Ensemble | 0.4877 | 0.4818 |
| Bayesian Network | 0.4728 | 0.4719 |

The Kolmogorov-Smirnov statistic measures the greatest difference between the cumulative distributions of events and non-events, where 1 signifies the best separation and 0 the worst. In both partitions, the Autoencoder and Gradient Boosting model and the PCA and Gradient Boosting model reach a KS of 1, which marks them as the superior models on this measure.

6. Conclusion

The recent uptick in borrower repayment defaults, which has led to a dramatic rise in bank loan fraud, has severely threatened the lending industry's viability. According to recent studies, current risk assessment techniques may miss important signals and risk indicators that affect repayment. Bank loan fraud has increased dramatically with the introduction of new technologies, costing financial institutions substantial sums every year. Installing a fraud detection system that monitors all transactions in real time is vital for preventing and deterring theft. To protect their reputations, financial institutions should invest in fraud management technology.
This will not only minimise risks but also lower fraud expenses. To address these loan fraud problems, an ensemble model combining different data mining techniques was used. The best data mining techniques were selected based on performance metrics such as accuracy, lift, cumulative lift, and F1 score. To further assist in selecting the best model for production, statistical measures were also considered, including average square error, misclassification rate, Gini coefficient, root average squared error, multi-class log loss, and Kolmogorov-Smirnov (KS). For the error-based measures, a smaller value indicates a better model. Nevertheless, no single model can currently be considered best for every business situation. Still, the chosen model worked well for most purposes and produced the expected results. Finally, the best models for identifying loan fraud were the autoencoder and the gradient-boosting classifier, which achieved the greatest accuracy. Based on these results, autoencoders and gradient boosting could greatly improve the classification-based fraud detection strategies used in the banking sector today.

Funding

The authors extend their appreciation to the Deanship of Research and Graduate Studies at King Khalid University for funding this work through the Small Research Project under grant number RGP1/38/46.

Ethics

The authors declare that the work presented in this manuscript is original and has been conducted in accordance with ethical standards. No human participants, animals, or identifiable data were involved in this study. All data used are publicly available on the Kaggle website, and no ethical approval was required.
Abbreviations

DT: Decision Trees
ANN: Artificial Neural Networks
GA: Genetic Algorithm
PCA: Principal Component Analysis
RF: Random Forest
KS: Kolmogorov-Smirnov
ASE: Average Square Error
RASE: Root Average Squared Error
MCLL: Multi-Class Log Loss

Declarations

Author Contribution

The authors have contributed equally in the preparation of the manuscript, including: analysis, scientific formalization methodology, numerical experiments, result analysis, writing, and reviewing.

Data Availability

The data is publicly available on the Kaggle website.

References

1. Jurgovsky, Johannes, “Sequence classification for credit-card fraud detection,” Expert Systems with Applications, 2018, pp. 234-245.
2. Aishwarya, S. Dhivya, K. Devika Rani, “Online payment fraud prevention using cryptographic algorithm TDES,” Int J Comput Sci Mob Comput, Vol. 4, No. 4, 2015, pp. 317-323.
3. Agwu, M. Edwin, “Reputational risk impact of internal frauds on bank customers in Nigeria,” International Journal of Development and Management Review, Vol. 9, No. 1, 2014, pp. 175-192.
4. Agaba; Caleb, Tamwesigire; Eton, Marus, “Credit Risk Management Practices and Loan Performance of Commercial Banks in Uganda,” 2022.
5. Rashid, Md Abdur, “An overview of corporate fraud and its prevention approach,” Australasian Accounting Business & Finance Journal, Vol. 16, No. 1, 2022, pp. 101-118.
6. Thomas, Manju Susan; Mathew, Juby, “Supervised Machine Learning Model for Automating Continuous Internal Audit Workflow,” In: 2022 6th International Conference on Trends in Electronics and Informatics (ICOEI), IEEE, 2022, pp. 1200-1206.
7. Fahd Sabry Esmail, Fahad Kamal Alsheref, Amal Elsayed Aboutabl, “Review of Loan Fraud Detection Process in the Banking Sector Using Data Mining Techniques,” International Journal of Electrical and Computer Engineering Systems, Vol. 14, No. 2, 2023, pp. 229-239.
8. Moroke, Ntebogang Dinah; Makatjane, Katleho, “Predictive Modelling for Financial Fraud Detection Using Data Analytics: A Gradient-Boosting Decision Tree,” Applications of Machine Learning and Deep Learning for Privacy and Cybersecurity, IGI Global, 2022, pp. 25-45.
9. Tang, Jiali; Karim, Khondkar E., “Financial fraud detection and big data analytics–implications on auditors’ use of fraud brainstorming session,” Managerial Auditing Journal, Vol. 34, No. 3, 2019, pp. 324-337.
10. Mehmet, N. A. R., “The effects of behavioral economics on tax amnesty,” International Journal of Economics and Financial Issues, Vol. 5, No. 2, 2015, pp. 580-589.
11. Randhawa, Kuldeep, “Credit card fraud detection using AdaBoost and majority voting,” IEEE Access, 6, 2018, pp. 14277-14284.
12. Oreski, S., & Oreski, G., “Genetic algorithm-based heuristic for feature selection in credit risk assessment,” Expert Systems with Applications, Vol. 41, No. 4, 2014, pp. 2052-2064.
13. Yannelis, Constantine; Tracey, Greg, “Student loans and borrower outcomes,” Annual Review of Financial Economics, 14, 2022, pp. 167-186.
14. Metawa, N., Hassan, M. K., & Elhoseny, M., “Genetic algorithm based model for optimizing bank lending decisions,” Expert Systems with Applications, 80, 2017, pp. 75-82.
15. Nascimento, Paulo Augusto Meyer Mattos, “Modelling income contingent loans for higher education student financing in Brazil,” 2019.
16. Chapman, B., & Sinning, M., “Student loan reforms for German higher education: financing tuition fees,” Education Economics, Vol. 22, No. 6, 2014, pp. 569-588.
17. Infant Cyril, G. L., & Ananth, J. P., “Deep learning based loan eligibility prediction with Social Border Collie Optimization,” Kybernetes, 2022.
18. Akkoç, S., “An empirical comparison of conventional techniques, neural networks and the three-stage hybrid Adaptive Neuro-Fuzzy Inference System (ANFIS) model for credit scoring analysis: The case of Turkish credit card data,” European Journal of Operational Research, Vol. 222, No. 1, 2012, pp. 168-178.
19. Byanjankar, A., Heikkilä, M., & Mezei, J., “Predicting credit risk in peer-to-peer lending: A neural network approach,” In 2015 IEEE Symposium Series on Computational Intelligence, December 2015, pp. 719-725.
20. Imran, Imran, “Using machine learning algorithms for housing price prediction: the case of Islamabad housing data,” Soft Computing and Machine Intelligence, Vol. 1, No. 1, 2021, pp. 11-23.
21. Witten, H.; and Frank, E., “Data mining: practical machine learning tools and techniques,” San Francisco, CA: Morgan Kaufmann, 2017.
22. Sudhamathy G and Jothi Venkateswaran, “Analytics Using R for Predicting Credit Defaulters,” IEEE International Conference on Advances in Computer Applications (ICACA), 2016, pp. 66-71.
23. M. Sudhakar and C.V.K. Reddy, “Two Step Credit Risk Assessment Model For Retail Bank Loan Applications Using Decision Tree Data Mining Technique,” International Journal of Advanced Research in Computer Engineering & Technology (IJARCET), Vol. 5, No. 3, 2016, pp. 705-718.
24. J.H. Aboobyda and M.A. Tarig, “Developing Prediction Model Of Loan Risk In Banks Using Data Mining,” Machine Learning and Applications: An International Journal (MLAIJ), Vol. 3, No. 1, 2016, pp. 1-9.
25. K. Kavitha, “Clustering Loan Applicants based on Risk Percentage using K-Means Clustering Techniques,” International Journal of Advanced Research in Computer Science and Software Engineering, Vol. 6, No. 2, 2016, pp. 162-166.
26. Z. Somayyeh and M. Abdolkarim, “Natural Customer Ranking of Banks in Terms of Credit Risk by Using Data Mining A Case Study: Branches of Mellat Bank of Iran,” Jurnal UMP Social Sciences and Technology Management, Vol. 3, No. 2, 2015, pp. 307-316.
27. A.B. Hussain and F.K.E. Shorouq, “Credit risk assessment model for Jordanian commercial banks: Neural scoring approach,” Review of Development Finance, Elsevier, Vol. 4, 2014, pp. 20-28.
28. A. Blanco, R. Mejias, J. Lara, and S. Rayo, “Credit scoring models for the microfinance industry using neural networks: evidence from Peru,” Expert Systems with Applications, Vol. 40, 2013, pp. 356-364.
29. T. Harris, “Quantitative credit risk assessment using support vector machines: Broad versus Narrow default definitions,” Expert Systems with Applications, Vol. 40, 2013, pp. 4404-4413.
30. Dileep B. Desai, Dr. R.V. Kulkarni, “A Review: Application of Data Mining Tools in CRM for Selected Banks,” International Journal of Computer Science and Information Technologies (IJCSIT), Vol. 4, No. 2, 2013, pp. 199-201.
31. G. Francesca, “A Discrete-Time Hazard Model for Loans: Some Evidence from Italian Banking System,” American Journal of Applied Sciences, Vol. 9, No. 9, 2012, pp. 1337-1346.
32. Rafiqul, I; and Ahsan H., “A data mining approach to predict prospective business sector for lending in retail banking using decision tree,” International Journal of Data Mining & Knowledge Management Process, Vol. 5, No. 2, 2015, pp. 13-18.
33. Jafar, A. Mohammed T., “Developing prediction model of loan risk in banks using data mining,” Machine Learning and Applications: An International Journal, Vol. 3, No. 1, 2016, pp. 1-8.
34. F. Carcillo, Y.-A. Le Borgne, O. Caelen, Y. Kessaci, F. Oble, and G. Bontempi, “Combining unsupervised and supervised learning in credit card fraud detection,” Information Sciences, Vol. 557, 2021, pp. 317-331.
35. S. Bagga, A. Goyal, N. Gupta, and A. Goyal, “Credit card fraud detection using pipeling and ensemble learning,” Procedia Computer Science, Vol. 173, 2020, pp. 104-112.
36. Xuan, Shiyang, “Random forest for credit card fraud detection,” In: 2018 IEEE 15th International Conference on Networking, Sensing and Control (ICNSC), IEEE, 2018, pp. 1-6.
37. Jana, Dipak Kumar, “Optimization of effluents using artificial neural network and support vector regression in detergent industrial wastewater treatment,” Cleaner Chemical Engineering, Vol. 3, 2022, pp. 100039.
38. Km, Anil Kumar, “Detection of False Income Level Claims Using Machine Learning,” International Journal of Modern Education & Computer Science, Vol. 14, No. 1, 2022.
39. Yao, S., “Gradient boosted decision trees for combustion chemistry integration,” Applications in Energy and Combustion Science, Vol. 11, 2022, pp. 100077.
40. Abhaya, Abhaya; Patra, Bidyut Kr., “An efficient method for autoencoder based outlier detection,” Expert Systems with Applications, Vol. 213, 2023, pp. 118904.
41. Fahd Sabry Esmail, Fahad Kamal Alsheref, and Amal Elsayed Aboutabl, “Enhancing loan fraud detection process in the banking sector using data mining techniques,” Indonesian Journal of Electrical Engineering and Computer Science, Vol. 32, No. 2, 2023.

Additional Declarations

No competing interests reported.
used\u003c/p\u003e","description":"","filename":"13.png","url":"https://assets-eu.researchsquare.com/files/rs-7358678/v1/6297e121115d093be46085c7.png"},{"id":92719818,"identity":"53f751a0-6e4c-4fd2-b890-d4fa377520ab","added_by":"auto","created_at":"2025-10-03 13:22:21","extension":"png","order_by":14,"title":"Figure 14","display":"","copyAsset":false,"role":"figure","size":168182,"visible":true,"origin":"","legend":"\u003cp\u003eThe F1-score of the algorithms in use\u003c/p\u003e","description":"","filename":"14.png","url":"https://assets-eu.researchsquare.com/files/rs-7358678/v1/06412a44ce13bd3b545e9360.png"},{"id":102312439,"identity":"3d829a91-3891-4397-b23d-e19a3a3b5c1d","added_by":"auto","created_at":"2026-02-10 12:02:03","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":1959184,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-7358678/v1/ca247132-1155-4ed9-a9dd-6006322b5fc5.pdf"}],"financialInterests":"No competing interests reported.","formattedTitle":"Optimizing Banking Sector Loan Fraud Detection through Machine Learning Methods","fulltext":[{"header":"1. Introduction","content":"\u003cp\u003eIn recent times, instances of fraudulent activities have increased in both corporate entities and the banking sector. The proliferation of information technology has played a significant role in this evolution, creating disruptions across various industries. Corporations and organizations resort to financial crime and fraud as part of their routine operations. The landscape of fraudulent activities is expanding continuously, leading to escalating costs and heightened expectations from clients. The consequences of fraud include financial losses, increased expenses related to examinations and court cases, erosion of customer confidence, and damage to brand repute. 
It has become a formidable challenge in the corporate world.\u003c/p\u003e\u003cp\u003eBefore the advent of information technology, manual methods were the only means to identify fraudsters. However, these methods were limited in their effectiveness, being both simplistic and time-consuming [\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e]. Contemporary society witnesses various types of financial operations fraud, such as fraudulent practices in bank loan administration and credit processes [\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e]. To address this issue promptly, sophisticated modern technology is indispensable. While existing bank credit management and fraud detection procedures aim to minimize false alarms, they often fall short in achieving the required precision. Factors like undefined fraud parameters, insufficient data, and the presence of fraudulent duplicates negatively impact prediction accuracy [\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e]. Fraud, as distinct from a mere mistake, involves individuals engaging in illegal activities for personal gain at the expense of legitimate entities or individuals. The consequences of financial sabotage extend to the broader economy, encompassing activities such as money laundering, tax evasion, counterfeit fraud, credit card fraud, fraud in cooperative societies, and bank credit fraud [\u003cspan additionalcitationids=\"CR8\" citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e]. 
These fraudulent activities adversely affect the country's economy, and though various measures have been attempted, their success has been mixed [\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e].\u003c/p\u003e\u003cp\u003eThe current increase in loan applications is remarkable because many people use loans for various reasons [\u003cspan additionalcitationids=\"CR13\" citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e]. Research shows that non-payers impair others' access to bank and other loans. These actions prevent borrowers from getting loans [\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e]. Inefficient bank debt management causes financial crises when risky and unsuitable lending persists, leading to losses for banks and lending organizations [\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e]. Since credit risk is a major issue in banking, loan defaulters and their dangers have garnered attention [\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e].\u003c/p\u003e\u003cp\u003eAlarmingly, the rate of loan defaults causes financial losses for banks [\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e], resulting in the closure of several banks and subsequent job losses in the banking sector and related industries. When a borrower fails to fulfill the terms of a loan agreement, it constitutes a breach of the law. 
A loan on which a defaulter has stopped making the agreed installment repayments is referred to as a non-performing debt, depending on the type of loan and the time elapsed since nonpayment (usually 90 or 180 days). The classification of a loan as non-performing is contingent on the provisions outlined in existing agreements and promissory notes. This study applies data mining techniques to past loan data to anticipate fraud in the management of bank loans and stop loan defaults, a task that is difficult for credit officers to perform manually. Data mining reveals hidden patterns not easily discerned through traditional or statistical approaches, which have limited accuracy potential due to the vast and diverse nature of the data. By employing data mining techniques, it becomes easier to differentiate between borrowers who promptly repay loans and those who do not. Assessment of a borrower's purchasing habits and reliability aids in predicting defaults and determining creditworthiness.\u003c/p\u003e\u003cp\u003eBanks can utilize data mining methods to profile and assess the risk of loan fraud across their branches, leveraging critical loan details encompassing lender and borrower names, loan location, and modifications to the loan. This data serves as the initial checkpoint to identify irregularities in the loan application process. Banks are obligated to follow precise instructions, including verifying compliance with credit requirements and confirming genuine ownership of the payback account by the credit owner.\u003c/p\u003e\u003cp\u003eAccurate prediction of factors influencing repayment is vital. A comprehensive validation of personal information, including income, employment, identity, and credit situation, is crucial in determining whether to post this information on the borrowing homepage. 
Predicting future repayment behavior through borrower verification is imperative, as lending institutions assign scores to borrowers based on an assessment procedure.\u003c/p\u003e\u003cp\u003eThis study emphasizes the significance of credit status in predicting defaults, focusing on borrower certifications such as credit score, job, income, and housing. The literature review in the second section explores loan fraud and data mining-based prediction models. Section 3 identifies different data mining algorithms used to create the predictive model, while Section \u003cspan refid=\"Sec3\" class=\"InternalRef\"\u003e4\u003c/span\u003e discusses the research findings and the employed dataset. The conclusion in Section \u003cspan refid=\"Sec4\" class=\"InternalRef\"\u003e5\u003c/span\u003e summarizes the study's contributions and provides insights into addressing loan fraud in the banking sector.\u003c/p\u003e\u003cp\u003eIn summary, this work addresses the detection of loan fraud, employing various data mining methods to create a model that aids the banking sector in understanding the underlying causes of loan fraud. The proposed model serves as a warning to financial decision-makers, helping prevent loan fraud. The article presents novel findings, contributing to the existing literature in this field.\u003c/p\u003e"},{"header":"2. Related work","content":"\u003cp\u003eWe examine fraud detection sequential models and data mining methods in this section. We thoroughly review several credit applications and transaction records. The biggest challenge is binary classification, where credit transactions are legal or fraudulent.\u003c/p\u003e\n\u003cp\u003eIn the realm of predictive analytics, data mining plays a pivotal role. Defined as the process of extracting information from large datasets, data mining is an integral part of knowledge discovery in databases (KDD). 
Although closely associated with KDD, data mining functions independently, aiming to identify patterns in various data mining tasks. Data mining techniques use datasets as the foundation for creating models, with algorithms learning from these datasets to anticipate specific input outcomes [\u003cspan class=\"CitationRef\"\u003e20\u003c/span\u003e]. This knowledge acquisition process does not impact the storage of data on workstations but does influence their operations to facilitate future improvements [\u003cspan class=\"CitationRef\"\u003e21\u003c/span\u003e].\u003c/p\u003e\n\u003cp\u003eA methodology in [\u003cspan class=\"CitationRef\"\u003e22\u003c/span\u003e] analyses bank loan applicants\u0026apos; default risk with high accuracy and precision. [\u003cspan class=\"CitationRef\"\u003e23\u003c/span\u003e] uses decision trees to predict key credibility traits to identify reputable bank loan applicants. [\u003cspan class=\"CitationRef\"\u003e24\u003c/span\u003e] uses banking sector data to create a loan state prediction model using classification algorithms like J48, Bayes Net, and Naive Bayes, with J48 being the most accurate. [\u003cspan class=\"CitationRef\"\u003e25\u003c/span\u003e] uses an upgraded risk prediction clustering multi-dimensional algorithm and the Association Rule to detect unsuitable main and secondary loan applicants. In [\u003cspan class=\"CitationRef\"\u003e26\u003c/span\u003e], a decision tree model classifies and a genetic technique selects features, confirmed by Weka.\u003c/p\u003e\n\u003cp\u003e[\u003cspan class=\"CitationRef\"\u003e27\u003c/span\u003e] introduces two data mining models for credit scoring, aiding Jordanian banks in credit decisions, demonstrating the regression model\u0026apos;s superiority in accuracy over the radial function model. 
[\u003cspan class=\"CitationRef\"\u003e28\u003c/span\u003e] develops multilayer-based credit scoring models, outperforming logistic regression methods, with the neural network model proving the most effective. [\u003cspan class=\"CitationRef\"\u003e29\u003c/span\u003e] compares credit scoring models using support-vector machines, finding that broad-definition models outperform restricted-definition models. [\u003cspan class=\"CitationRef\"\u003e30\u003c/span\u003e] conducts financial data analysis using methods like Bayes classification, decision trees, random forests, boosting, and more, incorporating multilayer perceptrons, logistic regression, neural networks and support vector machines in the model. The analysis reveals exceptional performance.\u003c/p\u003e\n\u003cp\u003e[\u003cspan class=\"CitationRef\"\u003e31\u003c/span\u003e] presents a discrete survival model for the Italian banking system to investigate failure risk and provide experimental validation. [\u003cspan class=\"CitationRef\"\u003e32\u003c/span\u003e] uses client information from the business sector of a retail bank, as well as rural and urban Bangladeshi records, to predict retail banking business locations. The study uses Weka for decision tree data mining. Another study uses decision trees (DT), artificial neural networks (ANN), and feature selection approaches including the genetic algorithm (GA) and principal component analysis (PCA) to create a credit rating model for Sudanese banks. ANN outperforms DT in most circumstances and GA outperforms PCA in feature selection using German and Sudanese credit datasets. 
ANN surpasses DT, GA-DT, and PCA-DT on the German dataset with 80.67 percent accuracy.\u003c/p\u003e\n\u003cp\u003eThe authors of [\u003cspan class=\"CitationRef\"\u003e33\u003c/span\u003e] proposed an innovative approach for classifying credit risk in the banking sector, advocating the use of data mining. The data utilized in this model originated from the banking sector, specifically aiming to forecast loan conditions. The predictive models were created using three distinct algorithms: J48, BayesNet, and Naive Bayes. Implementation and testing were carried out using the Weka program. The study\u0026apos;s outcomes revealed that, in terms of accuracy, J48 performed the best. This investigation delved into the accuracy of predictions from five credit risk classifiers under various types of interference, exploring how classifier ensembles could enhance precision.\u003c/p\u003e\n\u003cp\u003eHowever, a bigger difficulty is dealing with client behavior shifts and fraudsters\u0026apos; ability to adapt and generate new patterns. Models designed for fraud identification can assist in such cases by detecting anomalies through unsupervised learning approaches. Carcillo et al. introduced a hybrid strategy in 2019 [\u003cspan class=\"CitationRef\"\u003e34\u003c/span\u003e] that combines supervised and unsupervised methods to enhance fraud identification accuracy. The approach involves examining and evaluating unsupervised anomaly scores generated from an actual labeled credit card fraud identification dataset at different levels of granularity. Experimental results support the effectiveness of this combination, which also improves identification precision.\u003c/p\u003e\n\u003cp\u003eEconomic fraud has become a significant concern impacting the financial system, particularly in internet transactions. Data mining is one technique employed to identify credit card fraud. 
Identifying credit card fraud is challenging due to changing features of fraudulent and legitimate behavior over time, coupled with heavily skewed datasets. In 2020, Bagga et al. [\u003cspan class=\"CitationRef\"\u003e35\u003c/span\u003e] established a framework to analyze the effectiveness of various techniques on credit card fraud data. Techniques included ensemble learning, multilayer perceptrons, random forests, Naive Bayes, KNN, AdaBoost, and logistic regression. The efficacy of fraud detection is influenced by the factors and techniques employed.\u003c/p\u003e\n\u003cp\u003eThe study aimed to assist the banking sector in leveraging loan data for credit analysis, providing valuable information to the credit decision-making process. The research supports the banking industry by reducing the time and financial resources spent on loan reviews. Additionally, it minimizes the vulnerability experienced by loan authorities by furnishing them with insights gleaned from previous loan data through data mining methods.\u003c/p\u003e"},{"header":"3. Methodology","content":"\u003cp\u003e\u003cstrong\u003eA. Data sets\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe dataset comes from Kaggle, a platform for machine learning practitioners and data scientists. It contains data from credit loan transactions. Five basic data mining approaches are compared in terms of accuracy. The collection contains 100,000 transaction-related customer records. All dataset attributes are precise, integer-based, and multivariate. Principal component analysis and the autoencoder take continuous (numerical) variables as input. Table \u003cspan class=\"InternalRef\"\u003e1\u003c/span\u003e lists the 10 input parameters used to train and test the model. A hybrid oversampling and undersampling approach creates two distribution groups to handle dataset imbalances. 
The experimental credit loan fraud detection setup uses SAS Enterprise Miner.\u003c/p\u003e\n\u003cdiv class=\"gridtable\"\u003e\n \u003ctable id=\"Tab1\" border=\"1\"\u003e\n \u003ccaption language=\"En\"\u003e\n \u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e\n \u003cdiv class=\"CaptionContent\"\u003e\n \u003cp\u003eDescriptions of features for detecting loan fraud\u003c/p\u003e\n \u003c/div\u003e\n \u003c/caption\u003e\n \u003cthead\u003e\n \u003ctr\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eFeature\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eDescription\u003c/p\u003e\n \u003c/th\u003e\n \u003c/tr\u003e\n \u003c/thead\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eFinancial Report\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eCreditworthiness\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eSalary Per Year\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eannual salary of the debtor\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eTerm\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eWords used to describe different types of loans, such as short-term and long-term\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eLoan Amount at Present Time\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eThe present loan\u0026apos;s amount\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003ePurpose\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eThe Goals of Obtaining a Loan: Consolidating Debt, Purchasing a Vehicle, Funding Higher Education, Paying for Expenses Like Weddings 
and Vacations, Starting a Small Business, Making a Large Purchase, and More\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eMaximum Open Credit\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eMaximum allowable credit\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eExtensive Credit Record\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eDuration of the credit report\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eThe Total Amount Due on Credit Currently\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eThe sum total of all of your outstanding credit card balances\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eBuying a House\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eThere are a few different ways to own a home: through a mortgage, outright purchase, or rental.\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eThere Are Several Credit Issues\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eSeveral credit problems\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n \u003c/table\u003e\n\u003c/div\u003e\n\u003cp\u003e\u003cstrong\u003eB. Related Methods\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eData mining is the process of discovering relationships, trends, and outliers within large databases in order to forecast outcomes. Various approaches can be employed to identify instances of loan fraud and reduce risks associated with this information. 
We have applied and analyzed five data mining algorithms.\u003c/p\u003e\n\u003cul\u003e\n \u003cli\u003e\n \u003cp\u003eDecision trees are a popular method for classification problems in data mining. Using supervised learning, this technique trains the model to classify objects according to the types of data given to it. Picking splits strategically is crucial to a tree\u0026apos;s accuracy. The goal of using criteria such as the Gini index to split nodes into sub-nodes in decision trees is to increase homogeneity, so that each node\u0026apos;s purity with respect to the target variable grows [\u003cspan class=\"CitationRef\"\u003e25\u003c/span\u003e]. The goal is to pick, among all available features, the split that produces the most homogeneous sub-nodes (see Fig. \u003cspan class=\"InternalRef\"\u003e1\u003c/span\u003e).\u003c/p\u003e\n \u003c/li\u003e\n \u003cli\u003e\n \u003cp\u003eThe random forest (RF) is an ensemble classifier that combines several decision trees. The goal of utilizing multiple trees is to train them effectively, with each tree adding structure to the model (see Fig. \u003cspan class=\"InternalRef\"\u003e2\u003c/span\u003e). After the trees are built, their results are combined. Random forests can be applied to both classification and regression problems [\u003cspan class=\"CitationRef\"\u003e36\u003c/span\u003e].\u003c/p\u003e\n \u003c/li\u003e\n \u003cli\u003e\n \u003cp\u003eMultilayer perceptron (MLP) classifiers are feedforward artificial neural networks that map input data to a set of appropriate outputs. The network is made up of three kinds of layers: input, hidden, and output. The input layer receives the signal to be processed [\u003cspan class=\"CitationRef\"\u003e37\u003c/span\u003e]. MLPs are trained with the backpropagation algorithm. Figure \u003cspan class=\"InternalRef\"\u003e3\u003c/span\u003e shows an example of a neural network with an MLP architecture. 
The hidden layer is essential for classifying complex datasets.\u003c/p\u003e\n \u003c/li\u003e\n \u003cli\u003e\n \u003cp\u003eA Bayesian network is a probabilistic model that applies Bayes\u0026apos; theorem to analyze the conditional dependency structure of a collection of random variables:\u003c/p\u003e\n \u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eP(A|B) = P(B|A) \u0026times; P(A) / P(B)\u003c/p\u003e\n\u003cp\u003eFor two events A and B, P(A|B) is the probability of A given that B is true, P(B|A) is the probability of B given that A is true, and P(A) and P(B) are the independent (marginal) probabilities of A and B, respectively [\u003cspan class=\"CitationRef\"\u003e38\u003c/span\u003e]. Training Bayesian classifiers is faster than other methods, although learning them could take more time.\u003c/p\u003e\n\u003cul\u003e\n \u003cli\u003e\n \u003cp\u003eLike other boosting methods, gradient boosting builds a statistical prediction model in stages, and it generalizes them by allowing the optimization of arbitrary differentiable loss functions. 
Gradient boosting is widely used for both regression and classification.\u003c/p\u003e\n \u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eGradient boosting is an ensemble approach that builds on earlier boosting methods [\u003cspan class=\"CitationRef\"\u003e39\u003c/span\u003e]. Given a training dataset D\u0026thinsp;=\u0026thinsp;{(xi, yi)}, i\u0026thinsp;=\u0026thinsp;1, ..., N, the aim is to construct an approximation F(x) of the function F0(x) that maps instances x to their corresponding output values y. This approximation minimizes the expected value of a specific loss function, denoted as L(y, F(x)). Gradient boosting builds the approximation of F0 additively, as a weighted sum of functions, as illustrated in Fig. \u003cspan class=\"InternalRef\"\u003e4\u003c/span\u003e.\u003c/p\u003e\n\u003cp\u003eIt was essential to examine when to use PCA and when to use an autoencoder, two widely employed dimensionality reduction methods in machine learning research. The main distinction between the two methods is that PCA is a linear transformation, whereas an autoencoder can also capture non-linear structure.\u003c/p\u003e\n\u003cul\u003e\n \u003cli\u003e\n \u003cp\u003eAn autoencoder is an unsupervised deep learning method. It is a straightforward feed-forward network whose number of outputs is exactly equal to its number of inputs. The network compresses the higher-dimensional input into a lower-dimensional code from which the input can be faithfully reconstructed. This code is often referred to as a latent space representation, a condensed representation of the input. Because the network is trained to reconstruct its input, training aims to minimize the reconstruction error [\u003cspan class=\"CitationRef\"\u003e40\u003c/span\u003e]. The architecture includes both an encoder and a decoder, as illustrated in Fig. 
\u003cspan class=\"InternalRef\"\u003e5\u003c/span\u003e, which shows a typical autoencoder.\u003c/p\u003e\n \u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eThe encoder and decoder mirror each other. Comprising convolutional blocks and associated components, the encoder receives the input to the autoencoder, and its input layer has as many nodes as the dataset has features. The output of the encoder is passed to the compressed layer, commonly referred to as the latent space.\u003c/p\u003e\n\u003cul\u003e\n \u003cli\u003e\n \u003cp\u003eOne way to reduce dimensionality is principal component analysis (PCA), which projects a high-dimensional feature space onto a reasonably compact set of dimensions. Reducing the number of features or dimensions helps mitigate overfitting and makes the relationships between variables easier to see. When reducing dimensions, it is crucial to lose as little information as possible. PCA was developed for this purpose; it reduces the number of dimensions while preserving as much of the information in the data as possible [\u003cspan class=\"CitationRef\"\u003e26\u003c/span\u003e]. The principal component analysis (PCA) algorithm is shown in Fig. \u003cspan class=\"InternalRef\"\u003e6\u003c/span\u003e.\u003c/p\u003e\n \u003c/li\u003e\n \u003cli\u003eEnsemble model. Ensemble strategies combine different models to improve overall performance. In majority voting ensemble models, the class that receives the most votes from the component models for a test case becomes the final output prediction. More generally, the ensemble creates a new model by applying a function to the posterior probabilities or predicted values of the component models. The majority voting algorithm is shown in Fig. 7.\u003c/li\u003e\n\u003c/ul\u003e"},{"header":"4. 
Results and the suggested model","content":"\u003cp\u003eThe proposed model, shown in Fig. \u003cspan class=\"InternalRef\"\u003e8\u003c/span\u003e, applies a number of data mining techniques: ensemble models, decision trees, gradient boosting, autoencoder, Bayesian networks, and random forests. The best-performing algorithms were identified using a variety of performance metrics, including F1 score, ROC separation, accuracy, cumulative lift, and lift. Identifying fraudulent entities involves several steps: preprocessing, modification, modelling, and evaluation.\u003c/p\u003e\n\u003cp\u003eData cleaning was performed in the preprocessing stage to ensure that the data met the quality standards. In the modification step, we used a variety of strategies to reduce the dimensionality of the initial feature space. We then built the models and evaluated their quality, which allowed us to choose the best model and establish some baseline characteristics for fraud detection. During data cleaning, we used the k-nearest neighbors method to impute missing values, the support vector machine method to remove anomalies, and statistical analysis to remove outliers.\u003c/p\u003e\n\u003cp\u003eDecision trees, artificial neural networks, gradient boosting, random forests, Bayesian networks, and ensemble methods combining decision trees and neural networks were used to solve the classification problem. The proposed classification approach was implemented in SAS Enterprise Miner.\u003c/p\u003e\n\u003cp\u003eTo create new features, SAS Enterprise Miner provides two methods via its \u0026quot;Feature Extraction\u0026quot; node: principal component analysis (PCA) and autoencoder. 
To get the best results, both procedures were used, and then we compared the two.\u003c/p\u003e\n\u003cp\u003eFollowing model training on the training data, hyperparameters were adjusted using the validation set. Table \u003cspan class=\"InternalRef\"\u003e2\u003c/span\u003e displays a performance metric for several approaches.\u003c/p\u003e\n\u003cdiv class=\"gridtable\"\u003e\n \u003ctable id=\"Tab2\" border=\"1\"\u003e\n \u003ccaption language=\"En\"\u003e\n \u003cdiv class=\"CaptionNumber\"\u003eTable 2\u003c/div\u003e\n \u003cdiv class=\"CaptionContent\"\u003e\n \u003cp\u003eSeveral algorithms\u0026apos; performance metrics\u003c/p\u003e\n \u003c/div\u003e\n \u003c/caption\u003e\n \u003cthead\u003e\n \u003ctr\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eModel Name\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eAccuracy\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eFalse Positive Rate\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eF1 Score\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eArea Under ROC\u003c/p\u003e\n \u003c/th\u003e\n \u003c/tr\u003e\n \u003c/thead\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eNeural Network\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.8202\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.52\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eGradient Boosting\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.8606\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.032198\u003c/p\u003e\n \u003c/td\u003e\n 
\u003ctd align=\"left\"\u003e\n \u003cp\u003e0.504996\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.89707\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eGradient Boosting and PCA\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e1\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e1\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e1\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eAutoencoder and\u003c/p\u003e\n \u003cp\u003eGradient Boosting\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e1\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e1\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e1\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eForest\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.8270\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.005393\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.10426\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.827042\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eDecision Tree\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.8240\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.013544\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.130786\u003c/p\u003e\n \u003c/td\u003e\n 
\u003ctd align=\"left\"\u003e\n \u003cp\u003e0.752008\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eEnsemble\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.822\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.759661\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eBayesian Network\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.8033\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.119992\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.44835\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.759582\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n \u003c/table\u003e\n\u003c/div\u003e\n\u003cp\u003eThe performance measures are depicted visually in Fig. \u003cspan class=\"InternalRef\"\u003e9\u003c/span\u003e, which presents the accuracy, false positive rate, F1 score, and area under the ROC achieved by each binary classification model.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eA. Cumulative lift\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eTo calculate the cumulative lift, we sort the observations in each partition in descending order of the predicted probability of the target event. For the target Credit Status, this is the predicted probability of the event \u0026quot;Fraud\u0026quot; (P Credit Status fraud). 
The data are then divided into 20 quantiles (demideciles), each of which represents 5% of the entire data, and the number of events in each quantile is counted.\u003c/p\u003e\n\u003cp\u003eCumulative lift values for several algorithms in the train and validation partitions are shown in Fig. \u003cspan class=\"InternalRef\"\u003e10\u003c/span\u003e. The cumulative lift for a quantile is the ratio of the number of events in all quantiles up to and including the current one to the number of events that would be expected there under random selection. Equivalently, it is the ratio of the cumulative response rate to the baseline response rate. The cumulative lift at depth 10 covers the first two quantiles, that is, the top 10% of the data, and compares them with a random 10% of the data. Cumulative lift values indicate how much more likely it is to observe an event in these quantiles than when selecting observations at random.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eB. Lift measure\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eTo compute the lift measure, each partition is sorted in descending order of the predicted probability of the target event. P credit status fraud \u0026quot;Yes\u0026quot; denotes the predicted probability that the fraud event occurs. We then counted the events in each of the 20 quantiles (demideciles), each representing 5% of the total data.\u003c/p\u003e\n\u003cp\u003eLift for a quantile is the ratio of the number of events in that quantile to the number that would be expected under random selection, or equivalently the ratio of the response percentage of the quantile to the baseline response percentage. With 20 quantiles, random selection would place 5% of the events in each quantile. Lift values therefore indicate how much more likely it is to observe an event in a given quantile than when selecting observations at random. As can be seen in Fig. 
\u003cspan class=\"InternalRef\"\u003e11\u003c/span\u003e, different algorithms have different lift measure values in the train and validation partitions.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eC. Sensitivity assessment\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe ROC curve plots sensitivity against 1-specificity over a range of cut-off values; both measures are derived from the confusion matrix. The Kolmogorov-Smirnov (KS) reference line is drawn at the 1-specificity value where the validation partition shows the greatest difference between sensitivity and 1-specificity, which helps in choosing an appropriate cut-off. The sensitivity measures for all the different methods in the training and validation sets are displayed in Fig. \u003cspan class=\"InternalRef\"\u003e12\u003c/span\u003e.\u003c/p\u003e\n\u003cp\u003eFigure \u003cspan class=\"InternalRef\"\u003e12\u003c/span\u003e shows that the sensitivity values differ across algorithms in both the train and validation partitions. In statistical terms, the Kolmogorov-Smirnov test compares the empirical distribution function of a sample with the cumulative distribution function of a reference distribution, or the empirical distribution functions of two samples. The difference in fit is considered significant when the KS value is less than 0.05.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eD. Accuracy\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe accuracy metric, which is assessed at multiple cut-off values, is the proportion of observations that are correctly classified as events or nonevents. Cut-off values range from 0 to 1 in increments of 0.05. At each cut-off value, the predicted target category is determined from the predicted probability of the event. 
The likelihood that the desired credit status event \u0026quot;yes\u0026quot; will occur in this instance is the cut-off value, and it is written as P credit status fraud. P credit status fraud is classified as having happened in the anticipated category if it is larger than or equal to the cut-off figure; if not, it is considered to have been insignificant. Correct sorting is defined as true positives for the predicted classification and true negatives for the original categories, respectively. If the perceived sorting differs from the real classification, the observation is sorted wrongly. To find the precision, use the following formula. The accuracy measurement findings for the training and validation partitions of different methods are variable, as Fig. \u003cspan class=\"InternalRef\"\u003e13\u003c/span\u003e illustrates.\u003c/p\u003e\n\u003cp\u003e\u003cimg src=\"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAfAAAABYCAYAAAAKlKXRAAAAAXNSR0IArs4c6QAAAARnQU1BAACxjwv8YQUAAAAJcEhZcwAADsMAAA7DAcdvqGQAABV/SURBVHhe7d3hixvH3Qfw7z7vW3d170oJ4VZ9EVq4Eq/skp5f9CBecc2LllxZlZZSsCFeYQwN7p0ruS9KU+e0aQmU+k4qZzCFVnuxoab4VN8ZbKj0orG3RnpR+iK3wgSTV1IUJ3/APC+iGXZXupPufE5vdd8PLIlmZndnZ+X97c7OnDQhhAARERElyv/FE4iIiOjwYwAnIiJKIAZwIiKiBGIAJyIiSiAGcCIiogRiACciIkogBnAiIqIEYgAnIiJKIAZwIiKiBGIAJyIiSiAGcCIiogRiACciIkogBnAiIqIEYgAnIiJKIAZwomegaRqy2Ww8eSjXdaFpGhqNRjyLJlij0YCmaXBdN55F9EwYwImIiBJIE0KIeCLRKK7r4t69e7hz504868g7rG2TzWYxNzeHxcXFeNaOGo0GTp06BV4mRmNb0ReNT+C0Lzdu3IgnUd8ktc0//vGPeBLtgG1FXzQGcNqTVquFTCYD3/exubkJTdPUe+BWq4VUKqXe98my8nMulxt4Z+x5ntpG/N1wq9VS62iahlwuh263Gykjhfddq9XgeR7S6TQ0TUM6nUar1YqvAtd1VZlUKoV8Pj+w/XAZTdOQyWRQq9UAQKXL49mtbeLHGa5vKpWC53lAvz1kejqdHlqPdDqtyj9P2WwWV65cAfrv+uUy6jxXKhVVVoqvE7aX84xYu7daLWSz2YF2DBvVdt1uF/l8XtUvvsgylUpFHaemaSgWi2ob47YVAOTzeZWfyWTUNsLblnXca9vQESOI9sGyLGFZVjxZBEEgAAjbtoXjOKJerwvTNEWpVBJCCGGa5sB6pVJJABD1el2lBUEgdF0XjuOITqcjgiAQhmEMrBu2sbEhAAjTNEW5XBaivx3DMISu66LT6aiyjuMIXdfVPuv1utB1XZimqcrE61Wv1wUAdSxCiKF12qltqtXqyO2J/n7D9S0UCpG6xus1LsuyBvY1itxX3KjzXC
gUBtYbdrz7Oc+i3+6GYYhCoRBZT9f1SLlx2s6yLGEYhgiCQAghRLlcHqhnuI6dTkc4jiMAiI2NDVVmVFuFt2dZ1tCy4XL7bRs6Oga/QURj2ClIif5FaKe8YesNu6jKADusXLPZjKRLMkAUCoVIugyc4aA+rJy8cFerVSFCF9l44I9fiOPHMyxNhOoXPk7DMCI3DUIIYdt25CIerruEfvDci4MM4GLEeR623rAAvp/zLEJBNyy+3rhtF6+T2OFGM2zYsQw7ZileVn4nwzcAzWYzcuO237aho4Nd6PRczM3NxZP2xPM8nDhxIpL2yiuvAAD+9a9/RdLjjh07Fvn8ta99DQDw9OlTAMDdu3cBAN/5znci5b7xjW8AAD788EMAwIULFwAAJ0+ehOu6aLfbWFlZ2dMgsFFs24bv+2i32wCAdruNra0tnDlzBgjVVdZNsiwLW1tbkbSweDewpmnY3NzE0tLSQHq8S3sv/pfnOfyKIeyzzz4DnqHtvgivvvoqAOD27dsqrVKp4NKlS5iamgKesW3oaGAAp0Op1+tF3iNrmoZTp04BoUC8X3L9L3/5y/GsiPn5eTSbTZimiaWlJRiG8UzBbph4oL558yZyuZy6iMu6njp1aiAY93q90Jai+r1rkcWyLJRKpYH0g7wh2asv4jyPajvLsrC8vKzGSXieB9/38bOf/UyVke/As9ks0um0quN+TU1NwXEc9a672+3C8zwsLCyoMs+zbWgyMIDToeU4zkCwOciA8+mnn8aTBszMzMDzPARBANu2sbS0NDAI6llMT0/DNE2sra0BAJaXl3Hx4sV4MQRBMNAOYkKmKz3v8zyq7VZWVpBKpfCtb30Lmqbh8uXLKJVKyOVyQD+4njx5Eq1WC3/5y1+wvb2Ner0e2sP+vPbaa+j1eqjVaurGbXp6OlLmebcNJRsDOO3bgwcP4klj2dzcjHwe9jRh2/aBdXM+efIEAPDNb34T6D9xIdZ9Oaycpmmqa3t6ehpXr14FQl3su9lL25w9exa+78N1XZw+fTpyEZd13cv2nodho/jHEZ5ZMOyG6SDPc9y4bZfP5/HWW2+p4Li9vR0JkNeuXUMQBPjNb36jekZ2M25bzc/PQ9d13L59G++88w7eeOONSP7zbBuaDAzgtC/Hjx9XTw/dblc9rciA9/jx49ganzt+/DjQ76YEgFqthn//+99A7AL/05/+FEEQRKZ2NRoNZLNZtY+d/OlPf1KBo9FoIJ/PwzRNzM/PA/2nasuysLq6qqaEyXKWZaly6F9E5f6vXbsGhN5DAsD29ja2t7fVZ+zSNvIGQf5Xkt2my8vLOH/+fCRvZmYGpmkin8+rY+p2u3BdNzKN6XmRNzPvvfce0A927XZ75HmW6/31r38F+kFN3jCFb9j2e56Htbvcrvwejdt28RvKODmm4ubNm0D/Oy6PK3ws+2kr2Y2eTqcxMzMTydtv29AREh/VRjSOTqejRmlblqVGxeq6LgAIAMJxnPhqkfUMwxDlclmN6NV1PTK6dmNjQ5imqbZn2/auo2/ldkzTFIZhqG3KaThhciqQrK+u62pKklQoFCL7l/WVbNuO1E0a1jbNZlOVxZBRxLZt7zjqOV5XwzAGRk2PYz+j0EVoSphhGGrU9KjzLELrhdtWriNH+ot9nOdh7S5Hdcv9yfXHaTs5JSy+2LYdGREuty2/T7Jc+Dsxqq3iI+Ll92KnKYF7bRs6WvinVGliyD9lWSqV+I6QxtJqtfDd734X9+/fjzwBy/RcLoeVlZXIOkSHBbvQiejIevvtt3HixImB7uuZmRkYhoGPP/44kk50mDCA08SQ7z6HDYojGubll1/G5uYmKpWKes8s35P7vj8wJoHoMGEXOk0E2eUp5/dWq1U1eIxoN67r4saNG/B9X6XZto3z589jdnY2UpboMGEAJyIiSiB2oRMRESUQAzgREVECMYATERElEAM4ERFRAjGAExERJRADOBERUQIxgBMRESUQAzgREVECTV
QAb7fb0DSNf4GLiChB5E+qyp9JbbfbKBaLyGQy0DRNXdfDv7VeqVSQy+Uivzl/1EzUX2KrVCo4d+4cAKDT6WBqaipehIiIDolut4tsNgvDMHD16lV1zdY0DYZh4Pr165idnVW/NKjrOj744ANVrtFo4Oc//zlOnz6N3/72t7GtT76JegJ/5513UCgUAADvv/9+PJuIiA6RH//4xzAMA57nDTxw/eEPf1B/i352dhaO46DX6+G///2vKjM7O4s7d+5gdXUVlUoltPbRMDEBXHatvPnmmwCA27dvx0oQEdFh4XkeNjc3cfXq1XgWhBCYn5+PpL344ouRz9LU1BRWVlZw7tw51QV/VExMAK9UKrBtG1NTU7AsC57nxYsA/UCfy+XUe5VUKoVisajyu90u8vk8UqmUKpPNZgFApbmuq8rn83lVbqe0YrGo1pU8z4vUI5vNqp8zlHaqa6VSiaTJm5fwfjkOgIgOs8uXL8NxnIEn7508fvwYAPDVr341noVcLgdd13Ht2rV41mQTE0LXdREEgRBCiHK5LACIZrMZKRMEgdB1XTiOIzqdjuh0OsK2bWFZlipjmqYwTVNtq1QqCdlMnU5HABClUkmVF0IIy7JUGclxHAFAOI4jNjY2VJ2EEMK27cg+ZF6hUFDrj6qr3H69XlfriH5dDMMQnU4nkk5EdFg0m00BQFSr1XjWUJ1OR10Pd2LbtjAMI5480SYigFerVWGapvosvxzhgCj6QU/X9UhaWLVaHRr4w8YN4DLwxwPsTgBEbiRG1VUeY7lcjqQbhrHrPwoAYy/x4yQiOgjyoWXc66N86NntwURec3crM2kmogv91q1bOHv2rPo8MzMDwzCwtbUVKed5Hk6cOBFJC7t16xbQX/+gyEEYezWqrjMzMzBNE2traypNTqfYrfu8f9M21rK4uBhfnYjomT19+hQY8/roui62trawtrY2Vnd7eJDbpEt8AO92u1hfX8e5c+fU+19N0xAEAXzfjwxq6PV6mJubi6wf9sknn8CyrHjycyHfgct5jnGj6goAZ8+ehe/76h34H//4R/ziF7+IFyMiSiTP87C8vIz79+8f6IPVpEh8AL958yZM0xx4eqxWqwCAu3fvRso/evQo8jnuwYMH8aQDl8vl8Pvf/x5XrlzBw4cP8XnP9qBRdV1YWAAAvPfee2i329ja2lJpOwnf5IxawoP1iIi+SK1WC/l8HisrKwzeO0h8AF9bW8MPf/jDeDJeffVVAMD9+/dVmmVZ8H0/VCpqbm4OvV5v5FSEe/fuRT7HR4/vptFoYH19Hb/+9a8xPT0dz1ZG1RX96ROO42B1dRW/+93vcOnSpZFdTPEbnd0WdqET0fNw7NgxIPTaL67b7eL111+H4ziRV4Ke5418sHjppZfiSRMr0QG8VqvB93288MIL8SwVyNbX11UX84ULFxAEQWTamOd56vPCwgJ0XUc+n1dBudFoRL5ApmniwYMHapvhL1M4kMspD/GbgS996UsAgD//+c9Af51isQjDMCLrj6qr9Nprr6HX62F1dRVnzpyJ5BERHUbf/va3AQBPnjyJZwGAmg4W/utq3W4X169fD5WKevToEQzDGPkQM1Hio9qSQo4Yl0t85LVhGCpP13U1srxarao8XddFqVSKjFpsNpvCNE21ruM4arqXzJfrm6Yp6vW6Gv0opzDIEZbxfUvlclnoui4ACNu2RbPZVCPZbdtW5UbVVTIMY9fpFQcp3OZcuHD53y5JttN1S04Zix+rXOIzbyRd1wdmHk26ifpb6EdRt9vF17/+dfi+v2uXPBHRYeJ5Hn70ox8hCIJnvnYd5LaShAE8gWS3/eLiIlzXxePHj7GyshIvRkR0qGUyGfW30PdLPsRcunTpyI3bSfQ78KPsxo0bKBaLWF5exsWLF+PZRESH3p07dxAEAXK53J4GA0utVgvZbBaO4xy54A0G8GQ6duwYfN/H+vo6/v73vx+pLiMimhxTU1N4+PAhvv/97yObzQ
4M+t1NpVLB22+/jXffffdI/pQo2IVORESUTHwCJyIiSiAGcCIiogRiACciIkogBnAiIqIEYgAnIiJKIAZwIiKiBGIAJyIiSiAGcKLnIJPJqN9Vf1bpdBqapiGbzcazJpLrutA0bcefmiSizzGAEz0HDx8+hGma8eR9ef/99+NJREQM4DTZnvVJLpvNRn7zfS8O6neJD2o7h43rukN7FRYXFyGEwOzsbDyLiEIYwGli1Wq1eNKedLtdbG5uxpPpgNy4cSOeRER7wABOE6lYLOInP/kJAODUqVPqfbR8Gu92u8jn80ilUtA0Del0OvKk7XkeTp48CQBYWlpS68sy7XYbxWJRvZ9OpVL7/knEWq0WeWeey+XQarXixYDQry/JfcZ7B1qtFnK5nNpWKpVCLpeLlAnvL5VKoVgsqjzP8yJt5XmeOsbr16+r9gofr+d5kXbEiPZptVrIZDLwfR+bm5tqf9lsdmD/YfFjy2QykZu0Vqul6uG6Lmq1mtp/Op0eaFPXdVX+sO0RHXqCaELV63UBQNTr9Uh6p9MRpmkK0zRFEARCCCHK5bIAIAqFQqQsAFEqlSJpzWZTABDlclkIIUQQBMI0TQFAdDodVc6yLDHqn1i1Wh26LV3XVd1Evx6GYYhCoSA6nY7odDrCcRwBQFSrVSH6x6XrurBtW5WxbTtSh42Njcj+ZBuFj1HWybZtUS6XxcbGhtB1XdTr9aHlhRCiVCoJXddFp9PZU/tYlhXayufk/sPnrdlsCl3XheM4A8e/sbGhygVBIAAI0zQHjtG2bVWuVCpF9rHTcREdZrtfXYgSbKcALgNE+MIvhFDBLh44x7moxwOCGDOAG4YhTNOMpMkA6DiOSpNBKazT6QgAKgjK4w3fhMiAJhmGEQlkol9PXdfV51HBbFidbdvesbzYpX2GBfBh500G6/ANgBBC6LouDMOIpMXbTgw5RnluwttzHGfXYyA6bNiFTkfOrVu3AEB1kUsvv/wyAOCjjz6KpD8vrVYLQRDg9OnTkfSZmRmg3w0dFh/MNjU1Bcuy1Hv62dlZmKaJK1euIJfLoVarYXp6GvIXg+X+5HFKc3Nz6PV6A13Mr7zySuSzZNs2fN9X9Wu329ja2sKZM2fiRQ+M53kwTXOgDU6cOIEgCCJpAPDiiy/Gk9Dr9dT/X7hwAeh/B1zXRbvdxsrKChYXF0NrEB1uDOB05HzyySfAkIC4F/IdbzabRSqVwtLSUrzISJ999hkA4NixY/GsfXv48CFKpRJ838f3vvc9ZDIZFZjl/sLv9DVNU3WX+aPIQH337l0AwM2bN5HL5SLteRDtE9br9Z7pfMXNz8+j2WzCNE0sLS3BMIyB8QREhx0DOB1Z3W43njSWRqMB0zTxwgsv4M6dO/j4449RKpXixcb29OnTeNKe6Loe+by4uIjt7W1Uq1UEQYDXX389kl+tVtF/fRZZxp22NT09DdM0sba2BgBYXl7GxYsXVf5Bt4+03/O1k5mZGXiehyAIYNs2lpaW9j0Qkeh/gQGcJt5//vOfyOcf/OAHQOgJUnr8+DEA4KWXXoqkP3r0KPL5rbfegmEYeOONNyLpezU7Owtd17G+vh5Jl13Tx48fj6THA5ic5ia74F3XRT6fV/m5XA6O46guZrm/f/7zn6rMfp09exa+78N1XZw+fRrT09Mqby/t8+DBg3jSUPFue2l7e3tffzBH0zS1renpaVy9ehUA8OGHH8ZKEh1eDOA0sWQg/tvf/oZutwvP8+B5HhYWFqDrOi5fvqy6lyuVClZXV1EqlSJdtaZpYmtrC+12G61WC8ViEV/5ylfg+75at9Fo4N69ewCATz/9VK0rA2488IZdunQJQRCoqVztdhu2bUPXdbz55puRsr7vo1KpAP1t/upXvwIA/PKXv1RlVldX1VNku93G+vo6LMtS+Y7jYHV1VW0H/ffL4T+o8uTJk8h/h1lYWAD6T9/nz5+P5I3bPsePH0ev10OtVkO321XT3YbtX+4jn8
+j2+2qaYBBEODdd99V5WRQljdj0rBzYdu2+nzt2jVgl/f+RIdSfFQb0SSpVqtC13Wh63pkhHGz2VSjztGfoiWnHYU1m01hGIYa2dzpdEQQBGoUs2EYolQqqZHTAESz2VSjpmWZ3ZRKJbUP9Kc7hUfCi/5o62q1GqmzZVmRkdr1el3Yti10Xd91W+H9hadmidAIeLnIKWrD2LY9dBT5OO0j+qPoZTnLskSz2RzYvywr+lPg5HQ09Eflx2cYhI9dns9h56JQKES2tdP5JzrMNCGHqBIREVFisAudiIgogRjAiYiIEogBnIiIKIEYwImIiBKIAZyIiCiBGMCJiIgSiAGciIgogRjAiYiIEuj/AZ9ozRQkGprxAAAAAElFTkSuQmCC\"\u003e\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eE. F1 score\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe F1 score is a classification metric that is derived from the confusion matrix and evaluated at different cut-off values. It combines precision and recall (sensitivity). Cut-off values range from 0 to 1 in increments of 0.05. At each cut-off value, the predicted target category is determined by whether P credit status fraud, the predicted probability of the fraud event \u0026quot;yes\u0026quot;, is greater than or equal to the cut-off value. When it is, the observation is predicted as an event (positive); otherwise, it is predicted as a nonevent (negative). The F1 score values for the training and validation partitions of the various techniques are displayed in Fig. \u003cspan class=\"InternalRef\"\u003e14\u003c/span\u003e.\u003c/p\u003e"},{"header":"5. Models Discussion","content":"\u003cp\u003eTables\u0026nbsp;\u003cspan refid=\"Tab3\" class=\"InternalRef\"\u003e3\u003c/span\u003e\u0026ndash;\u003cspan refid=\"Tab8\" class=\"InternalRef\"\u003e8\u003c/span\u003e show the statistical criteria on which the optimal model for production deployment is selected. These measures include root average squared error, average squared error, misclassification rate, Gini coefficient, multi-class log loss, and Kolmogorov-Smirnov (KS).\u003c/p\u003e\u003cp\u003e\u003cb\u003eA. 
Gini coefficient\u003c/b\u003e\u003c/p\u003e\u003cp\u003eThe Gini coefficient is a metric utilized to assess the level of discrimination within a population.\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab3\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 3\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003eGini coefficient\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"3\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u003cp\u003eName of Model\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u003cp\u003eTrain\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c3\"\u003e\u003cp\u003eValidate\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eNeural Network\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e0\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e0\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eForest\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e0.69341\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e0.65489\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eEnsemble\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e0.52656\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" 
colname=\"c3\"\u003e\u003cp\u003e0.51932\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eBayesian Network\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e0.52201\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e0.51917\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eGradient Boosting\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e0.83017\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e0.794035\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eAutoencoder and Gradient Boosting\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e1\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e1\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003ePCA and Gradient Boosting\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e1\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e1\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eDecision Tree\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e0.51265\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e0.5041\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/colgroup\u003e\u003c/table\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003eThe Gini coefficient is an equality measure ranging from 0 to 1, where 0 signifies perfect equality and 1 represents perfect discrimination [\u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e23\u003c/span\u003e]. 
A higher Gini therefore indicates a model that separates the classes better; the neural network's Gini of 0 in both partitions indicates that it does not discriminate between the classes.\u003c/p\u003e\u003cp\u003e\u003cb\u003eB. Misclassification rate\u003c/b\u003e\u003c/p\u003e\u003cp\u003eThe misclassification rate is a performance metric that gives the proportion of incorrect predictions, without distinguishing between false positives and false negatives [\u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e24\u003c/span\u003e].\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab4\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 4\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003eMisclassification rate\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"3\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u003cp\u003eName of Model\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u003cp\u003eTrain\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c3\"\u003e\u003cp\u003eValidate\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eNeural Network\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e0.47041\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.47040\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eForest\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" 
colname=\"c2\"\u003e\u003cp\u003e0.22146\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.22757\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eEnsemble\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e0.23166\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.23303\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eBayesian Network\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e0.46404\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.47103\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eGradient Boosting\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e0.16720\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.18270\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eAutoencoder and Gradient Boosting\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e0.00273\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.00287\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003ePCA and Gradient Boosting\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e0.00264\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.00290\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eDecision 
Tree\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e0.23036\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.23147\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/colgroup\u003e\u003c/table\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003eIn the validation partition, the Autoencoder and Gradient Boosting model (0.00287) and the PCA and Gradient Boosting model (0.00290) have the lowest misclassification rates of all the models compared.\u003c/p\u003e\u003cp\u003e\u003cb\u003eC. Average Square Error (ASE)\u003c/b\u003e\u003c/p\u003e\u003cp\u003eAn Average Square Error (ASE) of zero indicates a perfect model; the lower the value, the better the model's performance [\u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e25\u003c/span\u003e, \u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e26\u003c/span\u003e].\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab5\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 5\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003eAverage Square Error\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"3\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u003cp\u003eName of Model\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u003cp\u003eTrain\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c3\"\u003e\u003cp\u003eValidate\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" 
colname=\"c1\"\u003e\u003cp\u003eNeural Network\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e0.11569\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.11569\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eForest\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e0.03885\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.04218\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eEnsemble\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e0.01399\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.01405\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eBayesian Network\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e0.00213\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.00222\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eGradient Boosting\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e0.05503\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.05535\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eAutoencoder and Gradient Boosting\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e0.04985\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" 
colname=\"c3\"\u003e\u003cp\u003e0.05193\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003ePCA and Gradient Boosting\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e0.07020\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.07035\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eDecision Tree\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e0.10308\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.10435\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/colgroup\u003e\u003c/table\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003eRoot average squared error (RASE) is the square root of ASE; like ASE, it penalizes large errors more heavily than small ones.\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab6\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 6\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003eRoot average squared error\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"3\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u003cp\u003eName of Model\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u003cp\u003eTrain\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" 
colname=\"c3\"\u003e\u003cp\u003eValidate\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eNeural Network\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e0.34014\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.34013\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eForest\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e0.19711\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.20538\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eEnsemble\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e0.11827\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.11855\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eBayesian Network\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e0.04611\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.04707\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eGradient Boosting\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e0.23458\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.23526\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eAutoencoder and Gradient Boosting\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" 
colname=\"c2\"\u003e\u003cp\u003e0.22326\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.22788\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003ePCA and Gradient Boosting\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e0.26495\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.26524\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eDecision Tree\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e0.32107\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.32303\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/colgroup\u003e\u003c/table\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003eIn the validation partition of Tables 5 and 6, the Bayesian Network records the lowest ASE (0.00222) and RASE (0.04707); the PCA and Gradient Boosting model records 0.07035 and 0.26524, respectively.\u003c/p\u003e\u003cp\u003e\u003cb\u003eD. 
Multi-Class Log Loss\u003c/b\u003e\u003c/p\u003e\u003cp\u003eMulti-class log loss is the multi-class extension of binary log loss: it averages, over all observations, the negative logarithm of the probability that the model assigned to the true class.\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab7\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 7\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003eMulti-class log loss\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"3\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u003cp\u003eName of Model\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u003cp\u003eTrain\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c3\"\u003e\u003cp\u003eValidate\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eNeural Network\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e1.37291\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e1.37281\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eForest\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e0.38279\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.41653\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eEnsemble\u003c/p\u003e\u003c/td\u003e\u003ctd 
align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e0.30525\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.30596\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eBayesian Network\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e0.09472\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.09574\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eGradient Boosting\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e0.53605\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.54760\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eAutoencoder and Gradient Boosting\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e0.48040\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.50637\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003ePCA and Gradient Boosting\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e0.83034\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.83315\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eDecision Tree\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e0.93514\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" 
colname=\"c3\"\u003e\u003cp\u003e0.95596\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/colgroup\u003e\u003c/table\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003eLower Multi-Class Log Loss (MCLL) values indicate a better fit, and values below 0.1 are taken here as a good fit. In Table 7, only the Bayesian Network falls below this threshold (0.09574 on the validation partition); the PCA and Gradient Boosting model records 0.83315.\u003c/p\u003e\u003cp\u003e\u003cb\u003eE. Kolmogorov-Smirnov (KS)\u003c/b\u003e\u003c/p\u003e\u003cp\u003eIn model assessment, the Kolmogorov-Smirnov (KS) statistic compares the cumulative score distributions of events and non-events; the same test is also widely used to check whether a sample follows a reference distribution such as the normal distribution.\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab8\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 8\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003eKS (Youden)\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"3\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u003cp\u003eName of Model\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u003cp\u003eTrain\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c3\"\u003e\u003cp\u003eValidate\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eNeural Network\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e0.0000\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.0000\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eGradient 
Boosting\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e0.6784\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.6496\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eAutoencoder and Gradient Boosting\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e1.0000\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e1.0000\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003ePCA and Gradient Boosting\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e1.0000\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e1.0000\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eDecision Tree\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e0.4877\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.4821\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eForest\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e0.5498\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.5196\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eEnsemble\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e0.4877\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.4818\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" 
colname=\"c1\"\u003e\u003cp\u003eBayesian Network\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e0.4728\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.4719\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/colgroup\u003e\u003c/table\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003eThe Kolmogorov-Smirnov statistic is the largest difference between the cumulative distributions of events and non-events: a value of 1 indicates perfect separation of the two classes, and 0 indicates no separation. In both partitions, the Autoencoder and Gradient Boosting model and the PCA and Gradient Boosting model reach a KS of 1.0000, the highest values in Table 8.\u003c/p\u003e"},{"header":"6. Conclusion","content":"\u003cp\u003eThe lending industry's viability has been severely threatened by the recent rise in loan repayment defaults, which has been accompanied by a sharp increase in bank loan fraud. According to recent studies, current risk assessment techniques can miss important signals and risk indicators that affect repayment. Bank loan fraud has increased dramatically with the introduction of new technologies, costing financial institutions substantial sums every year. Installing a fraud detection system that monitors all transactions in real time is vital for preventing and deterring fraud.\u003c/p\u003e\u003cp\u003eTo protect their reputations, financial institutions should invest in fraud management technology. This will not only minimise risks but also lower fraud expenses. To address these loan fraud problems, an ensemble model that combines different data mining techniques was used. The best data mining techniques were selected based on performance metrics such as accuracy, lift, cumulative lift, and F1 score. 
To further support the selection of the best model for production use, fit statistics were also considered, including average square error, misclassification rate, Gini coefficient, root average squared error, multi-class log loss, and Kolmogorov-Smirnov (KS).\u003c/p\u003e\u003cp\u003eFor the error-based statistics (average square error, root average squared error, misclassification rate, and multi-class log loss), a smaller value indicates a better-fitting model. Nevertheless, no single model can currently be considered best for every business situation. Still, the chosen model worked well for most purposes and produced the expected results.\u003c/p\u003e\u003cp\u003eFinally, the best model for identifying loan fraud was the combination of the autoencoder and the gradient-boosting classifier, which achieved the greatest accuracy. Based on these results, autoencoders combined with gradient boosting could substantially improve the classification-based fraud detection strategies used in the banking sector today.\u003c/p\u003e\u003cp\u003e\u003cb\u003eFunding\u003c/b\u003e\u003c/p\u003e\u003cp\u003eThe authors extend their appreciation to the Deanship of Research and Graduate Studies at King Khalid University for funding this work through the Small Research Project under grant number \u003cem\u003eRGP1/38/46\u003c/em\u003e.\u003c/p\u003e\u003cp\u003e\u003cb\u003eEthics\u003c/b\u003e\u003c/p\u003e\u003cp\u003eThe authors declare that the work presented in this manuscript is original and has been conducted in accordance with ethical standards. No human participants, animals, or identifiable data were involved in this study. 
All data used are publicly available on the Kaggle website, and no ethical approval was required.\u003c/p\u003e"},{"header":"Abbreviations","content":"\u003cdiv class=\"DefinitionList\"\u003e\u003cdiv class=\"DefinitionListEntry\"\u003e\u003cdiv class=\"Term\"\u003eDT\u003c/div\u003e\u003cdiv class=\"Description\"\u003e\u003cp\u003eDecision Trees\u003c/p\u003e\u003c/div\u003e\u003c/div\u003e\u003cdiv class=\"DefinitionListEntry\"\u003e\u003cdiv class=\"Term\"\u003eANN\u003c/div\u003e\u003cdiv class=\"Description\"\u003e\u003cp\u003eArtificial Neural Networks\u003c/p\u003e\u003c/div\u003e\u003c/div\u003e\u003cdiv class=\"DefinitionListEntry\"\u003e\u003cdiv class=\"Term\"\u003eGA\u003c/div\u003e\u003cdiv class=\"Description\"\u003e\u003cp\u003eGenetic Algorithm\u003c/p\u003e\u003c/div\u003e\u003c/div\u003e\u003cdiv class=\"DefinitionListEntry\"\u003e\u003cdiv class=\"Term\"\u003ePCA\u003c/div\u003e\u003cdiv class=\"Description\"\u003e\u003cp\u003ePrincipal Component Analysis\u003c/p\u003e\u003c/div\u003e\u003c/div\u003e\u003cdiv class=\"DefinitionListEntry\"\u003e\u003cdiv class=\"Term\"\u003eRF\u003c/div\u003e\u003cdiv class=\"Description\"\u003e\u003cp\u003eRandom Forest\u003c/p\u003e\u003c/div\u003e\u003c/div\u003e\u003cdiv class=\"DefinitionListEntry\"\u003e\u003cdiv class=\"Term\"\u003eKS\u003c/div\u003e\u003cdiv class=\"Description\"\u003e\u003cp\u003eKolmogorov-Smirnov\u003c/p\u003e\u003c/div\u003e\u003c/div\u003e\u003cdiv class=\"DefinitionListEntry\"\u003e\u003cdiv class=\"Term\"\u003eASE\u003c/div\u003e\u003cdiv class=\"Description\"\u003e\u003cp\u003eAverage Square Error\u003c/p\u003e\u003c/div\u003e\u003c/div\u003e\u003cdiv class=\"DefinitionListEntry\"\u003e\u003cdiv class=\"Term\"\u003eRASE\u003c/div\u003e\u003cdiv class=\"Description\"\u003e\u003cp\u003eRoot Average Squared Error\u003c/p\u003e\u003c/div\u003e\u003c/div\u003e\u003cdiv class=\"DefinitionListEntry\"\u003e\u003cdiv class=\"Term\"\u003eMCLL\u003c/div\u003e\u003cdiv 
class=\"Description\"\u003e\u003cp\u003eMulti-Class Log Loss\u003c/p\u003e\u003c/div\u003e\u003c/div\u003e\u003c/div\u003e"},{"header":"Declarations","content":"\u003ch2\u003eAuthor Contribution\u003c/h2\u003e\u003cp\u003eThe authors contributed equally to the preparation of the manuscript, including analysis, scientific formalization, methodology, numerical experiments, analysis of results, writing, and reviewing.\u003c/p\u003e\u003ch2\u003eData Availability\u003c/h2\u003e\u003cp\u003eThe data is publicly available on the Kaggle website.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\n \u003cli\u003eJurgovsky, Johannes, \u0026ldquo;Sequence classification for credit-card fraud detection,\u0026rdquo; Expert Systems with Applications, 2018, pp.234-245.\u003c/li\u003e\n \u003cli\u003eAishwarya, S.Dhivya, K. Devika Rani, \u0026ldquo;Online payment fraud prevention using cryptographic algorithm TDES,\u0026rdquo; Int J Comput Sci Mob Comput, Vol.4, No.4, 2015, pp.317-323.\u003c/li\u003e\n \u003cli\u003eAgwu, M. Edwin, \u0026ldquo;Reputational risk impact of internal frauds on bank customers in Nigeria,\u0026rdquo; International Journal of Development and Management Review, Vol.9, No.1, 2014, pp.175-192.\u003c/li\u003e\n \u003cli\u003eAgaba; CALEB, Tamwesigire; ETON, Marus, \u0026ldquo;Credit Risk Management Practices and Loan Performance of Commercial Banks in Uganda,\u0026rdquo; 2022.\u003c/li\u003e\n \u003cli\u003eRashid, Md Abdur, \u0026ldquo;An overview of corporate fraud and its prevention approach,\u0026rdquo; Australasian Accounting Business \u0026amp; Finance Journal, Vol.16, No.1, 2022, pp.101-118.\u003c/li\u003e\n \u003cli\u003eThomas, Manju Susan; Mathew, Juby, \u0026ldquo;Supervised Machine Learning Model for Automating Continuous Internal Audit Workflow,\u0026rdquo; In: 2022 6th International Conference on Trends in Electronics and Informatics (ICOEI), IEEE, 2022, pp.1200-1206. 
\u003c/li\u003e\n \u003cli\u003eFahd Sabry Esmail, Fahad Kamal Alsheref, Amal Elsayed Aboutabl, \u0026ldquo;Review of Loan Fraud Detection Process in the Banking Sector Using Data Mining Techniques,\u0026rdquo; International Journal of Electrical and Computer Engineering Systems, Vol.14, No.2, 2023, pp.229-239.\u003c/li\u003e\n \u003cli\u003eMOROKE, Ntebogang Dinah; MAKATJANE, Katleho, \u0026ldquo;Predictive Modelling for Financial Fraud Detection Using Data Analytics: A Gradient-Boosting Decision Tree,\u0026rdquo; Applications of Machine Learning and Deep Learning for Privacy and Cybersecurity. IGI Global, 2022, pp.25-45.\u003c/li\u003e\n \u003cli\u003eTANG, Jiali; KARIM, Khondkar E, \u0026ldquo;Financial fraud detection and big data analytics\u0026ndash;implications on auditors\u0026rsquo; use of fraud brainstorming session,\u0026rdquo; Managerial Auditing Journal, Vol.34, No.3, 2019, pp.324-337.\u003c/li\u003e\n \u003cli\u003eMehmet, N. A. R., \u0026ldquo;The effects of behavioral economics on tax amnesty,\u0026rdquo; International Journal of Economics and Financial Issues, Vol.5, No.2, 2015, pp.580-589.\u003c/li\u003e\n \u003cli\u003eRandhawa, Kuldeep, \u0026ldquo;Credit card fraud detection using AdaBoost and majority voting,\u0026rdquo; IEEE Access, Vol.6, 2018, pp.14277-14284.\u003c/li\u003e\n \u003cli\u003eOreski, S., \u0026amp; Oreski, G., \u0026ldquo;Genetic algorithm-based heuristic for feature selection in credit risk assessment,\u0026rdquo; Expert Systems with Applications, Vol.41, No.4, 2014, pp.2052-2064.\u003c/li\u003e\n \u003cli\u003eYannelis, Constantine; Tracey, Greg, \u0026ldquo;Student loans and borrower outcomes,\u0026rdquo; Annual Review of Financial Economics, Vol.14, 2022, pp.167-186.\u003c/li\u003e\n \u003cli\u003eMetawa, N., Hassan, M. 
K., \u0026amp; Elhoseny, M., \u0026ldquo;Genetic algorithm based model for optimizing bank lending decisions,\u0026rdquo; Expert Systems with Applications, Vol.80, 2017, pp.75-82.\u003c/li\u003e\n \u003cli\u003eNascimento, Paulo Augusto Meyer Mattos, \u0026ldquo;Modelling income contingent loans for higher education student financing in Brazil,\u0026rdquo; 2019.\u003c/li\u003e\n \u003cli\u003eChapman, B., \u0026amp; Sinning, M., \u0026ldquo;Student loan reforms for German higher education: financing tuition fees,\u0026rdquo; Education Economics, Vol.22, No.6, 2014, pp.569-588.\u003c/li\u003e\n \u003cli\u003eInfant Cyril, G. L., \u0026amp; Ananth, J. P., \u0026ldquo;Deep learning based loan eligibility prediction with Social Border Collie Optimization,\u0026rdquo; Kybernetes, 2022.\u003c/li\u003e\n \u003cli\u003eAkko\u0026ccedil;, S., \u0026ldquo;An empirical comparison of conventional techniques, neural networks and the three-stage hybrid Adaptive Neuro-Fuzzy Inference System (ANFIS) model for credit scoring analysis: The case of Turkish credit card data,\u0026rdquo; European Journal of Operational Research, Vol.222, No.1, 2012, pp.168-178.\u003c/li\u003e\n \u003cli\u003eByanjankar, A., Heikkil\u0026auml;, M., \u0026amp; Mezei, J., \u0026ldquo;Predicting credit risk in peer-to-peer lending: A neural network approach,\u0026rdquo; In 2015 IEEE Symposium Series on Computational Intelligence, December 2015, pp. 
719-725.\u003c/li\u003e\n \u003cli\u003eImran, Imran, \u0026ldquo;Using machine learning algorithms for housing price prediction: the case of Islamabad housing data,\u0026rdquo; Soft Computing and Machine Intelligence, Vol.1, No.1, 2021, pp.11-23.\u003c/li\u003e\n \u003cli\u003eWitten, H.; and Frank, E., \u0026ldquo;Data mining: practical machine learning tools and techniques,\u0026rdquo; San Francisco, CA: Morgan Kaufmann, 2017.\u003c/li\u003e\n \u003cli\u003eSudhamathy G and Jothi Venkateswaran, \u0026ldquo;Analytics Using R for Predicting Credit Defaulters,\u0026rdquo; IEEE international conference on advances in computer applications (ICACA), 2016, pp.66-71.\u003c/li\u003e\n \u003cli\u003eM. Sudhakar, and C.V.K. Reddy, \u0026ldquo;Two Step Credit Risk Assessment Model For Retail Bank Loan Applications Using Decision Tree Data Mining Technique,\u0026rdquo; International Journal of Advanced Research in Computer Engineering \u0026amp; Technology (IJARCET), Vol.5, No.3, 2016, pp.705-718.\u003c/li\u003e\n \u003cli\u003eJ.H. Aboobyda, and M.A. Tarig, \u0026ldquo;Developing Prediction Model Of Loan Risk In Banks Using Data Mining,\u0026rdquo; Machine Learning and Applications: An International Journal (MLAIJ), Vol.3, No.1, 2016, pp.1\u0026ndash;9.\u003c/li\u003e\n \u003cli\u003eK. Kavitha, \u0026ldquo;Clustering Loan Applicants based on Risk Percentage using K-Means Clustering Techniques,\u0026rdquo; International Journal of Advanced Research in Computer Science and Software Engineering, Vol.6, No.2, 2016, pp.162\u0026ndash;166.\u003c/li\u003e\n \u003cli\u003eZ. Somayyeh, and M. Abdolkarim, \u0026ldquo;Natural Customer Ranking of Banks in Terms of Credit Risk by Using Data Mining A Case Study: Branches of Mellat Bank of Iran,\u0026rdquo; Jurnal UMP Social Sciences and Technology Management, Vol.3, No.2, 2015, pp.307\u0026ndash;316.\u003c/li\u003e\n \u003cli\u003eA.B. 
Hussain, and F.K.E. Shorouq, \u0026ldquo;Credit risk assessment model for Jordanian commercial banks: Neural scoring approach,\u0026rdquo; Review of Development Finance, Elsevier, Vol.4, 2014, pp.20\u0026ndash;28.\u003c/li\u003e\n \u003cli\u003eA. Blanco, R. Mejias, J. Lara, and S. Rayo, \u0026ldquo;Credit scoring models for the microfinance industry using neural networks: evidence from Peru,\u0026rdquo; Expert Systems with Applications, Vol.40, 2013, pp.356\u0026ndash;364.\u003c/li\u003e\n \u003cli\u003eT. Harris, \u0026ldquo;Quantitative credit risk assessment using support vector machines: Broad versus Narrow default definitions,\u0026rdquo; Expert Systems with Applications, Vol.40, 2013, pp.4404\u0026ndash;4413.\u003c/li\u003e\n \u003cli\u003eDileep B. Desai, Dr. R.V.Kulkarni, \u0026ldquo;A Review: Application of Data Mining Tools in CRM for Selected Banks,\u0026rdquo; International Journal of Computer Science and Information Technologies (IJCSIT), Vol.4, No.2, 2013, pp.199-201.\u003c/li\u003e\n \u003cli\u003eG. Francesca, \u0026ldquo;A Discrete-Time Hazard Model for Loans: Some Evidence from Italian Banking System,\u0026rdquo; American Journal of Applied Sciences, Vol.9, No.9, 2012, pp.1337\u0026ndash;1346.\u003c/li\u003e\n \u003cli\u003eRafiqul, I; and Ahsan H., \u0026ldquo;A data mining approach to predict prospective business sector for lending in retail banking using decision tree,\u0026rdquo; International Journal of Data Mining \u0026amp; Knowledge Management Process, Vol.5, No.2, 2015, pp.13-18.\u003c/li\u003e\n \u003cli\u003eJafar, A. Mohammed T., \u0026ldquo;Developing prediction model of loan risk in banks using data mining,\u0026rdquo; Machine Learning and Applications: An International Journal, Vol.3, No.1, 2016, pp.1-8.\u003c/li\u003e\n \u003cli\u003eF. Carcillo, Y.-A. Le Borgne, O. Caelen, Y. Kessaci, F. Oble, and G. 
Bontempi, \u0026ldquo;Combining unsupervised and supervised learning in credit card fraud detection,\u0026rdquo; Information Sciences, Vol.557, 2021, pp.317\u0026ndash;331.\u003c/li\u003e\n \u003cli\u003eS. Bagga, A. Goyal, N. Gupta, and A. Goyal, \u0026ldquo;Credit card fraud detection using pipeling and ensemble learning,\u0026rdquo; Procedia Computer Science, Vol.173, 2020, pp.104-112.\u003c/li\u003e\n \u003cli\u003eXuan, Shiyang, \u0026ldquo;Random forest for credit card fraud detection,\u0026rdquo; In: 2018 IEEE 15th international conference on networking, sensing and control (ICNSC). IEEE, 2018, pp.1-6.\u003c/li\u003e\n \u003cli\u003eJana, Dipak Kumar, \u0026ldquo;Optimization of effluents using artificial neural network and support vector regression in detergent industrial wastewater treatment,\u0026rdquo; Cleaner Chemical Engineering, Vol.3, 2022, pp.100039.\u003c/li\u003e\n \u003cli\u003eKm, Anil Kumar, \u0026ldquo;Detection of False Income Level Claims Using Machine Learning,\u0026rdquo; International Journal of Modern Education \u0026amp; Computer Science, Vol.14, No.1, 2022.\u003c/li\u003e\n \u003cli\u003eYao, S., \u0026ldquo;Gradient boosted decision trees for combustion chemistry integration,\u0026rdquo; Applications in Energy and Combustion Science, Vol.11, 2022, pp.100077.\u003c/li\u003e\n \u003cli\u003eAbhaya, Abhaya; PATRA, Bidyut Kr., \u0026ldquo;An efficient method for autoencoder based outlier detection,\u0026rdquo; Expert Systems with Applications, Vol.213, 2023, pp.118904.\u003c/li\u003e\n \u003cli\u003eFahd Sabry Esmail, Fahad Kamal Alsheref, and Amal Elsayed Aboutabl, \u0026ldquo;Enhancing loan fraud detection process in the banking sector using data mining techniques,\u0026rdquo; Indonesian Journal of Electrical Engineering and Computer Science, Vol.32, No. 
2, 2023.\u003c/li\u003e\n\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":true,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"Loan Fraud Detection, Neural Network, Deep Learning, Autoencoder, Gradient, Boosting, Data Mining Techniques","lastPublishedDoi":"10.21203/rs.3.rs-7358678/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-7358678/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eLoan fraud has been a permanent challenge for the financial sector. It is crucial to ensure the stability of economic and customer trust. Predicting loan fraud is extremely important to eliminate the possibility of the occurrence of crises like the subprime mortgage crisis in 2008. Moreover, the immense number of loan applicants makes it impossible for employees to perform this task manually, especially considering the different number of parameters that need to be investigated. This paper proposes automated artificial intelligence models to detect loan fraud through predictive analysis techniques, with an emphasis on the integration of neural networks and deep learning techniques. 
The approach combines the autoencoder-based architecture with gradient boosting to ensure the detection of fraud activities. The model was applied to an online dataset from Kaggle, which contains 100,000 credit loan transactions. The model achieves optimal accuracy in fraud detection. This emphasizes the effectiveness of combining deep learning techniques with the autoencoder-based architecture for fraud detection. Additionally, the promising results present the effectiveness of the dimensionality reduction techniques of feature space to enhance the accuracy of the proposed models.\u003c/p\u003e","manuscriptTitle":"Optimizing Banking Sector Loan Fraud Detection through Machine Learning Methods","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2025-10-03 13:06:15","doi":"10.21203/rs.3.rs-7358678/v1","editorialEvents":[{"type":"communityComments","content":0}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"de411408-8e5c-48fa-9658-425b51667d9b","owner":[],"postedDate":"October 3rd, 2025","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"posted","subjectAreas":[{"id":55661368,"name":"Physical sciences/Engineering"},{"id":55661369,"name":"Physical sciences/Mathematics and computing"}],"tags":[],"updatedAt":"2026-02-10T11:59:35+00:00","versionOfRecord":[],"versionCreatedAt":"2025-10-03 
13:06:15","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-7358678","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-7358678","identity":"rs-7358678","version":["v1"]},"buildId":"8U1c8b4HqxoKbykW_rLl7","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}

