Machine Learning-Based Cardiovascular Disease Prediction: Comparative Analysis of SMOTE Impact on Imbalanced Healthcare Data

Research Article

Md. Saon Sikder, Engr. Md. Emad Uddin Aksir

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-7428299/v1
This work is licensed under a CC BY 4.0 License.
Status: Posted (Version 1)

Abstract

Cardiovascular disease (CVD) constitutes the primary global mortality cause, affecting 18 million individuals annually. Machine learning approaches for CVD prediction face significant challenges due to inherent class imbalance in healthcare datasets, where disease-positive cases are substantially underrepresented, leading to biased model performance favoring majority classes.
This comprehensive study evaluated ten machine learning algorithms, including Random Forest, Support Vector Machine, XGBoost, and ensemble methods, on the Behavioral Risk Factor Surveillance System (BRFSS) dataset containing 308,070 patient records. The Boruta algorithm identified optimal feature subsets, while RandomizedSearchCV performed hyperparameter optimization. Model performance was assessed both on the original imbalanced data and after applying the Synthetic Minority Over-sampling Technique (SMOTE) for class balancing. The original imbalanced dataset yielded high overall accuracies (~92%) but severely compromised minority class detection (F1-scores: 0.00-0.28). SMOTE dramatically enhanced minority class performance: the Stacking ensemble achieved the best results, with 94.49% accuracy and a 0.94 F1-score for CVD-positive cases. Ensemble methods adapted better to class balancing than linear algorithms, which showed substantial performance degradation. SMOTE effectively mitigates class imbalance challenges in cardiovascular disease prediction, significantly improving minority class detection while preserving overall model accuracy, and establishes ensemble methods as well suited to imbalanced healthcare applications.

Keywords: cardiovascular disease prediction, machine learning, SMOTE, class imbalance, ensemble methods, healthcare analytics

Introduction

Cardiovascular disease (CVD) is the leading cause of death worldwide according to the World Health Organization (WHO). Each year, CVD leads to around 18 million deaths, about 31% of all deaths globally [1]. Cardiovascular disease covers a range of conditions affecting the heart and vascular system, which in most cases compromise a number of anatomical or physiological elements.
The clinical manifestations are varied: some patients show distinct symptoms while others have none, which makes timely diagnosis and intervention difficult. The category includes vascular stenosis in the coronary, peripheral, or systemic circulation; cardiac or vascular malformations; structural defects of the valves; and disorders of the heart's electrical conduction that manifest as arrhythmias [2]. CVD has no single cause, because it is a multifaceted disease in which many risk factors interact. Most of these risk factors are closely tied to an individual's lifestyle and behavior and can be modified to some extent. If these parameters are detected early, an individual's probability of developing CVD can be assessed in advance, so that intervention to reduce the risk can begin early. Age, general health condition, amount of physical activity, history of smoking and alcohol consumption, and comorbidities such as diabetes are commonly identified as essential factors in predictive modeling of CVD. Clinicians tend to begin with simple assessments of a patient's family history and lifestyle factors in order to understand the type of cardiac risk the patient faces. This procedure can be time-consuming and costly, which is why a growing number of authors propose applying machine learning and data analytics to detect heart disease at an earlier stage. Such tools can support physicians' care decisions by identifying warning signs well in advance [3], [4].
This would reduce the possibility of misdiagnosis and allow physicians to intervene before it is too late for the patient [5].

The prediction of cardiovascular disease (CVD) using machine learning (ML) and deep learning (DL) techniques has gained significant attention in recent years due to the rising global burden of heart-related illnesses. With advancements in data analytics and healthcare technology, researchers have focused on developing models that leverage clinical, demographic, and lifestyle data to enhance early detection and improve patient outcomes [6], [7]. The literature spans a diversity of approaches, from traditional ML classifiers to advanced DL architectures, each contributing to improved CVD prediction. One study explored seven ML classifiers, including Random Forest (RF), Logistic Regression (LR), and Support Vector Machine (SVM), using Kaggle and IEEE DataPort datasets, with RF achieving the highest accuracy of 91.67% after hyperparameter tuning with RandomizedSearchCV [8]. Another paper proposed a hybrid ensemble model combining RF, Gradient Boosting, and Gaussian Naive Bayes, reporting an accuracy of 89.5% with improved precision and recall through ensemble voting [9]. Similarly, a convolutional neural network (CNN) model with multimodal data integration achieved 94.8% accuracy, outperforming unimodal CNNs due to its faster convergence and robust feature extraction [10]. Ensemble techniques such as bagging and stacking were also investigated, with XGBoost achieving a standout accuracy of 92.72% for myocardial infarction prediction, highlighting the effectiveness of gradient boosting methods [10].

Several studies emphasized the role of feature selection and hyperparameter optimization. A paper applied k-modes clustering with Huang initialization alongside RF and XGBoost, achieving 90.8% accuracy on the Cleveland dataset, underscoring the importance of clustering for data preprocessing [11].
Another work used Harris Hawks Optimization (HHO) with Artificial Neural Networks (ANNs), reporting 88.9% accuracy and improved F1-scores through optimized feature selection [12]. A systematic review and meta-analysis of ML models using electronic health records (EHRs) found that SVM and RF consistently outperformed traditional risk scores like QRISK3, with accuracies ranging from 85–90% and enhanced recall through feature engineering [8]. Deep learning models such as Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU) were compared in a recent study, with the hybrid CNN-LSTM model achieving 92% accuracy and using SHAP for interpretability [13].

Some papers focused on specific risk factors or novel datasets. One study highlighted Body Mass Index (BMI) as a critical predictor, using RF and LR to achieve 87.5% accuracy, with GridSearchCV for hyperparameter tuning [14]. Another work used reinforcement learning to predict heart disease, achieving 86% accuracy with a focus on dynamic feature weighting [15]. A paper on wearable IoT devices and ML reported accuracies above 90% with neural networks, emphasizing real-time monitoring capabilities [16]. The use of blockchain for secure data sharing in CVD prediction was also explored, with DL models achieving 89% accuracy [17].

Other studies addressed specific CVD subtypes or innovative methods. A paper on ischemic heart disease used ML classifiers such as SVM and RF, achieving 88% accuracy with feature ranking techniques [18]. A Satin Bowerbird Optimization-based DL model for e-healthcare applications reported 90.5% accuracy, focusing on scalability in real-world settings [17].

Methodology

To complete this study, we followed a methodology involving several key stages. We began with data management and preprocessing to ensure the quality and readiness of our dataset.
This was followed by feature engineering and hyperparameter tuning to optimize our models, and then by model implementation and a thorough performance evaluation.

Data Collection and Description

The dataset was collected from the Behavioral Risk Factor Surveillance System (BRFSS) through the Kaggle website, which hosts the dataset published in 2021 [19]. The dataset comprises 308,070 rows and 17 columns and encompasses a variety of health-related attributes and potential risk factors associated with cardiovascular disease among U.S. residents. The features are described in Table 1.

Table 1 Description of the dataset features

| Feature | Explanation |
|---|---|
| Age_Category | Age in completed years, from 18 to 80+ |
| General_Health | Poor, Fair, Good, Very Good, Excellent |
| Exercise | Yes or No (Yes = 1, No = 0) |
| Skin_Cancer | Yes or No (Yes = 1, No = 0) |
| Other_Cancer | Yes or No (Yes = 1, No = 0) |
| Depression | Yes or No (Yes = 1, No = 0) |
| Arthritis | Yes or No (Yes = 1, No = 0) |
| Sex | Female or Male (Female = 0, Male = 1) |
| Height_(cm) | Height in centimeters |
| Weight_(kg) | Weight in kilograms |
| BMI | Body mass index (floating-point value) |
| Smoking_History | Yes or No (Yes = 1, No = 0) |
| Alcohol_Consumption | Alcohol consumption in milliliters |
| Green_Vegetables_Consumption | Green vegetables in grams |
| Fruit_Consumption | Fruits in grams |
| Diabetes | No = 0; No, pre-diabetes or borderline diabetes = 1; Yes = 2; Yes, but female told only during pregnancy = 3 |
| Cardio_Disease | Yes or No (Yes = 1, No = 0) |

Data Pre-processing

To obtain the best results from the dataset, the data must be cleaned, transformed into an appropriate format, and structured for model use; this process is referred to as data pre-processing in machine learning. We first checked for null values and found none. We did, however, identify several duplicate records, which were removed.
Most machine learning models cannot process categorical or text-based data directly. We therefore used Label Encoding to convert the categorical columns into numerical values, as shown in Fig. 1.

Data Balancing using SMOTE

A significant problem in medical datasets, especially those used for disease classification, is imbalanced class distribution, in which one class (e.g., absence of disease) substantially outnumbers the other (e.g., presence of disease). In this study, the target variable "Cardio_Disease" was heavily imbalanced, which affected the fairness and accuracy of the machine learning models, especially in detecting minority-class cases. The primary dataset comprised 283,883 instances of 'No' and 24,971 instances of 'Yes', a strong skew towards the negative class. To address this, we used the Synthetic Minority Over-sampling Technique (SMOTE) [20]. SMOTE mitigates class imbalance by creating synthetic samples for the minority class instead of replicating existing instances. Synthetic samples are generated by interpolating between existing minority-class examples, which adds variability and reduces the overfitting risk that comes with exact duplication. After SMOTE, the dataset has 283,103 instances of 'No' and 226,482 instances of 'Yes' for 'Cardio_Disease' (a minority-to-majority ratio of about 0.8), the far more balanced distribution shown in Fig. 2.

The complete machine learning pipeline was run twice: once on the original imbalanced dataset and once on the SMOTE-balanced dataset. This dual-path methodology enabled an extensive comparison of performance measures before and after the use of SMOTE, which is the cornerstone of our analysis.

Feature Engineering

Feature engineering involves many strategies designed to enhance model performance through the optimization of input data.
In this study, we employed two principal feature engineering techniques: correlation analysis and wrapper-based feature selection.

Correlation Matrix Analysis

A Pearson correlation matrix was computed to examine the linear correlations among the features [21], as depicted in Fig. 3. Its values range from −1 (perfect negative correlation) to +1 (perfect positive correlation); a value close to zero signifies little or no linear correlation. The analysis showed a strong positive association between Weight (kg) and BMI (r = 0.86), a moderate correlation between Height (cm) and Weight (kg) (r ≈ 0.47), and a correlation between Height (cm) and Sex (r ≈ 0.70). Such elevated correlations indicate possible multicollinearity, which matters especially for linear models. While tree-based models handle correlated features more effectively, care was taken to keep substantially redundant features from compromising model interpretability. Although some features had minimal correlation with the target variable Cardio_Disease, they were not eliminated at this stage, because a low Pearson correlation does not by itself indicate a lack of predictive power, particularly in non-linear models. Subsequent feature selection was conducted with the Boruta algorithm, which evaluates feature importance with respect to model performance.

Feature Selection with Boruta Algorithm

The Boruta feature selection method was used to find the most important features for predicting cardiovascular disease. Boruta is a wrapper method built around a Random Forest classifier. It creates shadow features, which are randomly shuffled copies of the original features, and compares their importance scores with those of the real features. A feature is deemed important when it consistently scores higher than the best-scoring shadow feature.
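The shadow-feature comparison described above can be illustrated with a minimal sketch. This is not the BorutaPy implementation used in the study: it omits Boruta's statistical test and iterative feature removal, and simply counts how often each real feature beats the best shadow feature. The function name `shadow_feature_hits`, the toy data, and all parameter values are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def shadow_feature_hits(X, y, n_rounds=10, seed=0):
    """Count, per feature, how often its importance beats the best shadow feature."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    hits = np.zeros(n_features, dtype=int)
    for _ in range(n_rounds):
        # Shadow features: each original column shuffled independently, which
        # keeps its marginal distribution but breaks any link to the label.
        shadows = np.column_stack(
            [rng.permutation(X[:, j]) for j in range(n_features)]
        )
        forest = RandomForestClassifier(
            n_estimators=100, random_state=int(rng.integers(1_000_000))
        )
        forest.fit(np.hstack([X, shadows]), y)
        importances = forest.feature_importances_
        best_shadow = importances[n_features:].max()
        # A real feature scores a "hit" when it outranks every shadow feature.
        hits += importances[:n_features] > best_shadow
    return hits


# Toy data (hypothetical): column 0 drives the label, columns 1-3 are noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(400, 4))
y = (X[:, 0] + 0.3 * rng.normal(size=400) > 0).astype(int)
hits = shadow_feature_hits(X, y)
print(hits)
```

In this toy setting the informative column collects a hit in every round, while the noise columns only occasionally outrank the shadows; the full algorithm turns such hit counts into accept or reject decisions with a statistical test.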
Figure 4 shows that the Boruta analysis identified Age_Category, Diabetes, Arthritis, General_Health, and Smoking_History among the most important predictors. Conversely, Fruit_Consumption, Green_Vegetables_Consumption, and Depression showed negligible importance and were therefore omitted from the final model input. This process ensured that the models were trained using only statistically significant variables, reducing the dimensionality of the dataset and potentially enhancing model performance and generalization.

Train-Test Split

To evaluate model performance, the preprocessed and feature-selected dataset was divided into training and testing sets in an 80:20 ratio. Stratified sampling was used to maintain the class distribution in both sets. The models are thus trained on one subset of the data and assessed on unseen instances, which exposes overfitting and gives a more reliable estimate of generalizability.

Hyperparameter Tuning

Hyperparameter optimization is essential for enhancing the efficacy and generalizability of machine learning models. In this study, we used RandomizedSearchCV, a randomized search algorithm that samples from given hyperparameter distributions over a fixed number of iterations. Unlike exhaustive grid search, randomized search is computationally manageable yet still examines an extensive configuration space [22]. For each model, a randomized search was conducted with 5-fold cross-validation, using classification accuracy as the evaluation metric. This enabled rigorous testing of model performance across the parameter settings shown in Table 2. The search space for each algorithm was drawn from existing work and best practice in clinical predictive modeling.
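The stratified 80:20 split and the randomized search with 5-fold cross-validation can be sketched as follows. This is an illustrative sketch, not the study's code: the synthetic data, the reduced search space, `n_iter=5`, and every `random_state` value are assumptions. The `'auto'` option for `max_features` is left out because recent scikit-learn releases no longer accept it.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in data (hypothetical), roughly a 92% / 8% class split.
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.92, 0.08], random_state=0
)

# 80:20 split, stratified so both parts keep the class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Reduced search space (assumed), loosely modeled on a Random Forest search.
param_distributions = {
    "n_estimators": [50, 100],
    "min_samples_split": randint(2, 20),
    "min_samples_leaf": randint(1, 20),
    "max_features": ["sqrt", "log2"],
    "bootstrap": [True, False],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,            # number of sampled configurations (assumed value)
    cv=5,                # 5-fold cross-validation, as described above
    scoring="accuracy",  # the selection metric named in the text
    random_state=0,
)
search.fit(X_train, y_train)

print(search.best_params_)
print(round(search.score(X_test, y_test), 3))
```

Because selection here is driven by accuracy, a search on imbalanced data can favor configurations that mostly predict the majority class; a scorer such as `"f1"` is one alternative worth considering when minority-class detection is the goal.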
Table 2 Parameters used for each model in the Without SMOTE and With SMOTE methods

| Model | Parameters |
|---|---|
| Random Forest | 'n_estimators': [400, 500, 600], 'min_samples_split': randint(2, 20), 'min_samples_leaf': randint(1, 20), 'max_features': ['auto', 'sqrt'], 'bootstrap': [True, False] |
| Support Vector Machine | 'C': [0.1, 1, 10, 100], 'gamma': [1, 0.1, 0.01, 0.001], 'kernel': ['rbf', 'poly', 'sigmoid'] |
| Gaussian Naive Bayes | 'priors': [None, [0.3, 0.7], [0.6, 0.4]], 'var_smoothing': [1e-9, 1e-8, 1e-7, 1e-6, 1e-5] |
| Logistic Regression | 'C': [0.001, 0.01, 0.1, 1, 10, 100], 'penalty': [None, 'l1', 'l2', 'elasticnet'], 'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'], 'max_iter': [100, 200, 300], 'multi_class': ['auto', 'ovr', 'multinomial'], 'class_weight': ['balanced', None], 'warm_start': [True, False] |
| Decision Tree | 'criterion': ['gini', 'entropy'], 'splitter': ['best', 'random'], 'max_depth': [None, 10, 20, 30], 'max_features': ['auto', 'sqrt', 'log2'], 'min_samples_split': [2, 5, 6, 10], 'min_samples_leaf': [1, 2, 4], 'class_weight': ['balanced', None] |
| Adaboosting | 'n_estimators': [50, 100, 200], 'learning_rate': [0.01, 0.1, 1.0], 'algorithm': ['SAMME', 'SAMME.R'], 'random_state': [42] |
| K-Nearest Neighbor | 'n_neighbors': [3, 5, 7, 9], 'weights': ['uniform', 'distance'], 'metric': ['euclidean', 'manhattan', 'minkowski', 'Hamming'] |
| XGBoost | 'n_estimators': [100, 200, 300], 'learning_rate': [0.01, 0.05, 0.1, 0.2], 'max_depth': [3, 4, 5, 6], 'min_child_weight': [1, 2, 3], 'subsample': [0.8, 0.9, 1.0], 'colsample_bytree': [0.7, 0.8, 0.9] |
| Gradient Boosting | 'n_estimators': [100, 200, 300, 400, 500], 'learning_rate': [0.01, 0.05, 0.1, 0.2, 0.3], 'max_depth': randint(3, 10), 'min_samples_split': randint(2, 20), 'min_samples_leaf': randint(1, 20), 'subsample': [0.8, 0.9, 1.0], 'max_features': ['auto', 'sqrt', None] |

In the Stacking model, the Random Forest, Support Vector Machine, Adaboosting, K-Nearest Neighbor, and XGBoost models served as base_learners, with a Linear Regression model as the final_estimator.

To assess computational efficiency, the total training time of each model's hyperparameter tuning process was recorded through time profiling. This made it possible to evaluate not only model accuracy but also training cost in terms of execution time, which is an essential consideration in real-world applications.

Experimental Results and Discussions

This section describes and analyzes the experimental findings of the proposed study, which evaluates multiple machine learning models on an unbalanced dataset both with and without the Synthetic Minority Over-sampling Technique (SMOTE). The study's primary goal is to assess how SMOTE affects model performance in terms of accuracy and execution time. It also examines how feature selection and hyperparameter optimization can improve classification outcomes. The study compares the algorithms under controlled experimental conditions, offering insight into how data balancing and optimization strategies affect predictive performance, which is essential for building robust and reliable machine learning solutions for real-world imbalanced classification problems. Figure 5 depicts the proposed experimental workflow.

Environmental Setup

The experiments in this study were performed on a desktop computer with an AMD Ryzen 5 5600G processor (6 cores, 12 threads), 32 GB of RAM, and a 250 GB SSD. The implementation was written in Python, with development conducted in Jupyter Notebook within the Visual Studio Code (VS Code) environment. A variety of Python libraries and modules were used for data handling, preprocessing, model development, and performance measurement.
The tools included pandas and numpy for data manipulation, matplotlib and seaborn for visualization, and scikit-learn for machine learning tasks including model training, evaluation, and hyperparameter optimization. RandomizedSearchCV was used for hyperparameter optimization, and BorutaPy for feature selection. The scikit-learn-intelex patch (imported via sklearnex) was applied to speed up certain model computations. Python's time package was used to time each trial.

Optimal Hyperparameters

Table 2 lists the initial search space for each model, which does not by itself identify the best settings. Table 3 lists the optimal hyperparameters found for each model, identified separately for the with-SMOTE and without-SMOTE conditions.

Table 3 Optimal Hyperparameters with and without SMOTE of each Model

| Model | Optimal Hyperparameters (Without SMOTE) | Optimal Hyperparameters (With SMOTE) |
|---|---|---|
| Random Forest | 'bootstrap': True, 'max_features': 'sqrt', 'min_samples_leaf': 12, 'min_samples_split': 8, 'n_estimators': 500 | 'bootstrap': False, 'max_features': 'sqrt', 'min_samples_leaf': 3, 'min_samples_split': 6, 'n_estimators': 600 |
| Support Vector Machine | 'kernel': 'sigmoid', 'gamma': 0.1, 'C': 0.1 | 'kernel': 'rbf', 'gamma': 0.1, 'C': 10 |
| Gaussian Naive Bayes | 'var_smoothing': 1e-05, 'priors': None | 'var_smoothing': 1e-05, 'priors': None |
| Logistic Regression | 'warm_start': False, 'solver': 'lbfgs', 'penalty': None, 'multi_class': 'ovr', 'max_iter': 200, 'class_weight': None, 'C': 1 | 'warm_start': False, 'solver': 'lbfgs', 'penalty': None, 'multi_class': 'multinomial', 'max_iter': 300, 'class_weight': 'balanced', 'C': 1 |
| Decision Tree | 'splitter': 'random', 'min_samples_split': 2, 'min_samples_leaf': 2, 'max_features': 'log2', 'max_depth': 10, 'criterion': 'entropy', 'class_weight': None | 'splitter': 'best', 'min_samples_split': 6, 'min_samples_leaf': 1, 'max_features': 'sqrt', 'max_depth': 30, 'criterion': 'entropy', 'class_weight': 'balanced' |
| Adaboosting | 'random_state': 42, 'n_estimators': 50, 'learning_rate': 0.01, 'algorithm': 'SAMME' | 'random_state': 42, 'n_estimators': 200, 'learning_rate': 1.0, 'algorithm': 'SAMME' |
| K-Nearest Neighbor | 'weights': 'uniform', 'n_neighbors': 9, 'metric': 'manhattan' | 'weights': 'uniform', 'n_neighbors': 3, 'metric': 'manhattan' |
| XGBoost | 'subsample': 0.8, 'n_estimators': 100, 'min_child_weight': 2, 'max_depth': 6, 'learning_rate': 0.05, 'colsample_bytree': 0.7 | 'subsample': 0.8, 'n_estimators': 300, 'min_child_weight': 1, 'max_depth': 6, 'learning_rate': 0.2, 'colsample_bytree': 0.8 |
| Gradient Boosting | 'learning_rate': 0.01, 'max_depth': 6, 'max_features': None, 'min_samples_leaf': 1, 'min_samples_split': 6, 'n_estimators': 400, 'subsample': 0.8 | 'learning_rate': 0.2, 'max_depth': 9, 'max_features': None, 'min_samples_leaf': 17, 'min_samples_split': 7, 'n_estimators': 500, 'subsample': 0.9 |

Results Without SMOTE

In this part, we report the performance metrics of all selected models after training on the original imbalanced dataset. Each model was evaluated on its accuracy, precision, recall, and F1-score. The objective is to assess the models' performance on the data before any resampling or balancing technique is applied. The outcomes shown in Table 4 come from the best hyperparameter combinations found by the randomized search.
Table 4 Performance Metrics of Models Without SMOTE

| Model Name | Accuracy (%) | Precision (0) | Precision (1) | Recall (0) | Recall (1) | F1-Score (0) | F1-Score (1) |
|---|---|---|---|---|---|---|---|
| Random Forest | 92.03 | 0.92 | 0.56 | 1.00 | 0.02 | 0.96 | 0.04 |
| Support Vector Machine | 91.99 | 0.92 | 0.00 | 1.00 | 0.00 | 0.96 | 0.00 |
| Gaussian Naive Bayes | 85.19 | 0.94 | 0.23 | 0.89 | 0.36 | 0.92 | 0.28 |
| Logistic Regression | 91.99 | 0.92 | 0.00 | 1.00 | 0.00 | 0.96 | 0.00 |
| Decision Tree | 91.94 | 0.92 | 0.39 | 1.00 | 0.01 | 0.96 | 0.02 |
| Adaboosting | 91.99 | 0.92 | 0.00 | 1.00 | 0.00 | 0.96 | 0.00 |
| K-Nearest Neighbor | 91.85 | 0.92 | 0.32 | 1.00 | 0.02 | 0.96 | 0.03 |
| XGBoost | 92.04 | 0.92 | 0.55 | 1.00 | 0.03 | 0.96 | 0.05 |
| Gradient Boosting | 92.01 | 0.92 | 0.52 | 1.00 | 0.03 | 0.96 | 0.05 |
| Stacking | 91.86 | 0.93 | 0.46 | 0.99 | 0.10 | 0.96 | 0.16 |

Although the majority of models attain a high overall accuracy (~92%), their ability to identify the minority class (1) is markedly inadequate, as shown by the low class 1 precision, recall, and F1-score values. This is a characteristic sign of imbalanced data, in which models become biased towards the dominant class. The Random Forest model achieved the highest class 1 precision (0.56, at 92.03% accuracy), but with a class 1 recall of only 0.02 it still lacks balanced efficacy. This motivates the subsequent application of SMOTE to balance the dataset and improve minority class identification.

Results With SMOTE

After SMOTE was applied to balance the class distribution within the dataset, each machine learning model was re-trained and assessed within the same experimental framework. The principal objective of employing SMOTE was to improve the detection of minority class instances, which were markedly underrepresented in the initial dataset. Table 5 summarizes the performance metrics (accuracy, precision, recall, and F1-score) for both classes (0 and 1) across all models trained with SMOTE.
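As an illustration of the resampling step that precedes this retraining, the interpolation idea behind SMOTE can be sketched in plain Python. This is a simplified sketch rather than the implementation used in the study: a library version, such as the `SMOTE` class in imbalanced-learn, would normally be used in practice. The function name `smote_like_oversample`, the toy points, and the values of `k` and `n_new` are assumptions.

```python
import math
import random


def smote_like_oversample(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority-class neighbours of the chosen point (excluding itself).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        # New point = base + gap * (neighbour - base), coordinate by coordinate.
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy minority class (hypothetical values): four points in two dimensions.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_like_oversample(minority, n_new=6, k=2)
for point in new_points:
    print(point)
```

Each synthetic point lies on a line segment between two existing minority samples, so the new samples stay inside the region the minority class already occupies instead of duplicating records exactly.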
Table 5 Performance Metrics of Models With SMOTE

| Model Name | Accuracy (%) | Precision (0) | Precision (1) | Recall (0) | Recall (1) | F1-Score (0) | F1-Score (1) |
|---|---|---|---|---|---|---|---|
| Random Forest | 89.50 | 0.91 | 0.88 | 0.90 | 0.88 | 0.91 | 0.88 |
| Support Vector Machine | 59.30 | 0.58 | 0.82 | 0.98 | 0.11 | 0.73 | 0.19 |
| Gaussian Naive Bayes | 75.21 | 0.79 | 0.71 | 0.75 | 0.75 | 0.77 | 0.73 |
| Logistic Regression | 74.24 | 0.79 | 0.69 | 0.72 | 0.77 | 0.76 | 0.73 |
| Decision Tree | 83.59 | 0.88 | 0.79 | 0.82 | 0.86 | 0.85 | 0.82 |
| Adaboosting | 80.49 | 0.83 | 0.78 | 0.82 | 0.79 | 0.82 | 0.78 |
| K-Nearest Neighbor | 89.38 | 0.99 | 0.81 | 0.82 | 0.99 | 0.90 | 0.89 |
| XGBoost | 88.00 | 0.89 | 0.87 | 0.90 | 0.95 | 0.89 | 0.86 |
| Gradient Boosting | 93.23 | 0.92 | 0.94 | 0.96 | 0.90 | 0.94 | 0.92 |
| Stacking | 94.49 | 0.96 | 0.93 | 0.94 | 0.95 | 0.95 | 0.94 |

The results indicate that applying SMOTE significantly improved the precision, recall, and F1-score for class 1 in the majority of models. Models such as Random Forest, Gradient Boosting, and Stacking attained balanced precision and recall for both classes, signifying strong performance on the balanced dataset. The Stacking model outperformed all others [23], [24], [25], achieving an F1-score of 0.94 for the minority class and an overall accuracy of 94.49%. Gradient Boosting likewise performed robustly across all criteria. Compared with the results obtained without SMOTE, these findings indicate that class balancing improved the models' generalization, especially in recognizing minority class instances. Nonetheless, SVM identified class 1 poorly, suggesting that SMOTE alone may not suffice for specific algorithms without additional tuning or feature optimization.

Comparative Analysis: With vs Without SMOTE

The primary objective is to evaluate how SMOTE affects each model's ability to detect the minority class by comparing accuracy, precision, recall, and F1-score before and after SMOTE was applied. This section also discusses the trade-offs in performance and time cost that come with class balancing.
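The metrics compared in this section follow directly from confusion-matrix counts, and a short sketch shows why accuracy alone can mislead on imbalanced data. The function name `per_class_metrics` is an assumption, and the example scores a hypothetical classifier that labels every record 'No' against the full-dataset class counts quoted in the Data Balancing section, not against the study's actual test split.

```python
def per_class_metrics(tp, fp, fn, tn):
    """Accuracy plus precision, recall and F1 for the positive class."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return accuracy, precision, recall, f1


# Hypothetical classifier that predicts 'No' for everyone, scored on the
# class counts quoted earlier (283,883 'No', 24,971 'Yes').
always_no = per_class_metrics(tp=0, fp=0, fn=24_971, tn=283_883)
print([round(value, 4) for value in always_no])
```

Such a classifier reaches roughly 92% accuracy while its minority-class precision, recall, and F1-score are all zero, the same pattern Table 4 shows for several models trained without SMOTE.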
After SMOTE was applied, a clear trend emerged across practically all models: the recall and F1-score for class 1 (the minority class) improved significantly, as depicted in Fig. 6. With SMOTE, Stacking had the highest F1-score (0.94) and accuracy (94.49%) of all the models, ahead of Gradient Boosting, XGBoost, and K-Nearest Neighbor. Before SMOTE, models like Random Forest and KNN had good accuracy but low recall and F1-score for class 1, indicating bias toward the majority class. With SMOTE these models become more balanced, gaining recall without compromising precision. Figure 7 presents a side-by-side comparison of the classification accuracies of models trained with and without SMOTE. It clearly shows that SMOTE affected the accuracy of every model. While some models, like Support Vector Machine and Logistic Regression, show a significant drop in accuracy after applying SMOTE, others, such as Gradient Boosting and Stacking, maintain high performance or even improve. The Support Vector Machine continued to perform poorly even after SMOTE, as evidenced by its low recall for class 1 (0.11). This suggests that SMOTE may not be sufficient for certain linear models or may require additional refinement, whereas ensemble-based models benefited the most from SMOTE.

Execution Time Comparison

Table 6 shows the training durations for each model, with and without SMOTE. Applying SMOTE increased training time for every model, reflecting the overhead of generating synthetic samples and of training on the enlarged dataset; Gradient Boosting and Support Vector Machine showed the largest absolute increases.
Table 6 Execution Time Comparison of Machine Learning Models

| Model Name | Without SMOTE (sec) | With SMOTE (sec) |
|---|---|---|
| Random Forest | 767.63 | 1650.69 |
| Support Vector Machine | 3035.50 | 6455.76 |
| Gaussian Naive Bayes | 6.78 | 10.31 |
| Logistic Regression | 1088.55 | 2336.80 |
| Decision Tree | 45.15 | 98.28 |
| Adaboosting | 464.39 | 1022.00 |
| K-Nearest Neighbor | 606.35 | 1588.47 |
| XGBoost | 210.78 | 347.63 |
| Gradient Boosting | 2348.31 | 23485.31 |
| Stacking | 175.85 | 1127.75 |

Conclusion

This study set out to build a model that predicts cardiovascular disease at an early stage from a set of predefined parameters. It used information on 308,070 patients after data cleaning and encoding. The Boruta algorithm was applied to determine which features are most impactful. Data imbalance plays a pivotal role in model accuracy, and we applied SMOTE to balance the data. All models were trained both with and without SMOTE, and we found a striking difference between the two settings. When trained without SMOTE, several models showed very low precision, recall, and F1-score. Although the accuracies were higher, they are not well justified, because accuracy alone is not a sufficient metric. For the majority class, every model scored close to 1. For the minority class, however, the precision, recall, and F1-score were (0.56, 0.02, 0.04), (0.23, 0.36, 0.28), (0.32, 0.03, 0.02), (0.39, 0.01, 0.02), (0.52, 0.03, 0.05), and (0.52, 0.01, 0.02) for Random Forest, Gaussian Naive Bayes, Extreme Gradient Boosting, Decision Tree, Gradient Boosting, and Stacking, respectively, and, most strikingly, (0.00, 0.00, 0.00) for SVM, LR, and Adaboosting. Despite these misleading metrics, XGB, RF, and GB reached accuracies of 92.04%, 92.03%, and 92.01%, and every model except GNB exceeded 91% accuracy. After the data imbalance was addressed with SMOTE, on the other hand, we observed a significant improvement in almost every model.
Although accuracy dropped sharply for some models, the minority-class precision, recall, and F1-score rose to (0.88, 0.88, 0.88) for Random Forest, (0.71, 0.75, 0.73) for Gaussian Naive Bayes, (0.91, 0.87, 0.89) for Extreme Gradient Boosting, (0.79, 0.85, 0.82) for Decision Tree, (0.94, 0.90, 0.92) for Gradient Boosting, (0.92, 0.94, 0.93) for Stacking, (0.82, 0.11, 0.19) for SVM, (0.69, 0.77, 0.73) for LR, and (0.79, 0.80, 0.80) for Adaboosting. With SMOTE, Stacking outperformed all other models and achieved the highest accuracy, 94.00%. GB, RF, KNN, and XGB also achieved good accuracies of 93.23%, 89.50%, 89.38%, and 88%, respectively.

The study utilized a single dataset source (BRFSS), potentially limiting generalizability across different populations and healthcare systems. The increased computational overhead associated with SMOTE, particularly evident in complex models such as Gradient Boosting (a 10x increase in training time), may pose practical deployment challenges in resource-constrained environments. Additionally, while SMOTE effectively addressed class imbalance, alternative resampling techniques were not explored.

Future research should focus on: cross-dataset validation across multiple international cardiovascular datasets to establish generalizability; investigation of advanced resampling techniques, including ADASYN and Borderline-SMOTE; integration of deep learning with SMOTE for enhanced feature representation; real-time deployment studies in clinical decision support systems; development of explainable AI for interpretable healthcare predictions; and exploration of cost-sensitive learning as an alternative to resampling. These directions will advance the field toward more robust, clinically applicable cardiovascular risk assessment systems.

Declarations

Author Contribution

Conceptualization: Md. Saon Sikder; Methodology: Md. Saon Sikder; Formal analysis and investigation: Engr. Md.
Emad Uddin Aksir; Project administration: Engr. Md. Emad Uddin Aksir; Validation: Md. Saon Sikder, Engr. Md. Emad Uddin Aksir; Visualization: Md. Saon Sikder, Engr. Md. Emad Uddin Aksir; Writing - original draft preparation: Md. Saon Sikder; Writing - review and editing: Engr. Md. Emad Uddin Aksir

Data Availability

The dataset, derived from the "Behavioral Risk Factor Surveillance System (BRFSS)" and published in 2021, is available through the Kaggle website: https://www.kaggle.com/datasets/alphiree/cardiovascular-diseases-risk-prediction-dataset

References

[1] “Cardiovascular diseases (CVDs).” Accessed: Jun. 09, 2025. [Online]. Available: https://www.who.int/news-room/fact-sheets/detail/cardiovascular-diseases-(cvds)
[2] “1.” Accessed: Jun. 22, 2025. [Online]. Available: https://my.clevelandclinic.org/health/diseases/21493-cardiovascular-disease
[3] S. K. P., “Real-time Analytics and Clinical Decision Support Systems: Transforming Emergency Care,” Int. J. Multidiscip. Res., vol. 6, no. 6, Nov. 2024, doi: 10.36948/IJFMR.2024.V06I06.31500.
[4] R. T. Sutton, D. Pincock, D. C. Baumgart, D. C. Sadowski, R. N. Fedorak, and K. I. Kroeker, “An overview of clinical decision support systems: benefits, risks, and strategies for success,” NPJ Digit. Med., vol. 3, no. 1, p. 17, Dec. 2020, doi: 10.1038/S41746-020-0221-Y.
[5] “benefit2.” Accessed: Jul. 01, 2025. [Online]. Available: https://www.narayanahealth.org/blog/benefits-of-early-heart-disease-detection
[6] B. Ristevski and M. Chen, “Big Data Analytics in Medicine and Healthcare,” J. Integr. Bioinform., vol. 15, no. 3, p. 20170030, May 2018, doi: 10.1515/JIB-2017-0030.
[7] K. Batko and A. Ślęzak, “The use of Big Data Analytics in healthcare,” J. Big Data, vol. 9, no. 1, pp. 1–24, Dec. 2022, doi: 10.1186/S40537-021-00553-4.
[8] K. P. Kresoja, M. Unterhuber, R. Wachter, H. Thiele, and P. Lurz, “A cardiologist’s guide to machine learning in cardiovascular disease prognosis prediction,” Basic Res. Cardiol., vol. 118, no. 1, Dec. 2023, doi: 10.1007/S00395-023-00982-7.
[9] S. A. J. Zaidi, A. Ghafoor, J. Kim, Z. Abbas, and S. W. Lee, “HeartEnsembleNet: An Innovative Hybrid Ensemble Learning Approach for Cardiovascular Risk Prediction,” Healthcare, vol. 13, no. 5, p. 507, Feb. 2025, doi: 10.3390/HEALTHCARE13050507.
[10] A. Alqahtani, S. Alsubai, M. Sha, L. Vilcekova, and T. Javed, “Cardiovascular Disease Detection using Ensemble Learning,” Comput. Intell. Neurosci., vol. 2022, pp. 1–9, Aug. 2022, doi: 10.1155/2022/5267498.
[11] T. Liu, A. Krentz, L. Lu, and V. Curcin, “Machine learning based prediction models for cardiovascular disease risk using electronic health records data: systematic review and meta-analysis,” Eur. Hear. J. - Digit. Heal., vol. 6, no. 1, pp. 7–22, Jan. 2025, doi: 10.1093/EHJDH/ZTAE080.
[12] M. Chavan, S. K. Singh, S. Bansod, and P. Pal, “Design and Implementation of Heart Disease Prediction Using Artificial Neural Network,” Proc. 8th IEEE Int. Conf. Sci. Technol. Eng. Math. ICONSTEM 2023, 2023, doi: 10.1109/ICONSTEM56934.2023.10142267.
[13] S. P. Patro, N. Padhy, and D. Chiranjevi, “Ambient assisted living predictive model for cardiovascular disease prediction using supervised learning,” Evol. Intell., vol. 14, no. 2, pp. 941–969, Jun. 2021, doi: 10.1007/S12065-020-00484-8.
[14] A. Singh and R. Kumar, “Heart Disease Prediction Using Machine Learning Algorithms,” Int. Conf. Electr. Electron. Eng. ICE3 2020, pp. 452–457, Feb. 2020, doi: 10.1109/ICE348803.2020.9122958.
[15] K. S. L. Prasanna, N. P. Challa, and J. Nagaraju, “Heart Disease Prediction using Reinforcement Learning Technique,” 2023 3rd Int. Conf. Adv. Electr. Comput. Commun. Sustain. Technol. ICAECT 2023, 2023, doi: 10.1109/ICAECT57570.2023.10118232.
[16] H. V. Ramesh and R. K. Pathinarupothi, “Performance Analysis of Machine Learning Algorithms to Predict Cardiovascular Disease,” 2023 IEEE 8th Int. Conf. Converg. Technol. I2CT 2023, 2023, doi: 10.1109/I2CT57861.2023.10126428.
[17] K. K. Gola and S. Arya, “Satin Bowerbird Optimization-Based Classification Model for Heart Disease Prediction Using Deep Learning in E-Healthcare,” Proc. 23rd IEEE/ACM Int. Symp. Clust. Cloud Internet Comput. Work. CCGridW 2023, pp. 296–298, 2023, doi: 10.1109/CCGRIDW59191.2023.00063.
[18] S. H. Bani Hani and M. M. Ahmad, “Machine-learning Algorithms for Ischemic Heart Disease Prediction: A Systematic Review,” Curr. Cardiol. Rev., vol. 19, no. 1, pp. 87–99, Jun. 2022, doi: 10.2174/1573403X18666220609123053.
[19] “Cardiovascular Diseases Risk Prediction Dataset.” Accessed: Jul. 30, 2025. [Online]. Available: https://www.kaggle.com/datasets/alphiree/cardiovascular-diseases-risk-prediction-dataset
[20] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “SMOTE: Synthetic minority over-sampling technique,” J. Artif. Intell. Res., vol. 16, pp. 321–357, 2002, doi: 10.1613/JAIR.953.
[21] J. Huang, N. Huang, L. Zhang, and H. Xu, “A method for feature selection based on the correlation analysis,” Proc. 2012 Int. Conf. Meas. Inf. Control. MIC 2012, vol. 1, pp. 529–532, 2012, doi: 10.1109/MIC.2012.6273357.
[22] J. Bergstra and Y. Bengio, “Random Search for Hyper-Parameter Optimization,” J. Mach. Learn. Res., vol. 13, no. 10, pp. 281–305, 2012. Accessed: Aug. 05, 2025. [Online]. Available: http://jmlr.org/papers/v13/bergstra12a.html
[23] B. Naderalvojoud and T. Hernandez-Boussard, “Improving machine learning with ensemble learning on observational healthcare data,” AMIA Annu. Symp. Proc., vol. 2023, p. 521, 2024. Accessed: Aug. 16, 2025. [Online]. Available: https://pmc.ncbi.nlm.nih.gov/articles/PMC10785929/
[24] M. S. H. Rabbi et al., “Performance evaluation of optimal ensemble learning approaches with PCA and LDA-based feature extraction for heart disease prediction,” Biomed. Signal Process. Control, vol. 101, p. 107138, Mar. 2025, doi: 10.1016/J.BSPC.2024.107138.
[25] P. S. Mung and S. Phyu, “Ensemble learning method for enhancing healthcare classification,” WCSE 2020: 2020 10th Int.
Work. Comput. Sci. Eng., pp. 652–656, 2020, doi: 10.18178/WCSE.2020.02.024.

Additional Declarations

No competing interests reported.

Supplementary Files

Tables.docx
Emad Uddin Aksir","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAAA/klEQVRIiWNgGAWjYJACCQY2Zh4GBjYGhgQGGyCfsfEAKVrSQFoaiNLCANbCwHAYLIJXC7/Y4Ye3ecqsZczZjyV/eLjnvN3a9sNAW2psonFpkZydZmzNcy6dx7In7ZhEwrPbydvOJAK1HEvLbcChxeB2gpk0b9thHoMD6W0MCQduJ5sdAGphbDiMU4v97fRvEC3nnzd/SDhwLtns/EP8Wgykc6C23Eg7IJFw4ICd2Q0Ctkjczim2nAP0i8GNZ2lALckJZjeAtiTg8Qv/7PSNN96UWdsbnE8z/vjjgJ292fn0hw8+1Njg1IIBEsEqE4hVDgL2pCgeBaNgFIyCkQEAHaxkdXfS75YAAAAASUVORK5CYII=","orcid":"","institution":"Jahangirnagar University","correspondingAuthor":true,"prefix":"","firstName":"Engr.","middleName":"Md. Emad","lastName":"Uddin","suffix":"Md."}],"badges":[],"createdAt":"2025-08-21 17:08:20","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-7428299/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-7428299/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":89989930,"identity":"87c9c1d8-ce6c-4ecf-83d3-14b4c9da665f","added_by":"auto","created_at":"2025-08-27 07:09:40","extension":"png","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":133722,"visible":true,"origin":"","legend":"\u003cp\u003eThe Cardiovascular Dataset after Converting Categorical Variables to Numerical\u003c/p\u003e","description":"","filename":"image1.png","url":"https://assets-eu.researchsquare.com/files/rs-7428299/v1/f23a1594cfcd071c424f9c8c.png"},{"id":89990022,"identity":"4ae6ed20-81b7-4dba-9c8a-65796cbf116f","added_by":"auto","created_at":"2025-08-27 07:09:52","extension":"png","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":27887,"visible":true,"origin":"","legend":"\u003cp\u003eCardio_Disease class distribution before and after applying 
SMOTE\u003c/p\u003e","description":"","filename":"image2.png","url":"https://assets-eu.researchsquare.com/files/rs-7428299/v1/5b9fc874841dda8b8c814a8a.png"},{"id":89989862,"identity":"cc85229b-f6f9-4ee3-aa58-b777afff05fe","added_by":"auto","created_at":"2025-08-27 07:09:37","extension":"png","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":136498,"visible":true,"origin":"","legend":"\u003cp\u003ePearson Correlation Matrix of All Features\u003c/p\u003e","description":"","filename":"image3.png","url":"https://assets-eu.researchsquare.com/files/rs-7428299/v1/b59d805622f5bad921404c9b.png"},{"id":89989935,"identity":"ef56da67-3bcd-4216-9544-a6009bd876ee","added_by":"auto","created_at":"2025-08-27 07:09:40","extension":"png","order_by":4,"title":"Figure 4","display":"","copyAsset":false,"role":"figure","size":57576,"visible":true,"origin":"","legend":"\u003cp\u003eFeature Importance Scores from Boruta Algorithm\u003c/p\u003e","description":"","filename":"image4.png","url":"https://assets-eu.researchsquare.com/files/rs-7428299/v1/0ede98126901edc21bac86ee.png"},{"id":89989981,"identity":"0b218a0b-8e4b-4444-936c-0f7d62846d4b","added_by":"auto","created_at":"2025-08-27 07:09:46","extension":"png","order_by":5,"title":"Figure 5","display":"","copyAsset":false,"role":"figure","size":50727,"visible":true,"origin":"","legend":"\u003cp\u003eOverview of the Experimental Workflow for Model Training, SMOTE Evaluation, and Performance Comparison\u003c/p\u003e","description":"","filename":"image5.png","url":"https://assets-eu.researchsquare.com/files/rs-7428299/v1/abc36fc3ce1f7026d9cbd104.png"},{"id":89989947,"identity":"3b5f7f37-a6f9-468a-97bc-7c7cc2d41d76","added_by":"auto","created_at":"2025-08-27 07:09:42","extension":"png","order_by":6,"title":"Figure 6","display":"","copyAsset":false,"role":"figure","size":211171,"visible":true,"origin":"","legend":"\u003cp\u003ePrecision, Recall and F1-score Trend: With vs Without 
SMOTE\u003c/p\u003e","description":"","filename":"image6.png","url":"https://assets-eu.researchsquare.com/files/rs-7428299/v1/50ec4ee4ae61b7178af29e92.png"},{"id":89990035,"identity":"a72dbf73-3a06-404e-8c3f-6bf06c69fd83","added_by":"auto","created_at":"2025-08-27 07:09:58","extension":"png","order_by":7,"title":"Figure 7","display":"","copyAsset":false,"role":"figure","size":31641,"visible":true,"origin":"","legend":"\u003cp\u003eModel Performance Comparison: With and Without SMOTE\u003c/p\u003e","description":"","filename":"image7.png","url":"https://assets-eu.researchsquare.com/files/rs-7428299/v1/49127792f9e214bcbbd84db7.png"},{"id":90422803,"identity":"76013470-5054-4c02-b470-a9e2e5289f38","added_by":"auto","created_at":"2025-09-02 14:17:08","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":1479597,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-7428299/v1/b8b35337-e904-4696-b15d-22ee8ba1fdeb.pdf"},{"id":89991052,"identity":"2dabfd91-94ae-40c0-8ff2-30bd7486dbc4","added_by":"auto","created_at":"2025-08-27 07:17:42","extension":"docx","order_by":0,"title":"","display":"","copyAsset":false,"role":"supplement","size":16602,"visible":true,"origin":"","legend":"","description":"","filename":"Tables.docx","url":"https://assets-eu.researchsquare.com/files/rs-7428299/v1/7af85023606a14f8f2a87887.docx"}],"financialInterests":"No competing interests reported.","formattedTitle":"\u003cp\u003eMachine Learning-Based Cardiovascular Disease Prediction: Comparative Analysis of SMOTE Impact on Imbalanced Healthcare Data\u003c/p\u003e","fulltext":[{"header":"Introduction","content":"\u003cp\u003eCardiovascular disease (CVD) stands as the principal global reason for mortality according to the World Health Organization (WHO). 
Every year globally Cardiovascular disease (CVD) leads to around 18\u0026nbsp;million deaths, a percentage is declared about 31% [\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e]. Cardiovascular disease is a well-defined range of diseases that lead to the pathology of the heart and vascular system, which in most cases compromise a number of anatomical or physiological elements. The clinical manifestations are varied; some patients show distinct symptoms while others have none; making it difficult to diagnose and come up with intervention in time. Such conditions as vascular stenosis in the coronary, peripheral or systemic circulation, malformations of cardiac or vascular nature, structural defects in the functioning of the valves, disorders related to electrical conduction of the heart and allowing its individual disruptions in the form of arrhythmias fall under the category [\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e].\u003c/p\u003e\u003cp\u003eThere is no perfect reason of the cardiovascular disease (CVD) because it is a multifaceted disease and many of the risk factors are interacting. Most of these risk factors have a lot of connection with the lifestyle of an individual and their behavioral facts, which can be corrected to some extent. With the early detection of these parameters, assessment could be conducted in advance to determine the risk probability of individuals to develop CVD thus the intervention to minimize such risk could be conducted early. Among predictive factors, age, general health condition, the amount of physical activity, the history of smoking and alcohol drinking, and the comorbidities, including diabetes should be considered the factors commonly identified as essential in predictive modeling of the CVD.\u003c/p\u003e\u003cp\u003eClinicians tend to begin with simple tests by investigating into your family history and lifestyle factors. 
They do so in order to know what type of risk you are exposed to regarding your heart. This procedure may be time consuming and costly. That is why now more individuals propose the application of machine learning and data analytics in order to detect heart disease at an earlier stage. Such tools can support the quality of care decision making by physicians since they will be able to identify warning signs of care well in advance [\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e], [\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e]. This would reduce the possibility of improper diagnosis and would allow physicians to intervene before it is too late to take care of the patient [\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e].\u003c/p\u003e\u003cp\u003eThe prediction of cardiovascular disease (CVD) using machine learning (ML) and deep learning (DL) techniques has gained significant attention in recent years due to the rising global burden of heart-related illnesses. With advancements in data analytics and healthcare technology, researchers have focused on developing models that leverage clinical, demographic, and lifestyle data to enhance early detection and improve patient outcomes [\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e], [\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e]. The paper highlights the diversity of approaches, from traditional ML classifiers to advanced DL architectures, and their contributions to improving CVD prediction.\u003c/p\u003e\u003cp\u003eA study explored seven ML classifiers, including Random Forest (RF), Logistic Regression (LR), and Support Vector Machine (SVM), using Kaggle and IEEE DataPort datasets, with RF achieving the highest accuracy of 91.67% after hyperparameter tuning with RandomizedSearchCV [\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e]. 
Another paper proposed a hybrid ensemble model combining RF, Gradient Boosting, and Gaussian Naive Bayes, reporting an accuracy of 89.5% with improved precision and recall through ensemble voting [\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e]. Similarly, a convolutional neural network (CNN) model with multimodal data integration achieved 94.8% accuracy, outperforming unimodal CNNs due to its faster convergence and robust feature extraction [\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e]. Ensemble techniques, such as bagging and stacking, were also investigated, with XGBoost achieving a standout accuracy of 92.72% for myocardial infarction prediction, highlighting the effectiveness of gradient boosting methods [\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e].\u003c/p\u003e\u003cp\u003eSeveral studies emphasized the role of feature selection and hyperparameter optimization. A paper applied k-modes clustering with Huang initialization alongside RF and XGBoost, achieving 90.8% accuracy on the Cleveland dataset, underscoring the importance of clustering for data preprocessing [\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e]. Another work used Harris Hawks Optimization (HHO) with Artificial Neural Networks (ANNs), reporting 88.9% accuracy and improved F1-scores through optimized feature selection [\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e]. A systematic review and meta-analysis of ML models using electronic health records (EHRs) found that SVM and RF consistently outperformed traditional risk scores like QRISK3, with accuracies ranging from 85\u0026ndash;90% and enhanced recall through feature engineering [\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e]. 
Deep learning models, such as Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU), were compared in a recent study, with the hybrid CNN-LSTM model achieving 92% accuracy, leveraging SHAP for interpretability [\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e].\u003c/p\u003e\u003cp\u003eSome papers focused on specific risk factors or novel datasets. Another study highlighted Body Mass Index (BMI) as a critical predictor, using RF and LR to achieve 87.5% accuracy, with GridSearchCV for hyperparameter tuning [\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e]. Another work utilized reinforcement learning to predict heart disease, achieving 86% accuracy with a focus on dynamic feature weighting [\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e]. A paper on wearable IoT devices and ML reported accuracies above 90% with neural networks, emphasizing real-time monitoring capabilities [\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e]. The use of blockchain for secure data sharing in CVD prediction was explored, with DL models achieving 89% accuracy [\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e].\u003c/p\u003e\u003cp\u003eOther studies addressed specific CVD subtypes or innovative methods. A paper on ischemic heart disease used ML classifiers like SVM and RF, achieving 88% accuracy with feature ranking techniques [\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e]. A Satin Bowerbird Optimization-based DL model for e-healthcare applications reported 90.5% accuracy, focusing on scalability in real-world settings [\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e].\u003c/p\u003e"},{"header":"Methodology","content":"\u003cp\u003eTo complete this thesis, we undertook a rigorous methodology involving several key stages. 
We began with data management and preprocessing to ensure the quality and readiness of our dataset. This was followed by crucial steps in feature engineering and hyperparameter tuning to optimize our models, leading into the model implementation and thorough performance evaluation of our results.\u003c/p\u003e\n\u003cdiv id=\"Sec3\" class=\"Section2\"\u003e\n \u003ch2\u003eData Collection and Description\u003c/h2\u003e\n \u003cp\u003eTo commence the model evaluation a dataset was collected from \u0026ldquo;Behavioral Risk Factor Surveillance System (BRFSS)\u0026rdquo; through Kaggle website that contains the dataset published in 2021 [\u003cspan class=\"CitationRef\"\u003e19\u003c/span\u003e]. This unique categorical dataset comprising 308,070 rows and 17 columns encompass a variety of health-related attributes and potential risk factors associated with Cardiovascular diseases about U.S. residents. The features and the description of the dataset showed in the Table \u003cspan class=\"InternalRef\"\u003e1\u003c/span\u003e.\u003c/p\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003cdiv class=\"gridtable\"\u003e\n \u003ctable id=\"Tab1\" border=\"1\"\u003e\n \u003ccaption\u003e\n \u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e\n \u003cdiv class=\"CaptionContent\"\u003e\n \u003cp\u003eDescription of the dataset features\u003c/p\u003e\n \u003c/div\u003e\n \u003c/caption\u003e\n \u003cthead\u003e\n \u003ctr\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eFeature\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eExplanation\u003c/p\u003e\n \u003c/th\u003e\n \u003c/tr\u003e\n \u003c/thead\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eAge_Category\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eAge in complete years from 18 to 80+\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n 
\u003cp\u003eGeneral_Health\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003ePoor, Very Good, Fair, Good, Excellent\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eExercise\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eYes or No (Yes\u0026thinsp;=\u0026thinsp;1, No\u0026thinsp;=\u0026thinsp;0)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eSkin_Cancer\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eYes or No (Yes\u0026thinsp;=\u0026thinsp;1, No\u0026thinsp;=\u0026thinsp;0)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eOther_Cancer\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eYes or No (Yes\u0026thinsp;=\u0026thinsp;1, No\u0026thinsp;=\u0026thinsp;0)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eDepression\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eYes or No (Yes\u0026thinsp;=\u0026thinsp;1, No\u0026thinsp;=\u0026thinsp;0)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eArthritis\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eYes or No (Yes\u0026thinsp;=\u0026thinsp;1, No\u0026thinsp;=\u0026thinsp;0)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eSex\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eFemale or Male (Female\u0026thinsp;=\u0026thinsp;0, Male\u0026thinsp;=\u0026thinsp;1)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eHeight_(cm)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd 
align=\"left\"\u003e\n \u003cp\u003eHeight in Centimeter\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eWeight_(kg)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eWeight in Kilo gram\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eBMI\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eBody Mass index floating value\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eSmoking_History\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eYes or No (Yes\u0026thinsp;=\u0026thinsp;1, No\u0026thinsp;=\u0026thinsp;0)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eAlcohol_Consumption\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eAlcohol consumption in milliliter\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eGreen_Vegetables_Consumption\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eGreen vegetables in Gram\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eFruit_Consumption\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eFruits in Gram\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eDiabetes\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eNo\u0026thinsp;=\u0026thinsp;0, No, pre-diabetes or borderline diabetes\u0026thinsp;=\u0026thinsp;1,\u003c/p\u003e\n \u003cp\u003eYes\u0026thinsp;=\u0026thinsp;2, Yes, but female told only during pregnancy\u0026thinsp;=\u0026thinsp;3\u003c/p\u003e\n \u003c/td\u003e\n 
\u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eCardio_Disease\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eYes or No (Yes\u0026thinsp;=\u0026thinsp;1, No\u0026thinsp;=\u0026thinsp;0)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n \u003c/table\u003e\n \u003c/div\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n\u003c/div\u003e\n\u003ch3\u003eData Pre-processing\u003c/h3\u003e\n\u003cp\u003eTo achieve optimal output from the dataset, it is necessary to do data cleaning, transform the data into an appropriate format, and structure it for model utilization; this process is referred to as data pre-processing in machine learning. Initially, the presence of null values was checked for, but none were found. However, we identified several duplicate values, which have been eliminated. Most machine learning models cannot process categorical or text-based data. Therefore, we employed the Label Encoding approach to turn categorical columns into numerical values shown in the Fig.\u0026nbsp;\u003cspan class=\"InternalRef\"\u003e1\u003c/span\u003e for processing by the algorithms.\u003c/p\u003e\n\u003ch3\u003eData Balancing using SMOTE\u003c/h3\u003e\n\u003cp\u003eA significant problem in medical datasets, especially those associated with diseases classification, is the imbalanced class distribution, wherein one class (e.g., absence of disease) substantially exceeds the other class (e.g., presence of disease). This study revealed an enormous class imbalance in the target variable \u0026ldquo;Cardio_Disease,\u0026rdquo; which affected the fairness and accuracy of machine learning models, especially in detecting cases of the minority class. 
The primary dataset exhibited a notable class imbalance, comprising 283,883 instances of \u0026apos;No\u0026apos; and 24,971 instances of \u0026apos;Yes\u0026apos;, signifying a bias towards the negative class.\u003c/p\u003e\n\u003cp\u003eTo address this, we utilized the Synthetic Minority Over-sampling Technique (SMOTE) [\u003cspan class=\"CitationRef\"\u003e20\u003c/span\u003e]. SMOTE mitigates class imbalance by creating synthetic samples for the minority class instead of replicating existing instances. The dataset has 283,103 occurrences of \u0026apos;No\u0026apos; and 226,482 instances of \u0026apos;Yes\u0026apos; for \u0026apos;Cardio_Disease\u0026apos;, resulting in a balanced distribution in Fig.\u0026nbsp;\u003cspan class=\"InternalRef\"\u003e2\u003c/span\u003e. Synthetic samples are generated by interpolating between existing examples of the minority class, thereby enhancing variability and mitigating overfitting.\u003c/p\u003e\n\u003cp\u003eThe complete machine learning pipeline was conducted twice for comprehensive experimentation.\u003c/p\u003e\n\u003cul\u003e\n \u003cli\u003e\n \u003cp\u003eOnce on the original imbalanced dataset\u003c/p\u003e\n \u003c/li\u003e\n \u003cli\u003e\n \u003cp\u003eOnce on the SMOTE-balanced dataset\u003c/p\u003e\n \u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eThis dual-path methodology enabled an extensive comparison of performance measures both before and after to the use of SMOTE, which represents the cornerstone of our analysis.\u003c/p\u003e\n\u003ch3\u003eFeature Engineering\u003c/h3\u003e\n\u003cp\u003eFeature engineering involves many strategies designed to enhance model performance through the optimization of input data. 
In this study, we employed two principal feature engineering techniques: correlation analysis and wrapper-based feature selection.\u003c/p\u003e\n\u003ch3\u003eCorrelation Matrix Analysis\u003c/h3\u003e\n\u003cp\u003eA Pearson correlation matrix was computed to examine the linear correlations among the features [\u003cspan class=\"CitationRef\"\u003e21\u003c/span\u003e], as depicted in Fig.\u0026nbsp;\u003cspan class=\"InternalRef\"\u003e3\u003c/span\u003e. The matrix shows the strength of correlation between each pair of variables, with values ranging from \u0026minus;\u0026thinsp;1 (perfect negative correlation) to +\u0026thinsp;1 (perfect positive correlation). A value close to zero signifies little to no linear correlation.\u003c/p\u003e\n\u003cp\u003eThe analysis showed a strong positive correlation between Weight (kg) and BMI (r\u0026thinsp;=\u0026thinsp;0.86), a moderate correlation between Height (cm) and Weight (kg) (r\u0026thinsp;\u0026asymp;\u0026thinsp;0.47), and a strong correlation between Height (cm) and Sex (r\u0026thinsp;\u0026asymp;\u0026thinsp;0.70). Such elevated correlations indicate possible multicollinearity, which matters most for linear models. While tree-based models handle correlated features more effectively, precautions were taken to prevent substantially redundant features from compromising model interpretability.\u003c/p\u003e\n\u003cp\u003eAlthough some features had minimal correlation with the target variable Cardio_Disease, they were not immediately eliminated. A low Pearson correlation does not inherently indicate a lack of predictive power, particularly in non-linear models. 
Subsequent feature selection was conducted with the Boruta algorithm, which evaluates each feature\u0026apos;s importance to model performance.\u003c/p\u003e\n\u003cdiv id=\"Sec8\" class=\"Section2\"\u003e\n \u003ch2\u003eFeature Selection with Boruta Algorithm\u003c/h2\u003e\n \u003cp\u003eThe Boruta feature selection method was used to find the most important features for predicting cardiovascular disease. Boruta is a wrapper method built around a Random Forest classifier. It creates shadow features, which are randomly shuffled copies of the original features, and compares their importance scores with those of the real features. Features that consistently outperform their shadow counterparts are deemed important.\u003c/p\u003e\n \u003cp\u003eFigure\u0026nbsp;\u003cspan class=\"InternalRef\"\u003e4\u003c/span\u003e shows that Boruta ranked Age_Category, Diabetes, Arthritis, General_Health, and Smoking_History among the most important predictors. Conversely, features such as Fruit_Consumption, Green_Vegetables_Consumption, and Depression showed negligible importance and were therefore omitted from the final model input. This process ensured that the models were trained using only statistically significant variables, reducing the dimensionality of the dataset and potentially enhancing model performance and generalization.\u003c/p\u003e\n\u003c/div\u003e\n\u003ch3\u003eTrain-Test Split\u003c/h3\u003e\n\u003cp\u003eTo evaluate model performance, the preprocessed and feature-selected dataset was divided into training and testing sets in an 80:20 ratio. Stratified sampling was used to maintain the class distribution in both sets. 
This ensures that the models are trained on one subset of the data and assessed on unseen instances, which helps detect overfitting and gives a more reliable estimate of generalizability.\u003c/p\u003e\n\u003ch3\u003eHyperparameter Tuning\u003c/h3\u003e\n\u003cp\u003eHyperparameter optimization is essential for enhancing the efficacy and generalizability of machine learning models. In this study, we used RandomizedSearchCV, which samples a fixed number of configurations from the given hyperparameter distributions. Unlike exhaustive grid search, randomized search is computationally cheaper yet still explores a large configuration space [\u003cspan class=\"CitationRef\"\u003e22\u003c/span\u003e].\u003c/p\u003e\n\u003cp\u003eFor each model, a randomized search was conducted with 5-fold cross-validation, using classification accuracy as the evaluation metric. This enabled rigorous testing of model performance across the parameter settings shown in Table\u0026nbsp;\u003cspan class=\"InternalRef\"\u003e2\u003c/span\u003e. 
The search space for each of the algorithms was drawn from existing work and best practice in clinical predictive modeling.\u003c/p\u003e\n\u003cp\u003e\u0026nbsp;\u003c/p\u003e\n\u003cdiv class=\"gridtable\"\u003e\n \u003cdiv class=\"colspec\" align=\"left\"\u003e\u0026nbsp;\u003c/div\u003e\n \u003ctable id=\"Tab3\" border=\"1\"\u003e\n \u003ccaption\u003e\n \u003cdiv class=\"CaptionNumber\"\u003eTable 2\u003c/div\u003e\n \u003cdiv class=\"CaptionContent\"\u003e\n \u003cp\u003eParameters used for each model in the Without SMOTE and With SMOTE method\u003c/p\u003e\n \u003c/div\u003e\n \u003c/caption\u003e\n \u003cthead\u003e\n \u003ctr\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eModels\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eParameters\u003c/p\u003e\n \u003c/th\u003e\n \u003c/tr\u003e\n \u003c/thead\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eRandom Forest\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u0026apos;n_estimators\u0026apos;: [400,500,600], \u0026apos;min_samples_split\u0026apos;: randint(2, 20),\u0026nbsp; \u0026apos;min_samples_leaf\u0026apos;: randint(1, 20),\u0026nbsp; \u0026apos;max_features\u0026apos;: [\u0026apos;auto\u0026apos;, \u0026apos;sqrt\u0026apos;], \u0026apos;bootstrap\u0026apos;: [True, False]\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eSupport Vector Machine\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u0026apos;C\u0026apos;: [0.1, 1, 10, 100], \u0026apos;gamma\u0026apos;: [1, 0.1, 0.01, 0.001], \u0026apos;kernel\u0026apos;: [\u0026apos;rbf\u0026apos;, \u0026apos;poly\u0026apos;, \u0026apos;sigmoid\u0026apos;]\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eGaussian Naive Bayes\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n 
\u003cp\u003e\u0026apos;priors\u0026apos;: [None, [0.3, 0.7], [0.6, 0.4]], \u0026apos;var_smoothing\u0026apos;: [1e-9, 1e-8, 1e-7, 1e-6, 1e-5]\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eLogistic Regression\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u0026apos;C\u0026apos;: [0.001, 0.01, 0.1, 1, 10, 100], \u0026apos;penalty\u0026apos;: [None, \u0026apos;l1\u0026apos;, \u0026apos;l2\u0026apos;, \u0026apos;elasticnet\u0026apos;], \u0026apos;solver\u0026apos;: [\u0026apos;newton-cg\u0026apos;, \u0026apos;lbfgs\u0026apos;, \u0026apos;liblinear\u0026apos;, \u0026apos;sag\u0026apos;, \u0026apos;saga\u0026apos;], \u0026apos;max_iter\u0026apos;: [100, 200, 300], \u0026apos;multi_class\u0026apos;: [\u0026apos;auto\u0026apos;, \u0026apos;ovr\u0026apos;, \u0026apos;multinomial\u0026apos;], \u0026apos;class_weight\u0026apos;: [\u0026apos;balanced\u0026apos;, None], \u0026apos;warm_start\u0026apos;: [True, False]\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eDecision Tree\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u0026apos;criterion\u0026apos;: [\u0026apos;gini\u0026apos;, \u0026apos;entropy\u0026apos;], \u0026apos;splitter\u0026apos;: [\u0026apos;best\u0026apos;, \u0026apos;random\u0026apos;], \u0026apos;max_depth\u0026apos;: [None, 10, 20, 30], \u0026apos;max_features\u0026apos;: [\u0026apos;auto\u0026apos;, \u0026apos;sqrt\u0026apos;, \u0026apos;log2\u0026apos;], \u0026apos;min_samples_split\u0026apos;: [2, 5, 6, 10], \u0026apos;min_samples_leaf\u0026apos;: [1, 2, 4], \u0026apos;class_weight\u0026apos;: [\u0026apos;balanced\u0026apos;, None]\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eAdaboosting\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u0026apos;n_estimators\u0026apos;: [50, 100, 200], \u0026apos;learning_rate\u0026apos;: [0.01, 0.1, 1.0], \u0026apos;algorithm\u0026apos;: [\u0026apos;SAMME\u0026apos;, \u0026apos;SAMME.R\u0026apos;], \u0026apos;random_state\u0026apos;: [42]\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eK-Nearest Neighbor\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u0026apos;n_neighbors\u0026apos;: [3, 5, 7, 9], \u0026apos;weights\u0026apos;: [\u0026apos;uniform\u0026apos;, \u0026apos;distance\u0026apos;], \u0026apos;metric\u0026apos;: [\u0026apos;euclidean\u0026apos;, \u0026apos;manhattan\u0026apos;, \u0026apos;minkowski\u0026apos;, \u0026apos;Hamming\u0026apos;]\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eXGBoost\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u0026apos;n_estimators\u0026apos;: [100, 200, 300], \u0026apos;learning_rate\u0026apos;: [0.01, 0.05, 0.1, 0.2], \u0026apos;max_depth\u0026apos;: [3, 4, 5, 6], \u0026apos;min_child_weight\u0026apos;: 
[1, 2, 3], \u0026apos;subsample\u0026apos;: [0.8, 0.9, 1.0], \u0026apos;colsample_bytree\u0026apos;: [0.7, 0.8, 0.9]\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eGradient Boosting\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u0026apos;n_estimators\u0026apos;: [100, 200, 300, 400, 500], \u0026apos;learning_rate\u0026apos;: [0.01, 0.05, 0.1, 0.2, 0.3], \u0026apos;max_depth\u0026apos;: randint(3, 10), \u0026apos;min_samples_split\u0026apos;: randint(2, 20), \u0026apos;min_samples_leaf\u0026apos;: randint(1, 20), \u0026apos;subsample\u0026apos;: [0.8, 0.9, 1.0], \u0026apos;max_features\u0026apos;: [\u0026apos;auto\u0026apos;, \u0026apos;sqrt\u0026apos;, None]\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n \u003c/table\u003e\n\u003c/div\u003e\n\u003cp\u003eFor the Stacking model, we used the Random Forest, Support Vector Machine, Adaboosting, K-Nearest Neighbor, and XGBoost models as base_learners and a Linear Regression model as the final_estimator.\u003c/p\u003e\n\u003cp\u003eTo assess computational efficiency, the total training time of each model\u0026apos;s hyperparameter tuning process was recorded using time profiling. 
This made it possible to evaluate not only model accuracy but also training cost in terms of execution time, an essential consideration in real-world applications.\u003c/p\u003e\n\u003cdiv id=\"Sec17\" class=\"Section2\"\u003e\u0026nbsp;\u003c/div\u003e"},{"header":"Experimental Results and Discussions","content":"\u003cp\u003eThis section presents and analyzes the experimental findings of the study, which evaluates multiple machine learning models on an imbalanced dataset both with and without the Synthetic Minority Over-sampling Technique (SMOTE). The primary goal is to assess how SMOTE affects model performance in terms of accuracy and execution time. The study also examines how feature selection and hyperparameter optimization can improve classification outcomes. By comparing the algorithms under controlled experimental conditions, it offers insight into the impact of data balancing and optimization strategies on predictive performance, which is essential for building robust and reliable machine learning solutions for real-world imbalanced classification problems. Figure\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003e graphically represents the proposed experimental method.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003ch2\u003eEnvironmental Setup\u003c/h2\u003e\u003cp\u003eThe experiments in this study were performed on a desktop computer with an AMD Ryzen 5 5600G processor (6 cores, 12 threads), 32 GB of RAM, and a 250 GB SSD. The implementation was written in Python, with development conducted in Jupyter Notebook within the Visual Studio Code (VS Code) environment.\u003c/p\u003e\u003cp\u003eSeveral Python libraries and modules were used for data handling, preprocessing, model development, and performance measurement. 
These included pandas and numpy for data manipulation, matplotlib and seaborn for visualization, and scikit-learn for model training, evaluation, and hyperparameter optimization (via RandomizedSearchCV). BorutaPy was used for feature selection. The scikit-learn-intelex patch (imported via sklearnex) was applied to speed up certain model computations. Python's time module was used to time each trial.\u003c/p\u003e\u003ch2\u003eOptimal Hyperparameters\u003c/h2\u003e\u003cp\u003eThe candidate values in Table\u0026nbsp;\u003cspan refid=\"Tab3\" class=\"InternalRef\"\u003e2\u003c/span\u003e define the search space, and not every value is optimal for each model. Table\u0026nbsp;\u003cspan refid=\"Tab4\" class=\"InternalRef\"\u003e3\u003c/span\u003e lists the models along with their optimal hyperparameters, identified separately for the with-SMOTE and without-SMOTE conditions.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cdiv class=\"gridtable\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003ctable float=\"Yes\" id=\"Tab4\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 3\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003eOptimal Hyperparameters with and without SMOTE of each Model\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"3\"\u003e\u003c/colgroup\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u003cp\u003eModel\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u003cp\u003eOptimal Hyperparameters (Without 
SMOTE)\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c3\"\u003e\u003cp\u003eOptimal Hyperparameters (With SMOTE)\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eRandom Forest\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e'bootstrap':True,'max_features': 'sqrt', 'min_samples_leaf': 12, 'min_samples_split':8, 'n_estimators': 500\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e'bootstrap':False,'max_features': 'sqrt', 'min_samples_leaf': 3, 'min_samples_split':6,'n_estimators': 600\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eSupport Vector Machine\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e'kernel': 'sigmoid', 'gamma': 0.1, 'C': 0.1\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e'kernel': 'rbf', 'gamma': 0.1, 'C': 10\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eGaussian Naive Bayes\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e'var_smoothing': 1e-05, 'priors': None\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e'var_smoothing':1e-05,'priors': None\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eLogistic Regression\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e'warm_start': False, 'solver': 'lbfgs', 'penalty': None, 'multi_class': 'ovr', 'max_iter': 200, 'class_weight': None, 'C': 1\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e'warm_start': False, 'solver': 'lbfgs', 'penalty': None, 'multi_class': 'multinomial', 'max_iter': 300, 'class_weight': 
'balanced', 'C': 1\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eDecision Tree\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e'splitter':'random','min_samples_split': 2,'min_samples_leaf':2,'max_features': 'log2', 'max_depth': 10, 'criterion': 'entropy', 'class_weight': None\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e'splitter': 'best', 'min_samples_split': 6, 'min_samples_leaf': 1, 'max_features': 'sqrt', 'max_depth': 30, 'criterion': 'entropy', 'class_weight': 'balanced'\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eAdaboosting\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e'random_state':42,'n_estimators': 50,'learning_rate':0.01,'algorithm': 'SAMME'\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e'random_state': 42, 'n_estimators': 200, 'learning_rate': 1.0, 'algorithm': 'SAMME'\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eK-Nearest Neighbor\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e'weights': 'uniform', 'n_neighbors': 9, 'metric': 'manhattan'\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e'weights': 'uniform', 'n_neighbors': 3, 'metric': 'manhattan'\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eXGBoost\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e'subsample': 0.8, 'n_estimators': 100, 'min_child_weight': 2, 'max_depth': 6, 'learning_rate':0.05,'colsample_bytree': 0.7\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e'subsample': 0.8, 'n_estimators': 300, 'min_child_weight': 1, 
'max_depth': 6, 'learning_rate': 0.2, 'colsample_bytree': 0.8\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eGradient Boosting\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e'learning_rate': 0.01, 'max_depth': 6, 'max_features':None,'min_samples_leaf': 1, 'min_samples_split': 6, 'n_estimators': 400, 'subsample': 0.8\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e'learning_rate': 0.2, 'max_depth': 9, 'max_features':None,'min_samples_leaf': 17, 'min_samples_split': 7, 'n_estimators': 500, 'subsample': 0.9\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/table\u003e\u003c/div\u003e\u003cp\u003e\u003c/p\u003e\u003ch2\u003eResults Without SMOTE\u003c/h2\u003e\u003cp\u003eIn this part, we report the performance metrics of all selected models after training on the original imbalanced dataset. Each model was evaluated according to its accuracy, precision, recall, and F1-score. The objective is to assess the models' performance on the unprocessed data prior to implementing any resampling or balancing techniques. 
The results shown in Table\u0026nbsp;\u003cspan refid=\"Tab5\" class=\"InternalRef\"\u003e4\u003c/span\u003e were obtained with the best hyperparameter combinations found by the randomized search.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cdiv class=\"gridtable\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c7\" colnum=\"7\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c8\" colnum=\"8\"\u003e\u003c/div\u003e\u003ctable float=\"Yes\" id=\"Tab5\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 4\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003ePerformance Metrics of Models Without SMOTE\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"8\"\u003e\u003c/colgroup\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\" morerows=\"1\" rowspan=\"2\"\u003e\u003cp\u003eModel Name\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c2\" morerows=\"1\" rowspan=\"2\"\u003e\u003cp\u003eAccuracy\u003c/p\u003e\u003cp\u003e(%)\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colspan=\"2\" nameend=\"c4\" namest=\"c3\"\u003e\u003cp\u003ePrecision\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colspan=\"2\" nameend=\"c6\" 
namest=\"c5\"\u003e\u003cp\u003eRecall\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colspan=\"2\" nameend=\"c8\" namest=\"c7\"\u003e\u003cp\u003eF1-Score\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c3\"\u003e\u003cp\u003e0\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c4\"\u003e\u003cp\u003e1\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c5\"\u003e\u003cp\u003e0\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c6\"\u003e\u003cp\u003e1\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c7\"\u003e\u003cp\u003e0\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c8\"\u003e\u003cp\u003e1\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eRandom Forest\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e92.03\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.92\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e0.56\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e1.00\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e0.02\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e\u003cp\u003e0.96\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e\u003cp\u003e0.04\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eSupport Vector Machine\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e91.99\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.92\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" 
colname=\"c4\"\u003e\u003cp\u003e0.00\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e1.00\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e0.00\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e\u003cp\u003e0.96\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e\u003cp\u003e0.00\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eGaussian Naive Bayes\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e85.19\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.94\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e0.23\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e0.89\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e0.36\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e\u003cp\u003e0.92\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e\u003cp\u003e0.28\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eLogistic Regression\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e91.99\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.92\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e0.00\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e1.00\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e0.00\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" 
colname=\"c7\"\u003e\u003cp\u003e0.96\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e\u003cp\u003e0.00\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eDecision Tree\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e91.94\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.92\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e0.39\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e1.00\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e0.01\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e\u003cp\u003e0.96\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e\u003cp\u003e0.02\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eAdaboosting\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e91.99\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.92\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e0.00\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e1.00\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e0.00\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e\u003cp\u003e0.96\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e\u003cp\u003e0.00\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eK-Nearest Neighbor\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" 
char=\".\" colname=\"c2\"\u003e\u003cp\u003e91.85\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.92\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e0.32\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e1.00\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e0.02\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e\u003cp\u003e0.96\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e\u003cp\u003e0.03\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eXGBoost\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e92.04\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.92\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e0.55\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e1.00\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e0.03\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e\u003cp\u003e0.96\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e\u003cp\u003e0.05\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eGradient Boosting\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e92.01\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.92\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e0.52\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" 
colname=\"c5\"\u003e\u003cp\u003e1.00\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e0.03\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e\u003cp\u003e0.96\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e\u003cp\u003e0.05\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eStacking\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e91.86\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.93\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e0.46\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e0.99\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e0.10\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e\u003cp\u003e0.96\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e\u003cp\u003e0.16\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/table\u003e\u003c/div\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003eAlthough most models attain a high overall accuracy (~ 92%), their ability to identify the minority class (1) is markedly inadequate, as shown by the low class-1 recall and F1-score values. This is characteristic of imbalanced data, where models become biased towards the dominant class. Support Vector Machine, Logistic Regression, and Adaboosting have a class-1 recall of 0.00, meaning they assign essentially every instance to the majority class. The Random Forest model reached the highest class-1 precision (0.56) at 92.03% accuracy, but its class-1 recall was only 0.02, so it still lacks balanced efficacy. 
This motivates the subsequent application of SMOTE to balance the dataset and improve minority class identification.\u003c/p\u003e\u003ch2\u003eResults With SMOTE\u003c/h2\u003e\u003cp\u003eAfter SMOTE was applied to balance the class distribution within the dataset, each machine learning model was retrained and evaluated under the same experimental framework. The principal objective of employing SMOTE was to improve the detection of minority class instances, which were markedly underrepresented in the original dataset. Table\u0026nbsp;\u003cspan refid=\"Tab6\" class=\"InternalRef\"\u003e5\u003c/span\u003e summarizes the performance metrics (accuracy, precision, recall, and F1-score) for both classes (0 and 1) across all models trained with SMOTE.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cdiv class=\"gridtable\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c7\" colnum=\"7\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c8\" colnum=\"8\"\u003e\u003c/div\u003e\u003ctable float=\"Yes\" id=\"Tab6\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 5\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003ePerformance Metrics of Models With 
SMOTE\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"8\"\u003e\u003c/colgroup\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\" morerows=\"1\" rowspan=\"2\"\u003e\u003cp\u003eModel Name\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c2\" morerows=\"1\" rowspan=\"2\"\u003e\u003cp\u003eAccuracy\u003c/p\u003e\u003cp\u003e(%)\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colspan=\"2\" nameend=\"c4\" namest=\"c3\"\u003e\u003cp\u003ePrecision\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colspan=\"2\" nameend=\"c6\" namest=\"c5\"\u003e\u003cp\u003eRecall\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colspan=\"2\" nameend=\"c8\" namest=\"c7\"\u003e\u003cp\u003eF1-Score\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c3\"\u003e\u003cp\u003e0\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c4\"\u003e\u003cp\u003e1\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c5\"\u003e\u003cp\u003e0\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c6\"\u003e\u003cp\u003e1\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c7\"\u003e\u003cp\u003e0\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c8\"\u003e\u003cp\u003e1\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eRandom Forest\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e89.50\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.91\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e0.88\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e0.90\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" 
colname=\"c6\"\u003e\u003cp\u003e0.88\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e\u003cp\u003e0.91\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e\u003cp\u003e0.88\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eSupport Vector Machine\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e59.30\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.58\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e0.82\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e0.98\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e0.11\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e\u003cp\u003e0.73\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e\u003cp\u003e0.19\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eGaussian Naive Bayes\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e75.21\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.79\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e0.71\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e0.75\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e0.75\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e\u003cp\u003e0.77\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e\u003cp\u003e0.73\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd 
align=\"left\" colname=\"c1\"\u003e\u003cp\u003eLogistic Regression\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e74.24\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.79\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e0.69\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e0.72\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e0.77\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e\u003cp\u003e0.76\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e\u003cp\u003e0.73\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eDecision Tree\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e83.59\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.88\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e0.79\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e0.82\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e0.86\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e\u003cp\u003e0.85\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e\u003cp\u003e0.82\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eAdaboosting\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e80.49\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.83\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" 
char=\".\" colname=\"c4\"\u003e\u003cp\u003e0.78\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e0.82\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e0.79\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e\u003cp\u003e0.82\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e\u003cp\u003e0.78\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eK-Nearest Neighbor\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e89.38\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.99\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e0.81\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e0.82\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e0.99\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e\u003cp\u003e0.90\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e\u003cp\u003e0.89\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eXGBoost\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e88.00\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.89\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e0.87\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e0.90\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e0.95\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" 
colname=\"c7\"\u003e\u003cp\u003e0.89\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e\u003cp\u003e0.86\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eGradient Boosting\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e93.23\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.92\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e0.94\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e0.96\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e0.90\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e\u003cp\u003e0.94\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e\u003cp\u003e0.92\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eStacking\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e94.49\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e0.96\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e0.93\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e0.94\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e0.95\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e\u003cp\u003e0.95\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e\u003cp\u003e0.94\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/table\u003e\u003c/div\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003eThe results indicate that the application of SMOTE 
significantly enhanced the precision, recall, and F1-score for class 1 in the majority of models. Models such as Random Forest, Gradient Boosting, and Stacking attained balanced precision and recall for both classes, indicating strong performance on the balanced dataset. The Stacking model outperformed all others [\u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e23\u003c/span\u003e], [\u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e24\u003c/span\u003e], [\u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e25\u003c/span\u003e], achieving an F1-score of 0.94 for the minority class and an overall accuracy of 94.49%. Likewise, Gradient Boosting performed strongly across all metrics (93.23% accuracy and a minority-class F1-score of 0.92). Compared with the results obtained without SMOTE, these findings indicate that class balancing improved the models' ability to recognize minority class instances. Nonetheless, SVM still identified class 1 poorly (recall of 0.11), suggesting that SMOTE alone may not suffice for some algorithms without additional tuning or feature optimization.\u003c/p\u003e\u003ch2\u003eComparative Analysis: With vs Without SMOTE\u003c/h2\u003e\u003cp\u003eThe primary objective of this section is to evaluate how SMOTE affects each model's ability to detect the minority class by comparing accuracy, precision, recall, and F1-score before and after SMOTE was applied. The section also discusses the trade-offs in performance and training time introduced by class balancing.\u003c/p\u003e\u003cp\u003eAfter SMOTE is applied, a clear trend emerges across practically all models: the recall and F1-score for class 1 (the minority class) improve significantly, as depicted in Fig.\u0026nbsp;\u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e6\u003c/span\u003e. With SMOTE, Stacking achieved the highest minority-class F1-score (0.94) and accuracy (94.49%) of all the models, ahead of Gradient Boosting, XGBoost, and K-Nearest Neighbor. 
Before SMOTE, models such as Random Forest and KNN had high accuracy but low recall and F1-score for class 1, indicating bias toward the majority class. With SMOTE, these models became more balanced, gaining recall without sacrificing precision.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003eFigure\u0026nbsp;\u003cspan refid=\"Fig7\" class=\"InternalRef\"\u003e7\u003c/span\u003e presents a side-by-side comparison of the classification accuracy of each model trained with and without SMOTE. It clearly shows that SMOTE affected the accuracy of every model. While some models, such as Support Vector Machine and Logistic Regression, show a significant drop in accuracy after SMOTE is applied, others, such as Gradient Boosting and Stacking, maintain high performance or even improve.\u003c/p\u003e\u003cp\u003eThe Support Vector Machine continued to perform poorly even after SMOTE, as evidenced by its low recall for class 1 (0.11). This suggests that SMOTE may not be sufficient for certain linear models or may require additional tuning. Ensemble-based models, by contrast, benefited the most from SMOTE.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003ch2\u003eExecution Time Comparison\u003c/h2\u003e\u003cp\u003eTable\u0026nbsp;\u003cspan refid=\"Tab7\" class=\"InternalRef\"\u003e6\u003c/span\u003e shows the training durations for each model, with and without SMOTE. 
Applying SMOTE increased training time for every model, because synthetic samples must be generated and the resulting training set is larger. Most models roughly doubled in training time, while Gradient Boosting (about 10x) and Stacking (about 6.4x) showed the largest relative increases.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cdiv class=\"gridtable\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003ctable float=\"Yes\" id=\"Tab7\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 6\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003eExecution Time Comparison of Machine Learning Models\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"3\"\u003e\u003c/colgroup\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u003cp\u003eModel Name\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u003cp\u003eWithout SMOTE (sec)\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c3\"\u003e\u003cp\u003eWith SMOTE (sec)\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eRandom Forest\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e767.63\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e1650.69\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eSupport Vector Machine\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e3035.50\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" 
colname=\"c3\"\u003e\u003cp\u003e6455.76\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eGaussian Naive Bayes\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e6.78\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e10.31\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eLogistic Regression\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e1088.55\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e2336.80\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eDecision Tree\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e45.15\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e98.28\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eAdaboosting\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e464.39\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e1022.00\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eK-Nearest Neighbor\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e606.35\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e1588.47\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eXGBoost\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e210.78\u003c/p\u003e\u003c/td\u003e\u003ctd 
align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e347.63\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eGradient Boosting\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e2348.31\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e23485.31\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eStacking\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e175.85\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e1127.75\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/table\u003e\u003c/div\u003e"},{"header":"Conclusion","content":"\u003cp\u003eThis study aimed to build a model that predicts cardiovascular disease at an early stage from a set of predefined parameters. The work used records from 308,070 patients, which were cleaned and encoded before modeling. The Boruta algorithm was applied to determine which features are most important for the prediction. Data imbalance strongly affects model performance, so SMOTE was applied to balance the data. All models were trained both with and without SMOTE, and the two settings produced markedly different results. Without SMOTE, precision, recall, and F1-score for the minority class were very low for several models. Although the accuracies were higher, they are misleading, because accuracy alone is not a sufficient measure on imbalanced data. For the majority class, every model scored close to 1. 
For the minority class, however, the precision, recall, and F1-score were (0.56, 0.02, 0.04), (0.23, 0.36, 0.28), (0.32, 0.03, 0.02), (0.39, 0.01, 0.02), (0.52, 0.03, 0.05), and (0.52, 0.01, 0.02) for Random Forest, Gaussian Naive Bayes, XGBoost, Decision Tree, Gradient Boosting, and Stacking, respectively, and, most surprisingly, (0.00, 0.00, 0.00) for SVM, LR, and Adaboosting. Despite these poor minority-class metrics, XGB, RF, and GB reached accuracies of 92.04%, 92.03%, and 92.01%, and every model except GNB exceeded 91% accuracy. After the class imbalance was addressed with SMOTE, almost every model improved significantly. Although overall accuracy dropped sharply for some models, the minority-class precision, recall, and F1-score rose to (0.88, 0.88, 0.88), (0.71, 0.75, 0.73), (0.91, 0.87, 0.89), (0.79, 0.85, 0.82), (0.94, 0.90, 0.92), (0.92, 0.94, 0.93), (0.82, 0.11, 0.19), (0.69, 0.77, 0.73), and (0.79, 0.80, 0.80) for Random Forest, Gaussian Naive Bayes, XGBoost, Decision Tree, Gradient Boosting, Stacking, SVM, LR, and Adaboosting, respectively. With SMOTE, Stacking outperformed all other models and achieved the highest accuracy, 94.49%. GB, RF, KNN, and XGB also achieved high accuracies of 93.23%, 89.50%, 89.38%, and 88.00%, respectively. The study used a single dataset source (BRFSS), potentially limiting generalizability across different populations and healthcare systems. The increased computational overhead associated with SMOTE, particularly evident in complex models such as Gradient Boosting (a 10x increase in training time), may pose practical deployment challenges in resource-constrained environments. Additionally, while SMOTE effectively addressed class imbalance, alternative resampling techniques were not explored. 
Future research should focus on: cross-dataset validation across multiple international cardiovascular datasets to establish generalizability; investigation of advanced resampling techniques, including ADASYN and Borderline-SMOTE; integration of deep learning with SMOTE for enhanced feature representation; real-time deployment studies in clinical decision support systems; development of explainable AI for interpretable healthcare predictions; and exploration of cost-sensitive learning as an alternative to resampling. These directions would advance the field toward more robust, clinically applicable cardiovascular risk assessment systems.\u003c/p\u003e"},{"header":"Declarations","content":"\u003ch2\u003eAuthor Contribution\u003c/h2\u003e\u003cp\u003eConceptualization: Md. Saon Sikder; Methodology: Md. Saon Sikder; Formal analysis and investigation: Engr. Md. Emad Uddin Aksir; Project administration: Engr. Md. Emad Uddin Aksir; Validation: Md. Saon Sikder, Engr. Md. Emad Uddin Aksir; Visualization: Md. Saon Sikder, Engr. Md. Emad Uddin Aksir; Writing - original draft preparation: Md. Saon Sikder; Writing - review and editing: Engr. Md. Emad Uddin Aksir\u003c/p\u003e\u003ch2\u003eData Availability\u003c/h2\u003e\u003cp\u003eThe dataset, derived from the Behavioral Risk Factor Surveillance System (BRFSS) and published in 2021, is available through the Kaggle website: https://www.kaggle.com/datasets/alphiree/cardiovascular-diseases-risk-prediction-dataset\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003e\u0026ldquo;Cardiovascular diseases (CVDs).\u0026rdquo; Accessed: Jun. 09, 2025. [Online]. 
Available: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://www.who.int/news-room/fact-sheets/detail/cardiovascular-diseases-(cvds)\u003c/span\u003e\u003cspan address=\"https://www.who.int/news-room/fact-sheets/detail/cardiovascular-diseases-(cvds)\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003e\u0026ldquo;1.\u0026rdquo; Accessed: Jun. 22, 2025. [Online]. Available: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://my.clevelandclinic.org/health/diseases/21493-cardiovascular-disease\u003c/span\u003e\u003cspan address=\"https://my.clevelandclinic.org/health/diseases/21493-cardiovascular-disease\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eS. K. P., \u0026ldquo;Real-time Analytics and Clinical Decision Support Systems: Transforming Emergency Care,\u0026rdquo; \u003cem\u003eInt. J. Multidiscip. Res.\u003c/em\u003e, vol. 6, no. 6, Nov. 2024, doi: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.36948/IJFMR.2024.V06I06.31500\u003c/span\u003e\u003cspan address=\"10.36948/IJFMR.2024.V06I06.31500\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eR. T. Sutton, D. Pincock, D. C. Baumgart, D. C. Sadowski, R. N. Fedorak, and K. I. Kroeker, \u0026ldquo;An overview of clinical decision support systems: benefits, risks, and strategies for success,\u0026rdquo; \u003cem\u003eNPJ Digit. Med.\u003c/em\u003e, vol. 3, no. 1, p. 17, Dec. 
2020, doi: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1038/S41746-020-0221-Y\u003c/span\u003e\u003cspan address=\"10.1038/S41746-020-0221-Y\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003e\u0026ldquo;benefit2.\u0026rdquo; Accessed: Jul. 01, 2025. [Online]. Available: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://www.narayanahealth.org/blog/benefits-of-early-heart-disease-detection\u003c/span\u003e\u003cspan address=\"https://www.narayanahealth.org/blog/benefits-of-early-heart-disease-detection\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eB. Ristevski and M. Chen, \u0026ldquo;Big Data Analytics in Medicine and Healthcare,\u0026rdquo; \u003cem\u003eJ. Integr. Bioinform.\u003c/em\u003e, vol. 15, no. 3, p. 20170030, May 2018, doi: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1515/JIB-2017-0030\u003c/span\u003e\u003cspan address=\"10.1515/JIB-2017-0030\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eK. Batko and A. Ślęzak, \u0026ldquo;The use of Big Data Analytics in healthcare,\u0026rdquo; \u003cem\u003eJ. Big Data\u003c/em\u003e, vol. 9, no. 1, pp. 1\u0026ndash;24, Dec. 2022, doi: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1186/S40537-021-00553-4\u003c/span\u003e\u003cspan address=\"10.1186/S40537-021-00553-4\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eK. P. Kresoja, M. Unterhuber, R. Wachter, H. Thiele, and P. 
Lurz, \u0026ldquo;A cardiologist\u0026rsquo;s guide to machine learning in cardiovascular disease prognosis prediction,\u0026rdquo; \u003cem\u003eBasic Res. Cardiol.\u003c/em\u003e, vol. 118, no. 1, Dec. 2023, doi: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1007/S00395-023-00982-7\u003c/span\u003e\u003cspan address=\"10.1007/S00395-023-00982-7\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eS. A. J. Zaidi, A. Ghafoor, J. Kim, Z. Abbas, and S. W. Lee, \u0026ldquo;HeartEnsembleNet: An Innovative Hybrid Ensemble Learning Approach for Cardiovascular Risk Prediction,\u0026rdquo; \u003cem\u003eHealthcare\u003c/em\u003e, vol. 13, no. 5, p. 507, Feb. 2025, doi: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.3390/HEALTHCARE13050507\u003c/span\u003e\u003cspan address=\"10.3390/HEALTHCARE13050507\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eA. Alqahtani, S. Alsubai, M. Sha, L. Vilcekova, and T. Javed, \u0026ldquo;Cardiovascular Disease Detection using Ensemble Learning,\u0026rdquo; \u003cem\u003eComput. Intell. Neurosci.\u003c/em\u003e, vol. 2022, pp. 1\u0026ndash;9, Aug. 2022, doi: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1155/2022/5267498\u003c/span\u003e\u003cspan address=\"10.1155/2022/5267498\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eT. Liu, A. Krentz, L. Lu, and V. Curcin, \u0026ldquo;Machine learning based prediction models for cardiovascular disease risk using electronic health records data: systematic review and meta-analysis,\u0026rdquo; \u003cem\u003eEur. Hear. J. - Digit. Heal.\u003c/em\u003e, vol. 6, no. 1, pp. 
7\u0026ndash;22, Jan. 2025, doi: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1093/EHJDH/ZTAE080\u003c/span\u003e\u003cspan address=\"10.1093/EHJDH/ZTAE080\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eM. Chavan, S. K. Singh, S. Bansod, and P. Pal, \u0026ldquo;Design and Implementation of Heart Disease Prediction Using Artificial Neural Network,\u0026rdquo; \u003cem\u003eProc. 8th IEEE Int. Conf. Sci. Technol. Eng. Math. ICONSTEM 2023\u003c/em\u003e, 2023, doi: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1109/ICONSTEM56934.2023.10142267\u003c/span\u003e\u003cspan address=\"10.1109/ICONSTEM56934.2023.10142267\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eS. P. Patro, N. Padhy, and D. Chiranjevi, \u0026ldquo;Ambient assisted living predictive model for cardiovascular disease prediction using supervised learning,\u0026rdquo; \u003cem\u003eEvol. Intell.\u003c/em\u003e, vol. 14, no. 2, pp. 941\u0026ndash;969, Jun. 2021, doi: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1007/S12065-020-00484-8\u003c/span\u003e\u003cspan address=\"10.1007/S12065-020-00484-8\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eA. Singh and R. Kumar, \u0026ldquo;Heart Disease Prediction Using Machine Learning Algorithms,\u0026rdquo; \u003cem\u003eInt. Conf. Electr. Electron. Eng. ICE3 2020\u003c/em\u003e, pp. 452\u0026ndash;457, Feb. 
2020, doi: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1109/ICE348803.2020.9122958\u003c/span\u003e\u003cspan address=\"10.1109/ICE348803.2020.9122958\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eK. S. L. Prasanna, N. P. Challa, and J. Nagaraju, \u0026ldquo;Heart Disease Prediction using Reinforcement Learning Technique,\u0026rdquo; 2023 \u003cem\u003e3rd Int. Conf. Adv. Electr. Comput. Commun. Sustain. Technol. ICAECT 2023\u003c/em\u003e, 2023, doi: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1109/ICAECT57570.2023.10118232\u003c/span\u003e\u003cspan address=\"10.1109/ICAECT57570.2023.10118232\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eH. V. Ramesh and R. K. Pathinarupothi, \u0026ldquo;Performance Analysis of Machine Learning Algorithms to Predict Cardiovascular Disease,\u0026rdquo; 2023 \u003cem\u003eIEEE 8th Int. Conf. Converg. Technol. I2CT 2023\u003c/em\u003e, 2023, doi: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1109/I2CT57861.2023.10126428\u003c/span\u003e\u003cspan address=\"10.1109/I2CT57861.2023.10126428\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eK. K. Gola and S. Arya, \u0026ldquo;Satin Bowerbird Optimization-Based Classification Model for Heart Disease Prediction Using Deep Learning in E-Healthcare,\u0026rdquo; \u003cem\u003eProc. \u0026ndash;\u0026thinsp;23rd IEEE/ACM Int. Symp. Clust. Cloud Internet Comput. Work. CCGridW\u003c/em\u003e 2023, pp. 
296\u0026ndash;298, 2023, doi: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1109/CCGRIDW59191.2023.00063\u003c/span\u003e\u003cspan address=\"10.1109/CCGRIDW59191.2023.00063\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eS. H. Bani Hani and M. M. Ahmad, \u0026ldquo;Machine-learning Algorithms for Ischemic Heart Disease Prediction: A Systematic Review,\u0026rdquo; \u003cem\u003eCurr. Cardiol. Rev.\u003c/em\u003e, vol. 19, no. 1, pp. 87\u0026ndash;99, Jun. 2022, doi: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.2174/1573403X18666220609123053/CITE/REFWORKS\u003c/span\u003e\u003cspan address=\"10.2174/1573403X18666220609123053/CITE/REFWORKS\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003e\u0026ldquo;Cardiovascular Diseases Risk Prediction Dataset.\u0026rdquo; Accessed: Jul. 30, 2025. [Online]. Available: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://www.kaggle.com/datasets/alphiree/cardiovascular-diseases-risk-prediction-dataset\u003c/span\u003e\u003cspan address=\"https://www.kaggle.com/datasets/alphiree/cardiovascular-diseases-risk-prediction-dataset\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eN. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, \u0026ldquo;SMOTE: Synthetic minority over-sampling technique,\u0026rdquo; \u003cem\u003eJ. Artif. Intell. Res.\u003c/em\u003e, vol. 16, pp. 
321\u0026ndash;357, 2002, doi: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1613/JAIR.953\u003c/span\u003e\u003cspan address=\"10.1613/JAIR.953\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eJ. Huang, N. Huang, L. Zhang, and H. Xu, \u0026ldquo;A method for feature selection based on the correlation analysis,\u0026rdquo; \u003cem\u003eProc. 2012 Int. Conf. Meas. Inf. Control. MIC\u003c/em\u003e 2012, vol. 1, pp. 529\u0026ndash;532, 2012, doi: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1109/MIC.2012.6273357\u003c/span\u003e\u003cspan address=\"10.1109/MIC.2012.6273357\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eJ. Bergstra, J. B. Ca, and Y. B. Ca, \u0026ldquo;Random Search for Hyper-Parameter Optimization,\u0026rdquo; \u003cem\u003eJ. Mach. Learn. Res.\u003c/em\u003e, vol. 13, no. 10, pp. 281\u0026ndash;305, 2012, Accessed: Aug. 05, 2025. [Online]. Available: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttp://jmlr.org/papers/v13/bergstra12a.html\u003c/span\u003e\u003cspan address=\"http://jmlr.org/papers/v13/bergstra12a.html\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eB. Naderalvojoud and T. Hernandez-Boussard, \u0026ldquo;Improving machine learning with ensemble learning on observational healthcare data,\u0026rdquo; \u003cem\u003eAMIA Annu. Symp. Proc.\u003c/em\u003e, vol. 2023, p. 521, 2024, Accessed: Aug. 16, 2025. [Online]. 
Available: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://pmc.ncbi.nlm.nih.gov/articles/PMC10785929/\u003c/span\u003e\u003cspan address=\"https://pmc.ncbi.nlm.nih.gov/articles/PMC10785929/\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eM. S. H. Rabbi \u003cem\u003eet al.\u003c/em\u003e, \u0026ldquo;Performance evaluation of optimal ensemble learning approaches with PCA and LDA-based feature extraction for heart disease prediction,\u0026rdquo; \u003cem\u003eBiomed. Signal Process. Control\u003c/em\u003e, vol. 101, p. 107138, Mar. 2025, doi: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1016/J.BSPC.2024.107138\u003c/span\u003e\u003cspan address=\"10.1016/J.BSPC.2024.107138\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eP. S. Mung and S. Phyu, \u0026ldquo;Ensemble learning method for enhancing healthcare classification,\u0026rdquo; \u003cem\u003eWCSE 2020 2020 10th Int. Work. Comput. Sci. Eng.\u003c/em\u003e, pp. 
652\u0026ndash;656, 2020, doi: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.18178/WCSE.2020.02.024\u003c/span\u003e\u003cspan address=\"10.18178/WCSE.2020.02.024\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":false,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":true,"hideJournal":true,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"cardiovascular disease prediction, machine learning, SMOTE, class imbalance, ensemble methods, healthcare analytics","lastPublishedDoi":"10.21203/rs.3.rs-7428299/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-7428299/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eCardiovascular disease (CVD) constitutes the primary global mortality cause, affecting 18\u0026nbsp;million individuals annually. Machine learning approaches for CVD prediction face significant challenges due to inherent class imbalance in healthcare datasets, where disease-positive cases are substantially underrepresented, leading to biased model performance favoring majority classes. 
This comprehensive study evaluated ten machine learning algorithms including Random Forest, Support Vector Machine, XGBoost, and ensemble methods on the Behavioral Risk Factor Surveillance System (BRFSS) dataset containing 308,070 patient records. The Boruta algorithm identified optimal feature subsets, while RandomizedSearchCV performed hyperparameter optimization. Model performance was assessed both on original imbalanced data and after applying Synthetic Minority Over-sampling Technique (SMOTE) for class balancing. Original imbalanced datasets yielded high overall accuracies (~\u0026thinsp;92%) but severely compromised minority class detection (F1-scores: 0.00-0.28). SMOTE implementation dramatically enhanced minority class performance: Stacking ensemble achieved optimal results with 94.49% accuracy and 0.94 F1-score for CVD-positive cases. Ensemble methods demonstrated superior adaptability to class balancing compared to linear algorithms, which showed substantial performance degradation. 
SMOTE effectively mitigates class imbalance challenges in cardiovascular disease prediction, significantly improving minority class detection capabilities while preserving overall model accuracy, establishing ensemble methods as optimal approaches for imbalanced healthcare applications.\u003c/p\u003e","manuscriptTitle":"Machine Learning-Based Cardiovascular Disease Prediction: Comparative Analysis of SMOTE Impact on Imbalanced Healthcare Data","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2025-08-27 07:08:26","doi":"10.21203/rs.3.rs-7428299/v1","editorialEvents":[{"type":"communityComments","content":0}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"ef919dec-545f-4e4b-a559-93060b2aa225","owner":[],"postedDate":"August 27th, 2025","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"posted","subjectAreas":[],"tags":[],"updatedAt":"2025-09-02T14:08:57+00:00","versionOfRecord":[],"versionCreatedAt":"2025-08-27 07:08:26","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-7428299","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-7428299","identity":"rs-7428299","version":["v1"]},"buildId":"XKTyCvWXoU3ODBz1xrDgd","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}



License: CC-BY-4.0