Predicting Diabetes in Adults: Identifying Important Features in Unbalanced Data Over a 5-Year Cohort Study Using Machine Learning Algorithm

doi:10.21203/rs.3.rs-4772777/v1

2024 · doi:10.21203/rs.3.rs-4772777/v1

preprint OA: closed CC-BY-4.0

Research Square · Research Article

Maryam Talebi Moghaddam, Yones Jahani, Zahra Arefzadeh, Azizallah Dehghan, and 3 more

This is a preprint; it has not been peer reviewed by a journal.

https://doi.org/10.21203/rs.3.rs-4772777/v1

This work is licensed under a CC BY 4.0 License.

Status: Published. Journal publication published 27 Sep, 2024. Read the published version in BMC Medical Research Methodology.

Version 1 posted 15

Abstract

Background
Imbalanced datasets pose significant challenges in predictive modeling, leading to biased outcomes and reduced model reliability. This study addresses data imbalance in diabetes prediction using machine learning techniques.
Utilizing data from the Fasa Adult Cohort Study (FACS) with a 5-year follow-up of 10,000 participants, we developed predictive models for Type 2 diabetes.

Methods
We employed various data-level and algorithm-level interventions, including SMOTE, ADASYN, SMOTEENN, and KMeans SMOTE, paired with Random Forest, Gradient Boosting, and Multi-Layer Perceptron (MLP). Performance was evaluated using F1 score, AUC, and G-means.

Results
Our results show that ADASYN with MLP achieved an F1 score of 82.17 ± 3.38, AUC of 89.61 ± 2.09, and G-means of 89.15 ± 2.31. SMOTE with MLP followed closely with an F1 score of 79.85 ± 3.91, AUC of 89.7 ± 2.54, and G-means of 89.31 ± 2.78. The SMOTEENN with Random Forest combination achieved an F1 score of 78.27 ± 1.54, AUC of 87.18 ± 1.12, and G-means of 86.47 ± 1.28.

Conclusion
These combinations effectively address class imbalance, improving the accuracy and reliability of diabetes predictions. The findings highlight the importance of using appropriate data-balancing techniques in medical data analysis.

Keywords: Imbalanced datasets, Diabetes prediction, Machine learning, Artificial intelligence, Data-level method, Algorithm-level method

1. Introduction

Type 2 diabetes is a chronic metabolic disorder characterized by insulin resistance or insufficient insulin production, and it contributes significantly to the global burden of disease (1). It is associated with severe complications, including heart disease, stroke, kidney failure, blindness, and lower-limb amputation. It has also been linked to increased risks of dementia, hearing loss, and certain cancers, thereby heightening the risk of premature mortality (2, 3). The incidence of diabetes is rising at an alarming rate. According to the International Diabetes Federation, the global diabetes population was 382 million in 2013 and is anticipated to surge to 592 million by 2035 (4).
Similarly, a study highlighted that the prevalence among adults was 6.4% in 2010, affecting 285 million adults, with projections indicating an increase to 7.7%, affecting 439 million adults, by 2030. This escalating trend underscores the critical need for robust predictive tools to effectively manage and mitigate the disease's impact (5).

In response, machine learning (ML) techniques are increasingly leveraged to forecast the onset of diabetes and its complications (6). These methods have shown considerable efficacy in enhancing risk prediction, prognosis, treatment, and management strategies (7, 8). Popular ML models used in diabetes prediction include Random Forest, K-NN, neural networks, support vector machines, decision trees, and extra trees (7, 9).

A notable challenge in this domain is the prevalence of data imbalance in clinical datasets, which typically include variables like blood sugar and blood pressure (10). Such imbalance can drastically affect the performance of predictive models, often resulting in biased outcomes and less reliable predictions. This is particularly critical in forecasting the incidence of type 2 diabetes, where the accuracy of predictions can significantly influence preventive and therapeutic measures (11).

To tackle these challenges, our study adopts both data-level and algorithm-level interventions (11–14). We explore oversampling, undersampling, and hybrid sampling techniques to correct data imbalances (15). Additionally, we utilize a range of ML algorithms, evaluating their effectiveness through the F1 score, AUC, and G-means to identify the most proficient approaches for predicting the incidence of diabetes (16).

2. Related Work

Recent studies have addressed the challenges of applying machine learning algorithms to imbalanced datasets, particularly in the prediction of diabetes.
These efforts are marked by the development of various resampling methods and algorithmic adjustments to improve predictive performance.

In 2022, Somieh et al. utilized Deep Neural Network (DNN), Extreme Gradient Boosting (XGBoost), and Random Forest (RF) algorithms on the Tehran Lipid and Glucose Study (TLGS) cohort data, which was notably imbalanced. Their findings highlighted that undersampling methods yielded superior results compared to other techniques in managing data skewness (17).

Masoud Mohammad Hassan et al. (2019) explored six different algorithms (logistic regression, decision tree, K-Nearest Neighbors, Naive Bayes, Support Vector Machine, and artificial neural networks) on the Pima dataset. They implemented the Synthetic Minority Over-sampling Technique (SMOTE) to address data imbalance. The study concluded that SVM benefited most from this resampling method, demonstrating enhanced model performance (18).

Matloob Khushi et al. (2021) conducted a study using two imbalanced lung cancer datasets, PLCO and NLST. They employed 23 resampling models alongside hybrid systems, using logistic regression, Random Forest, and LinearSVC to determine the most effective forecasting model. Their results indicated that undersampling techniques generally exhibited higher standard deviations, while oversampling resulted in lower variances. Random Forest was identified as having the best predictive ability for the lung cancer datasets used (19).

M. Sandeep Kumar et al. (2022) used six algorithms (K-Nearest Neighbors, Naive Bayes, Support Vector Machines, Random Forest, Logistic Regression, and Decision Trees) on the Pima Indians dataset. They tested both oversampling and undersampling methods. Their results suggested that SVM outperformed other models in dealing with imbalanced data (20).

A significant study in 2024 by O. Olawale Awe et al.
investigated the use of various resampling algorithms (including random oversampling, SMOTE, ADASYN, random subsampling, Tomek links, NearMiss, and others) across four imbalanced datasets related to diabetes, anemia, lung cancer, and obesity. They found that the Repeated Edited Nearest Neighbours (RENN) method combined with logistic regression achieved the most substantial improvement in predictions (21).

In 2023, Wahyu Nugraha et al. focused on the Pima Indians dataset, employing the SMOTE + Tomek link method along with a decision tree classification algorithm. Their experimental results showed that this combination performed better than using SMOTE without Tomek links (22).

Also in 2023, Hirani Hairani and Dadang Priyanto studied the same Pima Indians dataset using SVM and Random Forest with SMOTE-ENN. They concluded that Random Forest with SMOTE-ENN outperformed SVM (23).

These studies collectively illustrate a diverse range of strategies for addressing data imbalance in diabetes prediction, highlighting the effectiveness of different resampling techniques and machine learning algorithms.

3. Methods and Materials

3.1. Study Population

This study utilizes data from the Fasa Adult Cohort Study (FACS), initiated in 2016 with baseline data collection completed in 2018. The FACS monitors individuals annually to observe the occurrence of various health events, including cardiovascular diseases, diabetes, and hypertension. To date, a 5-year follow-up has been conducted for all participants, with an initial cohort size of 10,000 individuals. Data were gathered through detailed interviews conducted by trained personnel using comprehensive questionnaires. These questionnaires covered a wide range of topics such as demographics, nutritional status, physical activity, personal habits, and history of chronic or underlying diseases. Additionally, participants underwent blood, urine, and stool tests, along with anthropometric assessments.
More comprehensive details about the FACS methodology are available in the study's protocol and profile publications (24, 25).

3.2. Definition of Variables

The primary outcome, or dependent variable, of this study is the 5-year cumulative incidence of Type 2 diabetes. Diagnosis is primarily determined using the HbA1c test as recommended by the American Diabetes Association, where a result of 6.5% or higher suggests diabetes. Alternatively, a 2-hour plasma glucose (2-h PG) value of 200 mg/dL or higher during an Oral Glucose Tolerance Test (OGTT) also indicates diabetes (26).

The independent variables include demographic factors such as sex, age, occupation, and education level. Health-related variables include the presence of cardiovascular diseases, smoking status, opium use, and chronic conditions like kidney disorders, fatty liver, and lung disease. The anthropometric measurements considered are body mass index (BMI), waist circumference, and weight. Additionally, the study considers Metabolic Equivalent of Task (MET) levels, socioeconomic status, and lipid profiles, including LDL, HDL, total cholesterol, and triglycerides, as potential predictors.

This framework is designed to capture the interplay of the various factors contributing to the incidence of Type 2 diabetes, supporting a comprehensive analysis of the risks associated with the disease.

3.3. Data Preparation and Pre-processing

Data preprocessing is an essential phase in machine learning projects, setting the foundation for the effectiveness of the analysis. In this study, the preprocessing stage was designed to ensure the data's suitability for the applied machine learning models, focusing on handling missing values and data normalization.

3.3.1. Handling Missing Values

In dealing with missing values, our approach was to maintain the integrity and quality of the dataset by removing any rows that contained missing data.
This decision was based on the premise that the presence of missing values could introduce bias or inaccuracies into our models' predictions. Although this method reduced the dataset size, it ensured that the remaining data was complete, thereby improving the reliability of our analysis. The direct removal of rows with missing data was deemed the most straightforward and effective strategy, considering the dataset's sufficient size and the distribution of missing values across the dataset.

3.3.2. Data Normalization

Normalization was the primary step in our data preprocessing, specifically employing min-max scaling. This technique adjusts the values of numeric columns in the dataset to a common scale, between 0 and 1, without distorting differences in the ranges of values or losing information. Min-max scaling is mathematically represented as:

$$X_{norm} = \frac{X - X_{min}}{X_{max} - X_{min}}$$

where $X$ is the original value, $X_{min}$ and $X_{max}$ are the minimum and maximum values for the feature, respectively, and $X_{norm}$ is the normalized value.

This step is crucial in our context, where the dataset encompasses a wide range of physiological and clinical measures. Normalizing these values ensures that no single feature disproportionately influences the model due to its scale, facilitating a more balanced and effective analysis. Moreover, min-max scaling helps accelerate the convergence of gradient descent algorithms by ensuring that the feature space is uniformly scaled (27).

3.4. Balancing Data through Sampling Methods

Sampling methods are essential for addressing data imbalance, ensuring that classification models maintain high accuracy and sensitivity in diabetes detection. The primary goal of these techniques is to rebalance the dataset to allow equitable learning from both classes (28).

3.5. Overview of Sampling Approaches

The following sections describe the sampling methods considered: oversampling, undersampling, and hybrid approaches. These strategies are critical for mitigating the challenges associated with imbalanced datasets in diabetes classification.

3.5.1. Oversampling Methods

Random Over Sampling: This method randomly duplicates examples from the minority class to balance the class distribution. Although straightforward, it may result in overfitting by potentially replicating noise in the data (29).

SMOTE (Synthetic Minority Over-sampling Technique): Developed by Chawla et al., SMOTE creates synthetic samples through interpolation between multiple minority class samples, helping to avoid overfitting by expanding the decision region for the minority class (30).

ADASYN (Adaptive Synthetic Sampling Approach): This technique, introduced by He et al., focuses on generating synthetic data for samples that are difficult to classify, enhancing model generalization (31).

Borderline SMOTE: This variant targets minority class samples near the decision boundary, aiming to improve classification on challenging cases by generating synthetic samples along the borderline (32).

KMeans SMOTE: By integrating K-Means clustering with SMOTE, this method clusters the minority class before applying SMOTE within each cluster to produce more contextually relevant synthetic samples (33).

Smotified GAN (SMOTE and GAN): This approach combines Generative Adversarial Networks (GANs) with SMOTE to produce realistic synthetic samples, enhancing the diversity and authenticity of the data (34).

3.5.2. Hybrid Sampling Methods

Hybrid sampling strategies merge the benefits of both oversampling and undersampling to balance the dataset while minimizing information loss and reducing the risk of overfitting.
SMOTEENN (SMOTE + Edited Nearest Neighbors): This method pairs SMOTE with the undersampling technique Edited Nearest Neighbors (ENN), which refines the oversampled data by removing instances that are misclassified by their nearest neighbors (35).

SMOTETomek: Similar to SMOTEENN, SMOTETomek combines SMOTE with Tomek links to identify and eliminate close sample pairs from opposing classes, clarifying the decision boundary between classes (36).

3.6. Classifier Selection Post-Sampling for Diabetes Prediction

With the dataset now balanced, we proceed to select suitable machine learning classifiers. As Table 1 illustrates, each algorithm has strengths and challenges that, when carefully paired with the rebalanced data, can yield a more accurate prediction model for diabetes. Table 1 lists the classifiers considered for this study after sampling. It summarizes the advantages and disadvantages of each classifier in the balanced-data context, guiding our selection towards the most effective algorithm for predicting the incidence of diabetes.

Table 1 Strengths and challenges of each algorithm.

| Algorithm | Advantages | Disadvantages |
|---|---|---|
| Logistic Regression (37) | Simple to implement and interpret. Efficient to train. Good for binary classification. | Assumes linear relationship between variables. Not suitable for complex relationships. |
| K-Nearest Neighbors (KNN) (38) | No assumption about data. Simple and effective. Adaptable to any type of data. | Computationally expensive. Performance depends on the number of dimensions. |
| Support Vector Machine (SVM) (39) | Effective in high dimensional spaces. Memory efficient. Versatile with kernel functions. | Requires careful parameter tuning. Not suitable for large datasets. |
| Decision Tree Classifier (40) | Easy to interpret and visualize. Can handle both numerical and categorical data. | Prone to overfitting. Can become unstable with small variations in data. |
| RandomForestClassifier (41) | Handles overfitting well. Works well on large datasets. Provides feature importances. | Can be slow to predict. Complex and difficult to interpret. |
| AdaBoostClassifier (42) | Improves classification accuracy. Flexible to combine with any learning algorithm. | Sensitive to noisy data and outliers. Can overfit on very complex datasets. |
| Gradient Boosting (43) | Highly effective and flexible. Can optimize on different loss functions. | Prone to overfitting without proper tuning. Time-consuming to train. |
| SGDClassifier (44) | Efficient for large-scale problems. Easy to implement and provides a lot of opportunities for code tuning. | Sensitive to feature scaling. Requires a number of hyperparameters. |
| GaussianNB (45) | Works well with high-dimensional data. Simple and fast. | Assumes that features are independent. Performance can be affected if the independence assumption is not met. |

4. Descriptive Results

In total, 7,408 people were included, with a mean age of 46.55 ± 8.89 years. Of these, 3,806 (51.38%) were male. The 5-year cumulative incidence of type 2 diabetes was 31.8 per 1,000 population (95% CI: 27.9–36.1). The characteristics of the study population are shown in Table 2.

Table 2 Baseline characteristics of the study population by gender.
| Quantitative variable | Subgroup | Mean ± SD | P-value |
|---|---|---|---|
| Age | Male | 47.29 ± 9.12 | < 0.001 |
| | Female | 45.87 ± 8.60 | |
| Diastolic blood pressure | Male | 72.91 ± 10.97 | 0.001 |
| | Female | 72.10 ± 10.88 | |
| Systolic blood pressure | Male | 107.65 ± 15.18 | < 0.001 |
| | Female | 106.28 ± 15.19 | |
| Pulse rate | Male | 70.73 ± 10.02 | < 0.001 |
| | Female | 76.06 ± 10.38 | |
| Metabolic Equivalent of Task (MET) | Male | 45.86 ± 14.42 | < 0.001 |
| | Female | 38.68 ± 6.84 | |
| Energy | Male | 3054.77 ± 1173.33 | < 0.001 |
| | Female | 2862.42 ± 1111.22 | |
| Triglyceride | Male | 132.82 ± 86.12 | < 0.001 |
| | Female | 118.91 ± 66.80 | |
| Cholesterol | Male | 178.39 ± 36.45 | < 0.001 |
| | Female | 187.96 ± 37.59 | |
| HDL | Male | 47.23 ± 14.23 | < 0.001 |
| | Female | 54.32 ± 16.53 | |
| LDL | Male | 104.56 ± 30.67 | < 0.001 |
| | Female | 109.83 ± 32.11 | |
| Waist hip ratio | Male | 1.39 ± 0.64 | < 0.001 |
| | Female | 2.84 ± 0.45 | |
| BMR | Male | 6628.84 ± 984.17 | < 0.001 |
| | Female | 5418.65 ± 628.36 | |

| Categorical variable | Category | Male N (%) | Female N (%) | P-value |
|---|---|---|---|---|
| Has diabetes | No | 3,710 (97.48) | 3,462 (96.11) | < 0.001 |
| | Yes | 96 (2.52) | 140 (3.89) | |
| BMI | 1 | 797 (20.94) | 292 (8.10) | < 0.001 |
| | 2 | 1,588 (41.72) | 1,159 (32.15) | |
| | 3 | 1,421 (37.33) | 2,151 (59.68) | |
| Smoking | No | 2,117 (55.62) | 127 (3.53) | < 0.001 |
| | Yes | 1,689 (44.38) | 3,475 (96.47) | |
| Drug users | No | 2,017 (53.00) | 3,583 (99.47) | < 0.001 |
| | Yes | 1,789 (47.00) | 19 (0.53) | |
| Marital status | 1 | 88 (2.31) | 246 (6.83) | < 0.001 |
| | 2 | 3,701 (97.24) | 2,984 (82.84) | |
| | 3 | 5 (0.13) | 309 (8.58) | |
| | 4 | 12 (0.32) | 63 (1.75) | |

5. Evaluation Metrics

In predictive modeling for conditions like diabetes, where dataset imbalance is prevalent, reliance on standard accuracy metrics can be misleading. Therefore, we employ three alternative metrics: the F1 score, AUC, and G-means. Each metric provides a distinct perspective on model performance, addressing the issues inherent in imbalanced datasets (46).

F1 Score
The F1 score, calculated as 2 * (precision * recall) / (precision + recall), reflects the model's balance between precision (the proportion of true positives out of all positive predictions) and recall (the proportion of true positives out of actual positive cases).
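To make the point about accuracy concrete, the following minimal Python sketch computes accuracy, precision, recall, F1, and G-mean from confusion-matrix counts. The class counts (236 incident cases among 7,408 participants) are taken from Table 2. The second confusion matrix is hypothetical and purely illustrative; it is not a result from this study. G-mean is computed as the square root of sensitivity times specificity.

```python
import math


def binary_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, F1 and G-mean from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0  # also called sensitivity
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    g_mean = math.sqrt(recall * specificity)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "g_mean": g_mean,
    }


# Class counts from Table 2: 96 + 140 = 236 incident cases among 7,408 participants.
CASES, TOTAL = 236, 7408

# A trivial classifier that labels every participant as non-diabetic.
all_negative = binary_metrics(tp=0, fp=0, fn=CASES, tn=TOTAL - CASES)

# HYPOTHETICAL confusion matrix (illustrative only, not a result from the study).
hypothetical = binary_metrics(tp=180, fp=300, fn=56, tn=6872)

for name, m in [("all-negative", all_negative), ("hypothetical", hypothetical)]:
    print(f"{name:>12}: accuracy={m['accuracy']:.3f}  "
          f"F1={m['f1']:.3f}  G-mean={m['g_mean']:.3f}")
```

With these counts, the all-negative classifier reaches roughly 96.8% accuracy while its F1 score and G-mean are both 0, because it never identifies a case. The hypothetical classifier has lower accuracy but a nonzero F1 and G-mean, which is the behavior these metrics are meant to reward.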
The F1 score is an important metric in medical predictions because both false negatives and false positives are costly.

AUC
AUC represents a model's ability to differentiate classes, ranging from 0.5 (no better than random) to 1 (perfect classification). It is summarized from the ROC curve, which plots sensitivity (true positive rate) against 1 - specificity (false positive rate). AUC is favored in imbalanced datasets as it is not influenced by the skew in class distribution.

Geometric Mean Score (G-means)
The G-means metric takes the square root of the product of sensitivity and specificity, capturing a model's performance on both minority and majority classes. In mathematical terms, G-means = sqrt(sensitivity * specificity). It ensures that the model is not overly biased toward the predominant class, with higher values indicating a balanced classification performance.

6. Performance Analysis

6.1. Interpretation of Correlation Matrices

The correlation matrices show the relationships between the features used for diabetes prediction across three groups: female, male, and the entire dataset. Figures 2 and 3 show that the correlation patterns between the covariates and the incidence of diabetes are the same in females and males. A high correlation was found between diastolic and systolic blood pressure, BMI and BMR, SGPT and SGOT, and LDL and cholesterol. Figure 4 shows the correlation matrix between covariates for the prediction of diabetes. High correlations were found between gender and WHR, smoking and gender, cholesterol and LDL, SGPT and SGOT, and diastolic and systolic blood pressure.

6.2. Interpretation of Feature Importance Analysis

The feature importance plots (Figs.
5, 6 and 7) provide critical insights into the most influential factors affecting the occurrence of diabetes among the groups analyzed. Below is an interpretation of the feature importance results for males, females, and the entire dataset.

Figure 5 shows all important features for predicting diabetes in females. It highlights that, in females, triglycerides (TG), BMR, cholesterol, energy intake, HDL, BMI, GGT, LDL, SGPT, SGOT, PR, MET, ALP, asset index, BUN, age, and education are the primary drivers of diabetes risk.

Figure 6 shows the important features for predicting diabetes in males. According to the results, the most important variable is BMI, followed by SGOT, GGT, ALP, socioeconomic status, TG, cholesterol, BMR, LDL, HDL-C, SGPT, MET, energy, age, BUN, and the other variables shown in the figure.

In the entire dataset, the results show that BMI is the most important variable for the prediction of diabetes. The other variables, in order, are SGOT, BMR, energy, GGT, TG, LDL, cholesterol, ALP, HDL, socioeconomic status, SGPT, and age. The remaining variables are shown in Fig. 7.

These results show that the important variables differ between males, females, and the total population. These insights can help guide targeted prevention and management strategies for diabetes based on gender-specific risk profiles.

7. Selection of Optimal Classifier-Sampling Combinations

All possible combinations of the introduced resampling methods and classifiers, shown in Fig. 1, were implemented on the study data. The acceptable outcomes are shown in Table 3. The analysis reveals that resampling techniques, particularly RandomOverSampling, SMOTE, ADASYN, and KMeansSMOTE, significantly enhance the performance of machine learning models in handling class imbalance. The Multi-Layer Perceptron (MLP) consistently demonstrates superior performance across various resampling methods, indicating its robustness and adaptability in complex data scenarios.

7.1. Impact of Resampling Techniques

RandomOverSampling
This technique, when paired with MLP, achieves the highest F1 score of 82.97 ± 2.46, along with a strong AUC of 89.25 ± 1.57 and G-Mean of 88.73 ± 1.75. The performance of RandomOverSampling with MLP suggests that simply increasing the representation of minority class examples can significantly improve model training, especially in neural network-based models. This highlights the potential of RandomOverSampling in scenarios where the model architecture can leverage the additional data without overfitting.

SMOTE and ADASYN
Both SMOTE and ADASYN show strong performance improvements, particularly with MLP and RandomForest. For instance, SMOTE with MLP results in an F1 score of 79.85 ± 3.91, AUC of 89.7 ± 2.54, and G-Mean of 89.31 ± 2.78. ADASYN with MLP achieves the highest F1 score among the ADASYN combinations, with an F1 score of 82.17 ± 3.38, AUC of 89.61 ± 2.09, and G-Mean of 89.15 ± 2.31. These techniques not only address the imbalance by generating synthetic examples but also enhance the model's ability to generalize, as evidenced by the high AUC and G-means scores. The consistent performance gains suggest that the synthetic data generated by these methods provides meaningful and varied examples that help the model better learn the decision boundaries.

KMeansSMOTE
The results achieved with KMeansSMOTE, especially with MLP, emphasize the importance of generating synthetic samples in an informed way. KMeansSMOTE with MLP achieves an F1 score of 78.33 ± 6.98, AUC of 88.25 ± 2.25, and G-Mean of 87.73 ± 2.42. By clustering the data before generating synthetic samples, KMeansSMOTE ensures that the new data points are more representative of the underlying distribution, thereby enhancing model performance in terms of both F1 score and AUC.

Table 3 The optimal classifier-sampling combination results.
| Machine learning model | Resampling technique | Step | F1 (mean ± SD) | AUC (mean ± SD) | G-means (mean ± SD) |
|---|---|---|---|---|---|
| RandomForest | SMOTEENN | before | 70.77 ± 3.57 | 77.44 ± 2.15 | 74.03 ± 2.89 |
| | | after | 78.27 ± 1.54 | 87.18 ± 1.12 | 86.47 ± 1.28 |
| MLP | SMOTEENN | before | 42.22 ± 5.39 | 64.63 ± 2.24 | 54.88 ± 3.94 |
| | | after | 71.33 ± 1.99 | 88.76 ± 1.17 | 88.52 ± 1.28 |
| Gradient Boosting | SMOTE | before | 65.77 ± 1.77 | 74.65 ± 1.01 | 70.23 ± 1.46 |
| | | after | 71.63 ± 2.42 | 85.47 ± 1.66 | 84.71 ± 1.88 |
| RandomForest | SMOTE | before | 70.77 ± 3.57 | 77.44 ± 2.15 | 74.03 ± 2.89 |
| | | after | 82.18 ± 2.76 | 85.97 ± 1.84 | 84.85 ± 2.14 |
| DecisionTreeClassifier | SMOTE | before | 62.01 ± 2.95 | 80.69 ± 1.71 | 79.33 ± 2.01 |
| | | after | 63.88 ± 2.84 | 83.59 ± 2.11 | 82.84 ± 2.38 |
| MLP | SMOTE | before | 40.60 ± 5.43 | 63.8 ± 2.3 | 53.28 ± 4.13 |
| | | after | 79.85 ± 3.91 | 89.7 ± 2.54 | 89.31 ± 2.78 |
| RandomForest | RandomOverSampling | before | 70.77 ± 3.57 | 77.44 ± 2.15 | 74.03 ± 2.89 |
| | | after | 78.56 ± 3.60 | 82.72 ± 2.51 | 80.85 ± 3.11 |
| MLP | RandomOverSampling | before | 42.25 ± 2.41 | 64.64 ± 1.17 | 54.99 ± 2.14 |
| | | after | 82.97 ± 2.46 | 89.25 ± 1.57 | 88.73 ± 1.75 |
| Gradient Boosting | ADASYN | before | 65.77 ± 1.77 | 74.65 ± 1.01 | 70.23 ± 1.46 |
| | | after | 68.23 ± 0.98 | 85.17 ± 1.29 | 84.52 ± 1.49 |
| RandomForest | ADASYN | before | 70.77 ± 3.6 | 77.44 ± 2.15 | 74.03 ± 2.89 |
| | | after | 81.24 ± 3.47 | 86.02 ± 2.42 | 84.93 ± 2.84 |
| MLP | ADASYN | before | 40.24 ± 4.98 | 63.52 ± 2.03 | 52.7 ± 3.6 |
| | | after | 82.17 ± 3.38 | 89.61 ± 2.09 | 89.15 ± 2.31 |
| Gradient Boosting | KMeansSMOTE | before | 65.77 ± 1.77 | 74.65 ± 1.01 | 70.23 ± 1.46 |
| | | after | 69.08 ± 4.15 | 77.04 ± 2.66 | 73.53 ± 3.52 |
| RandomForest | KMeansSMOTE | before | 70.77 ± 3.57 | 77.44 ± 2.15 | 74.03 ± 2.89 |
| | | after | 74.66 ± 4.36 | 79.92 ± 2.82 | 77.28 ± 3.59 |
| MLP | KMeansSMOTE | before | 38.66 ± 4.6 | 62.84 ± 1.97 | 51.32 ± 4.03 |
| | | after | 78.33 ± 6.98 | 88.25 ± 2.25 | 87.73 ± 2.42 |

7.2. Model-Specific Insights

Multi-Layer Perceptron (MLP)
MLP stands out as the most effective model across various resampling techniques, consistently achieving high F1 scores, AUC, and G-means. These results are illustrated in Fig. 8.
For example, MLP with RandomOverSampling achieves an F1 score of 82.97 ± 2.46, AUC of 89.25 ± 1.57, and G-Mean of 88.73 ± 1.75. This indicates that neural networks, with their capacity to model complex relationships, benefit significantly from balanced datasets. The adaptability of MLP to various resampling methods underscores its potential as a versatile tool in predictive modeling for imbalanced data.

RandomForest
Figure 9 indicates that the RandomForest model also demonstrates substantial improvements with resampling techniques, particularly with RandomOverSampling and ADASYN. For instance, RandomForest with ADASYN achieves an F1 score of 81.24 ± 3.47, AUC of 86.02 ± 2.42, and G-Mean of 84.93 ± 2.84. The inherent ability of RandomForest to handle variability and reduce overfitting makes it well-suited to benefit from the additional or synthetic samples provided by resampling methods.

7.3. Broader Implications

Improving Predictive Reliability
The substantial improvements in F1 score, AUC, and G-means across most resampling techniques underscore the critical role of data balancing in predictive modeling. By addressing class imbalance, these methods not only improve the accuracy of predictions but also enhance the reliability and robustness of the models. This is particularly important in medical applications where predictive accuracy can directly impact patient outcomes.

Algorithm and Technique Selection
The findings suggest a strategic approach to selecting resampling techniques and machine learning models based on the specific characteristics of the dataset and the desired outcomes. For example, in scenarios where neural networks are preferred, RandomOverSampling or SMOTE with MLP might be the optimal choice. Conversely, for tree-based models, ADASYN might offer significant performance benefits.
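As a rough aid to this kind of selection, the sketch below ranks the "after resampling" rows of Table 3 by their mean scores, overall and restricted to tree-based classifiers. It is an illustrative helper rather than part of the study's pipeline: the function name `best_combination` and the grouping `TREE_BASED` are assumptions introduced here, and the ranking uses means only, ignoring the reported standard deviations.

```python
# Mean "after resampling" scores from Table 3, in percent:
# (classifier, resampler) -> (F1, AUC, G-means). Standard deviations are omitted.
TABLE3_AFTER = {
    ("RandomForest", "SMOTEENN"): (78.27, 87.18, 86.47),
    ("MLP", "SMOTEENN"): (71.33, 88.76, 88.52),
    ("Gradient Boosting", "SMOTE"): (71.63, 85.47, 84.71),
    ("RandomForest", "SMOTE"): (82.18, 85.97, 84.85),
    ("DecisionTreeClassifier", "SMOTE"): (63.88, 83.59, 82.84),
    ("MLP", "SMOTE"): (79.85, 89.70, 89.31),
    ("RandomForest", "RandomOverSampling"): (78.56, 82.72, 80.85),
    ("MLP", "RandomOverSampling"): (82.97, 89.25, 88.73),
    ("Gradient Boosting", "ADASYN"): (68.23, 85.17, 84.52),
    ("RandomForest", "ADASYN"): (81.24, 86.02, 84.93),
    ("MLP", "ADASYN"): (82.17, 89.61, 89.15),
    ("Gradient Boosting", "KMeansSMOTE"): (69.08, 77.04, 73.53),
    ("RandomForest", "KMeansSMOTE"): (74.66, 79.92, 77.28),
    ("MLP", "KMeansSMOTE"): (78.33, 88.25, 87.73),
}

# Column positions inside each score tuple.
F1, AUC, G_MEANS = 0, 1, 2

# Hypothetical grouping used only for this illustration.
TREE_BASED = {"RandomForest", "Gradient Boosting", "DecisionTreeClassifier"}


def best_combination(metric, classifiers=None):
    """Return the (classifier, resampler) pair with the highest mean for `metric`."""
    candidates = [
        pair for pair in TABLE3_AFTER
        if classifiers is None or pair[0] in classifiers
    ]
    return max(candidates, key=lambda pair: TABLE3_AFTER[pair][metric])


best_f1 = best_combination(F1)
best_auc = best_combination(AUC)
best_tree_f1 = best_combination(F1, TREE_BASED)
best_tree_auc = best_combination(AUC, TREE_BASED)

print("Best F1 overall:      ", best_f1)
print("Best AUC overall:     ", best_auc)
print("Best F1, tree-based:  ", best_tree_f1)
print("Best AUC, tree-based: ", best_tree_auc)
```

On the means alone, MLP with RandomOverSampling has the highest F1 and MLP with SMOTE the highest AUC. Among tree-based models, RandomForest with SMOTE has the highest mean F1 (82.18, against 81.24 for ADASYN) and RandomForest with SMOTEENN the highest mean AUC. Several of these gaps are smaller than the reported standard deviations, so a ranking like this is better read as a shortlist than as a definitive ordering.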
To evaluate the performance of the combination of various resampling techniques and the MLP model, the Receiver Operating Characteristic (ROC) curves and loss functions before and after the combination are illustrated in Figs. 10 and 11, respectively. Fig. 11 shows the loss trends for the different sampling methods used with the MLP classifier, which was trained with a learning rate of 0.0005, 150 epochs, a validation split of 0.2, and a binary cross-entropy loss function. Across all sampling methods, the training and validation loss curves steadily decline and converge towards low loss values, indicating that each approach helped reduce data imbalance and enhanced the classifier's ability to learn from and generalize to the validation set. Minor spikes represent the natural fluctuations of complex learning processes. Overall, the five sampling methods improved the model's predictive accuracy and generalization, as reflected by the consistent loss trends.

8. Conclusion

This study explored the predictive power of machine learning models combined with advanced data balancing techniques to forecast diabetes incidence in an adult cohort over a 5-year period. Resampling methods like SMOTE, ADASYN, RandomOverSampling, and KMeansSMOTE effectively improved model performance, addressing the challenge of data imbalance. After sampling, most models showed enhanced predictive accuracy, particularly in F1 scores and AUC measures. RandomOverSampling with MLP and ADASYN with MLP were identified as the most effective pairings, achieving significant gains in AUC, F1, and G-means scores. Additionally, the RandomOverSampling with RandomForest combination effectively addressed class imbalance, demonstrating notable improvements in predictive performance. These findings underscore the importance of balancing techniques in medical data analysis, providing a clear pathway to develop more reliable predictive models.
Future research will focus on feature selection methods, particularly leveraging autoencoders for dimensionality reduction and feature extraction. In addition, refining algorithm-level approaches for handling imbalanced data will include integrating ensemble learning with specialized cost-sensitive classifiers that prioritize the minority class. Techniques such as hybrid ensemble methods that combine boosting and bagging, or innovative architectures like one-class neural networks, could be explored for better detection of diabetic cases. Furthermore, incorporating reinforcement learning for adaptive resampling strategies may provide a dynamic approach to data balancing. Finally, these findings can help guide targeted strategies for the prevention and control of diabetes based on gender-specific risk profiles.
Declarations
Data Availability Statement: Data are available from the corresponding author upon request.
Authors' contributions: MT, ZA, and YJH provided the main idea of the study and the methodology, performed the final analysis, developed the idea, and revised the final manuscript. MKH and MSH developed the idea, contributed to data analysis, and revised the final manuscript. ADH and GHN revised the final manuscript. All authors approved the final version of the submitted manuscript.
Conflict of interest: The authors declare that there is no conflict of interest.
Ethics approval and consent to participate: Ethical issues, including plagiarism, informed consent, misconduct, data fabrication and/or falsification, double publication and/or submission, and redundancy, were completely observed by the authors. This study was performed according to the ethical guidelines expressed in the Declaration of Helsinki and the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) guideline. The study was also approved by the Research Ethics Committee of Fasa University of Medical Sciences (IR.FUMS.REC.1402.172).
Informed consent was also waived by the Research Ethics Committee of Fasa University of Medical Sciences (IR.FUMS.REC.1402.172).
Funding: Fasa University of Medical Sciences.
Consent for publication: Not applicable.
Acknowledgments: We would also like to thank Fasa University of Medical Sciences for supporting this research.
References
Hameed I, Masoodi SR, Mir SA, Nabi M, Ghazanfar K, Ganai BA. Type 2 diabetes mellitus: from a metabolic disorder to an inflammatory condition. World J Diabetes. 2015;6(4):598.
Kaze AD, Jaar BG, Fonarow GC, Echouffo-Tcheugui JB. Diabetic kidney disease and risk of incident stroke among adults with type 2 diabetes. BMC Med. 2022;20(1):127.
Sattar N, Presslie C, Rutter MK, McGuire DK. Cardiovascular and Kidney Risks in Individuals With Type 2 Diabetes: Contemporary Understanding With Greater Emphasis on Excess Adiposity. Diabetes Care. 2024:dci230041.
Saeedi P, Petersohn I, Salpea P, Malanda B, Karuranga S, Unwin N, et al. Global and regional diabetes prevalence estimates for 2019 and projections for 2030 and 2045: Results from the International Diabetes Federation Diabetes Atlas. Diabetes Research and Clinical Practice. 2019;157:107843.
Safiri S, Karamzad N, Kaufman JS, Bell AW, Nejadghaderi SA, Sullman MJ, et al. Prevalence, deaths and disability-adjusted-life-years (DALYs) due to type 2 diabetes and its attributable risk factors in 204 countries and territories, 1990–2019: results from the global burden of disease study 2019. Front Endocrinol. 2022;13:838027.
Dagliati A, Marini S, Sacchi L, Cogni G, Teliti M, Tibollo V, et al. Machine learning methods to predict diabetes complications. J Diabetes Sci Technol. 2018;12(2):295–302.
Alghamdi T. Prediction of diabetes complications using computational intelligence techniques. Appl Sci. 2023;13(5):3030.
Dutta A, Hasan MK, Ahmad M, Awal MA, Islam MA, Masud M, et al. Early prediction of diabetes using an ensemble of machine learning models. Int J Environ Res Public Health. 2022;19(19):12378.
Shin J, Kim J, Lee C, Yoon JY, Kim S, Song S, et al. Development of various diabetes prediction models using machine learning techniques. Diabetes Metabolism J. 2022;46(4):650.
Lyra S, Leonhardt S, Antink CH, editors. Early prediction of sepsis using random forest classification for imbalanced clinical data. 2019 Computing in Cardiology (CinC); 2019: IEEE.
He H, Garcia EA. Learning from imbalanced data. IEEE Trans Knowl Data Eng. 2009;21(9):1263–84.
López V, Fernández A, Moreno-Torres JG, Herrera F. Analysis of preprocessing vs. cost-sensitive learning for imbalanced classification. Open problems on intrinsic data characteristics. Expert Syst Appl. 2012;39(7):6585–608.
Kumar M, Sheshadri H. On the classification of imbalanced datasets. Int J Comput Appl. 2012;44(8):1–7.
Sun Y, Wong AK, Kamel MS. Classification of imbalanced data: A review. Int J Pattern Recognit Artif Intell. 2009;23(04):687–719.
Chawla NV, Japkowicz N, Kotcz A. Special issue on learning from imbalanced data sets. ACM SIGKDD Explorations Newsl. 2004;6(1):1–6.
Liu Q, Zhang M, He Y, Zhang L, Zou J, Yan Y, et al. Predicting the risk of incident type 2 diabetes mellitus in Chinese elderly using machine learning techniques. J Personalized Med. 2022;12(6):905.
Sadeghi S, Khalili D, Ramezankhani A, Mansournia MA, Parsaeian M. Diabetes mellitus risk prediction in the presence of class imbalance using flexible machine learning methods. BMC Med Inf Decis Mak. 2022;22(1):36.
Hassan MM, Amiri N. Classification of imbalanced data of diabetes disease using machine learning algorithms. Age (years). 2019;21(81):3324.
Khushi M, Shaukat K, Alam TM, Hameed IA, Uddin S, Luo S, et al. A comparative performance analysis of data resampling methods on imbalance medical data. IEEE Access. 2021;9:109960–75.
Kumar MS, Khan MZ, Rajendran S, Noor A, Dass AS, Prabhu J. Imbalanced classification in diabetics using ensembled machine learning. Computers Mater Continua. 2022;72(3):4397–409.
Awe OO, Ojumu JB, Ayanwoye GA, Ojumoola JS, Dias R. Machine Learning Approaches for Handling Imbalances in Health Data Classification. Sustainable Statistical and Data Science Methods and Practices: Reports from LISA 2020 Global Network, Ghana, 2022: Springer; 2024. pp. 375–91.
Nugraha W, Maulana R, Latifah L, Rahayuningsih PA, Nurmalasari N, editors. Over-sampling strategies with data cleaning for handling imbalanced problems for diabetes prediction. AIP Conference Proceedings; 2023: AIP Publishing.
Hairani H, Priyanto D. A new approach of hybrid sampling SMOTE and ENN to the accuracy of machine learning methods on unbalanced diabetes disease data. 2023;14(8):585–890.
Homayounfar R, Farjam M, Bahramali E, Sharafi M, Poustchi H, Malekzadeh R, et al. Cohort Profile: The Fasa Adults Cohort Study (FACS): a prospective study of non-communicable diseases risks. Int J Epidemiol. 2023;52(3):e172–8.
Farjam M, Bahrami H, Bahramali E, Jamshidi J, Askari A, Zakeri H, et al. A cohort study protocol to analyze the predisposing factors to common chronic non-communicable diseases in rural areas: Fasa Cohort Study. BMC Public Health. 2016;16:1–8.
Ahuja V, Aronen P, Pramodkumar TA, Looker H, Chetrit A, Bloigu AH, et al. Accuracy of 1-Hour Plasma Glucose During the Oral Glucose Tolerance Test in Diagnosis of Type 2 Diabetes in Adults: A Meta-analysis. Diabetes Care. 2021;44(4):1062–9.
Shantal M, Othman Z, Bakar AA. A Novel Approach for Data Feature Weighting Using Correlation Coefficients and Min–Max Normalization. Symmetry. 2023;15(12):2185.
Chowdhury MM, Ayon RS, Hossain MS. An investigation of machine learning algorithms and data augmentation techniques for diabetes diagnosis using class imbalanced BRFSS dataset. Healthc Analytics. 2024;5:100297.
Yang C, Fridgeirsson EA, Kors JA, Reps JM, Rijnbeek PR. Impact of random oversampling and random undersampling on the performance of prediction models developed using observational health data. J Big Data. 2024;11(1):7.
Ramezankhani A, Pournik O, Shahrabi J, Azizi F, Hadaegh F, Khalili D. The impact of oversampling with SMOTE on the performance of 3 classifiers in prediction of type 2 diabetes. Med Decis Making. 2016;36(1):137–44.
He H, Bai Y, Garcia EA, Li S, editors. ADASYN: Adaptive synthetic sampling approach for imbalanced learning. 2008 IEEE international joint conference on neural networks (IEEE world congress on computational intelligence); 2008: IEEE.
Mohanty MN. Advances in intelligent computing and communication. Springer; 2021.
Douzas G, Bacao F, Last F. Improving imbalanced learning through a heuristic oversampling method based on k-means and SMOTE. Inf Sci. 2018;465:1–20.
Sharma A, Singh PK, Chandra R. SMOTified-GAN for class imbalanced pattern classification problems. IEEE Access. 2022;10:30655–65.
Muntasir Nishat M, Faisal F, Jahan Ratul I, Al-Monsur A, Ar-Rafi AM, Nasrullah SM, et al. A comprehensive investigation of the performances of different machine learning classifiers with SMOTE-ENN oversampling technique and hyperparameter optimization for imbalanced heart failure dataset. Sci Program. 2022;2022:1–17.
Wang Z, Wu C, Zheng K, Niu X, Wang X. SMOTETomek-based resampling for personality recognition. IEEE Access. 2019;7:129678–89.
Tu JV. Advantages and disadvantages of using artificial neural networks versus logistic regression for predicting medical outcomes. J Clin Epidemiol. 1996;49(11):1225–31.
Imandoust SB, Bolandraftar M. Application of k-nearest neighbor (knn) approach for predicting economic events: Theoretical background. Int J Eng Res Appl. 2013;3(5):605–10.
Burbidge R, Buxton B. An introduction to support vector machines for data mining. Keynote papers, young OR12. 2001:3–15.
Kalcheva N, Todorova M, Marinova G, editors. Naive Bayes Classifier, Decision Tree and AdaBoost Ensemble Algorithm–Advantages and Disadvantages. Proceedings of the 6th ERAZ Conference Proceedings (part of ERAZ conference collection), Online; 2020.
Aria M, Cuccurullo C, Gnasso A. A comparison among interpretative proposals for Random Forests. Mach Learn Appl. 2021;6:100094.
Hao L, Huang G. An improved AdaBoost algorithm for identification of lung cancer based on electronic nose. Heliyon. 2023;9(3).
Ahn JM, Kim J, Kim K. Ensemble machine learning of gradient boosting (XGBoost, LightGBM, CatBoost) and attention-based CNN-LSTM for harmful algal blooms forecasting. Toxins. 2023;15(10):608.
Elmogy AM, Tariq U, Ammar M, Ibrahim A. Fake reviews detection using supervised machine learning. Int J Adv Comput Sci Appl. 2021;12(1).
Singh SK, Taylor RW, Pradhan B, Shirzadi A, Pham BT. Predicting sustainable arsenic mitigation using machine learning techniques. Ecotoxicol Environ Saf. 2022;232:113271.
Susan S, Kumar A. The balancing trick: Optimized sampling of imbalanced datasets—A brief survey of the recent State of the Art. Eng Rep. 2021;3(4):e12298.
Additional Declarations: No competing interests reported.
Status: Published. Journal publication published 27 Sep, 2024 in BMC Medical Research Methodology (https://doi.org/10.1186/s12874-024-02341-z). Editorial history of Version 1 at the journal: first submitted 20 Jul, 2024; submission checks completed, editor assigned, and reviewers invited 24 Jul, 2024; reviewers agreed 24 to 26 Jul, 2024; reviews received 30 Jul, 04 Aug, and 12 Aug, 2024; editorial decision (revision requested) 16 Aug, 2024.
The primary goal of these techniques is to rebalance the dataset to allow equitable learning from both classes (\u003cspan citationid=\"CR28\" class=\"CitationRef\"\u003e28\u003c/span\u003e).\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec10\" class=\"Section2\"\u003e \u003ch2\u003e3.5. Overview of Sampling Approaches\u003c/h2\u003e \u003cp\u003eIn the sections that follow, we will delve into various sampling methods, such as oversampling, undersampling, and hybrid approaches. These strategies are critical for mitigating the challenges associated with imbalanced datasets in diabetes classification.\u003c/p\u003e \u003cdiv id=\"Sec11\" class=\"Section3\"\u003e \u003ch2\u003e3.5.1. Oversampling Methods\u003c/h2\u003e \u003cp\u003e \u003col\u003e \u003cspan\u003e \u003cli\u003e \u003cp\u003e \u003cb\u003eRandom Over Sampling\u003c/b\u003e: This method involves randomly duplicating examples from the minority class to balance the class distribution. Although straightforward, it may result in overfitting by potentially replicating noise in the data (\u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e29\u003c/span\u003e).\u003c/p\u003e \u003c/li\u003e \u003c/span\u003e \u003c/ol\u003e \u003c/p\u003e \u003cp\u003e \u003col\u003e \u003cspan\u003e \u003cli\u003e \u003cp\u003e \u003cb\u003eSMOTE (Synthetic Minority Over-sampling Technique)\u003c/b\u003e: Developed by Chawla et al.,
SMOTE creates synthetic samples by interpolating between neighboring minority class samples, helping to avoid overfitting by expanding the decision region for the minority class (\u003cspan citationid=\"CR30\" class=\"CitationRef\"\u003e30\u003c/span\u003e).\u003c/p\u003e \u003c/li\u003e \u003c/span\u003e \u003cspan\u003e \u003cli\u003e \u003cp\u003e \u003cb\u003eADASYN (Adaptive Synthetic Sampling Approach)\u003c/b\u003e: This technique, introduced by He et al., focuses on generating synthetic data for minority samples that are difficult to classify, enhancing model generalization (\u003cspan citationid=\"CR31\" class=\"CitationRef\"\u003e31\u003c/span\u003e).\u003c/p\u003e \u003c/li\u003e \u003c/span\u003e \u003cspan\u003e \u003cli\u003e \u003cp\u003e \u003cb\u003eBorderline SMOTE\u003c/b\u003e: This variant targets minority class samples near the decision boundary, aiming to improve classification on challenging cases by generating synthetic samples along the borderline (\u003cspan citationid=\"CR32\" class=\"CitationRef\"\u003e32\u003c/span\u003e).\u003c/p\u003e \u003c/li\u003e \u003c/span\u003e \u003cspan\u003e \u003cli\u003e \u003cp\u003e \u003cb\u003eKMeans SMOTE\u003c/b\u003e: By integrating K-Means clustering with SMOTE, this method clusters the minority class before applying SMOTE within each cluster to produce more contextually relevant synthetic samples (\u003cspan citationid=\"CR33\" class=\"CitationRef\"\u003e33\u003c/span\u003e).\u003c/p\u003e \u003c/li\u003e \u003c/span\u003e \u003cspan\u003e \u003cli\u003e \u003cp\u003e \u003cb\u003eSmotified GAN (SMOTE and GAN)\u003c/b\u003e: This approach combines Generative Adversarial Networks (GANs) with SMOTE to produce realistic synthetic samples, enhancing the diversity and authenticity of the data (\u003cspan citationid=\"CR34\" class=\"CitationRef\"\u003e34\u003c/span\u003e).\u003c/p\u003e \u003c/li\u003e \u003c/span\u003e \u003c/ol\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec12\" 
class=\"Section3\"\u003e \u003ch2\u003e3.5.2. Hybrid Sampling Methods\u003c/h2\u003e \u003cp\u003eHybrid sampling strategies merge the benefits of both oversampling and undersampling to balance the dataset while minimizing information loss and reducing the risk of overfitting.\u003c/p\u003e \u003cp\u003e \u003col\u003e \u003cspan\u003e \u003cli\u003e \u003cp\u003e \u003cb\u003eSMOTEENN (SMOTE\u0026thinsp;+\u0026thinsp;Edited Nearest Neighbors)\u003c/b\u003e: This method pairs SMOTE with the undersampling technique Edited Nearest Neighbors (ENN), which cleans the resampled data by removing instances whose class label disagrees with that of their nearest neighbors (\u003cspan citationid=\"CR35\" class=\"CitationRef\"\u003e35\u003c/span\u003e).\u003c/p\u003e \u003c/li\u003e \u003c/span\u003e \u003cspan\u003e \u003cli\u003e \u003cp\u003e \u003cb\u003eSMOTETomek\u003c/b\u003e: Similar to SMOTEENN, SMOTETomek combines SMOTE with Tomek links to identify and eliminate close sample pairs from opposing classes, clarifying the decision boundary between classes (\u003cspan citationid=\"CR36\" class=\"CitationRef\"\u003e36\u003c/span\u003e).\u003c/p\u003e \u003c/li\u003e \u003c/span\u003e \u003c/ol\u003e \u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv id=\"Sec13\" class=\"Section2\"\u003e \u003ch2\u003e3.6. Classifier Selection Post-Sampling for Diabetes Prediction\u003c/h2\u003e \u003cp\u003eWith the dataset now balanced, we proceed to select suitable machine learning classifiers. As Table\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e illustrates, each algorithm has strengths and challenges that, when carefully paired with the rebalanced data, can yield a more accurate prediction model for diabetes. Table\u0026nbsp;\u003cspan refid=\"Tab3\" class=\"InternalRef\"\u003e3\u003c/span\u003e delineates the various classifiers considered for this study, post-sampling. 
It aids in understanding the potential impact of each classifier's advantages and disadvantages within the balanced data context, guiding our selection process towards the most effective algorithm for predicting the incidence of diabetes.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab1\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eStrengths and challenges of each classification algorithm.\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"3\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eAlgorithm\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eAdvantages\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eDisadvantages\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eLogistic Regression\u003c/p\u003e \u003cp\u003e(\u003cspan citationid=\"CR37\" class=\"CitationRef\"\u003e37\u003c/span\u003e)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eSimple to implement and interpret. Efficient to train. Good for binary classification.\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eAssumes a linear relationship between the predictors and the log-odds of the outcome. 
Not suitable for complex relationships.\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eK-Nearest Neighbors (KNN)\u003c/p\u003e \u003cp\u003e(\u003cspan citationid=\"CR38\" class=\"CitationRef\"\u003e38\u003c/span\u003e)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eNo assumption about data. Simple and effective. Adaptable to any type of data.\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eComputationally expensive. Performance depends on the number of dimensions.\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eSupport Vector Machine (SVM) (\u003cspan citationid=\"CR39\" class=\"CitationRef\"\u003e39\u003c/span\u003e)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eEffective in high dimensional spaces. Memory efficient. Versatile with kernel functions.\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eRequires careful parameter tuning. Not suitable for large datasets.\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eDecision Tree Classifier\u003c/p\u003e \u003cp\u003e(\u003cspan citationid=\"CR40\" class=\"CitationRef\"\u003e40\u003c/span\u003e)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eEasy to interpret and visualize. Can handle both numerical and categorical data.\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eProne to overfitting. 
Can become unstable with small variations in data.\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eRandomForestClassifier\u003c/p\u003e \u003cp\u003e(\u003cspan citationid=\"CR41\" class=\"CitationRef\"\u003e41\u003c/span\u003e)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eHandles overfitting well. Works well on large datasets. Provides feature importances.\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eCan be slow to predict. Complex and difficult to interpret.\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eAdaBoostClassifier\u003c/p\u003e \u003cp\u003e(\u003cspan citationid=\"CR42\" class=\"CitationRef\"\u003e42\u003c/span\u003e)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eImproves classification accuracy. Flexible to combine with any learning algorithm.\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eSensitive to noisy data and outliers. Can overfit on very complex datasets.\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eGradient Boosting\u003c/p\u003e \u003cp\u003e(\u003cspan citationid=\"CR43\" class=\"CitationRef\"\u003e43\u003c/span\u003e)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eHighly effective and flexible. Can optimize on different loss functions.\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eProne to overfitting without proper tuning. 
Time-consuming to train.\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eSGDClassifier\u003c/p\u003e \u003cp\u003e(\u003cspan citationid=\"CR44\" class=\"CitationRef\"\u003e44\u003c/span\u003e)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eEfficient for large-scale problems. Easy to implement and provides a lot of opportunities for code tuning.\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eSensitive to feature scaling. Requires tuning of several hyperparameters.\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eGaussianNB\u003c/p\u003e \u003cp\u003e(\u003cspan citationid=\"CR45\" class=\"CitationRef\"\u003e45\u003c/span\u003e)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eWorks well with high-dimensional data. Simple and fast.\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eAssumes that features are independent. Performance can be affected if the independence assumption is not met.\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003c/div\u003e"},{"header":"4. Descriptive results:","content":"\u003cp\u003eIn total, 7,408 people were included, with a mean age of 46.55\u0026thinsp;\u0026plusmn;\u0026thinsp;8.89 years. Of these, 3,806 (51.38%) were male. The 5-year cumulative incidence of type 2 diabetes was 31.8 per 1,000 population (95% CI: 27.9\u0026ndash;36.1). 
The characteristics of the study population are shown in Table\u0026nbsp;\u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e2\u003c/span\u003e.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab2\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 2\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eBaseline characteristics of the study population by gender.\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"4\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eQuantitative Variable\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eSubgroup\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eMean\u0026thinsp;\u0026plusmn;\u0026thinsp;SD\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eP-value\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003eAge\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eMale\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e47.29\u0026thinsp;\u0026plusmn;\u0026thinsp;9.12\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003e\u0026lt;\u0026thinsp;0.001\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" 
colname=\"c2\"\u003e \u003cp\u003eFemale\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e45.87\u0026thinsp;\u0026plusmn;\u0026thinsp;8.60\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003eDiastolic blood pressure\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eMale\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e72.91\u0026thinsp;\u0026plusmn;\u0026thinsp;10.97\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003e0.001\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eFemale\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e72.10\u0026thinsp;\u0026plusmn;\u0026thinsp;10.88\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003eSystolic blood pressure\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eMale\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e107.65\u0026thinsp;\u0026plusmn;\u0026thinsp;15.18\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003e\u0026lt;\u0026thinsp;0.001\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eFemale\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e106.28\u0026thinsp;\u0026plusmn;\u0026thinsp;15.19\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003ePulse rate\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" 
colname=\"c2\"\u003e \u003cp\u003eMale\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e70.73\u0026thinsp;\u0026plusmn;\u0026thinsp;10.02\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003e\u0026lt;\u0026thinsp;0.001\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eFemale\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e76.06\u0026thinsp;\u0026plusmn;\u0026thinsp;10.38\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003eMetabolic Equivalent of Task (MET)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eMale\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e45.86\u0026thinsp;\u0026plusmn;\u0026thinsp;14.42\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003e\u0026lt;\u0026thinsp;0.001\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eFemale\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e38.68\u0026thinsp;\u0026plusmn;\u0026thinsp;6.84\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003eEnergy\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eMale\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e3054.77\u0026thinsp;\u0026plusmn;\u0026thinsp;1173.33\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003e\u0026lt;\u0026thinsp;0.001\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e 
\u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eFemale\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e2862.42\u0026thinsp;\u0026plusmn;\u0026thinsp;1111.22\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003eTriglyceride\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eMale\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e132.82\u0026thinsp;\u0026plusmn;\u0026thinsp;86.12\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003e\u0026lt;\u0026thinsp;0.001\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eFemale\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e118.91\u0026thinsp;\u0026plusmn;\u0026thinsp;66.80\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003eCholesterol\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eMale\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e178.39\u0026thinsp;\u0026plusmn;\u0026thinsp;36.45\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003e\u0026lt;\u0026thinsp;0.001\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eFemale\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e187.96\u0026thinsp;\u0026plusmn;\u0026thinsp;37.59\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003eHDL\u003c/p\u003e \u003c/td\u003e 
\u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eMale\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e47.23\u0026thinsp;\u0026plusmn;\u0026thinsp;14.23\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003e\u0026lt;\u0026thinsp;0.001\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eFemale\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e54.32\u0026thinsp;\u0026plusmn;\u0026thinsp;16.53\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003eLDL\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eMale\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e104.56\u0026thinsp;\u0026plusmn;\u0026thinsp;30.67\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003e\u0026lt;\u0026thinsp;0.001\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eFemale\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e109.83\u0026thinsp;\u0026plusmn;\u0026thinsp;32.11\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003eWaist hip ratio\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eMale\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e1.39\u0026thinsp;\u0026plusmn;\u0026thinsp;0.64\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003e\u0026lt;\u0026thinsp;0.001\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e 
\u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eFemale\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e2.84\u0026thinsp;\u0026plusmn;\u0026thinsp;0.45\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003eBMR\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eMale\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e6628.84\u0026thinsp;\u0026plusmn;\u0026thinsp;984.17\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003e\u0026lt;\u0026thinsp;0.001\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eFemale\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e5418.65\u0026thinsp;\u0026plusmn;\u0026thinsp;628.36\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eCategorical Variable\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u003cb\u003eMale\u003c/b\u003e\u003c/p\u003e \u003cp\u003e\u003cb\u003eN(%)\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e\u003cb\u003eFemale\u003c/b\u003e\u003c/p\u003e \u003cp\u003e\u003cb\u003eN(%)\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e\u003cb\u003eP-value\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eHave a diabetes\u003c/b\u003e\u003c/p\u003e \u003cp\u003eNo\u003c/p\u003e \u003cp\u003eyes\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e 
\u003cp\u003e3,710(97.48)\u003c/p\u003e \u003cp\u003e96(2.52)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e3,462(96.11)\u003c/p\u003e \u003cp\u003e140(3.89)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e\u0026lt;\u0026thinsp;0.001\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eBMI\u003c/b\u003e\u003c/p\u003e \u003cp\u003e1\u003c/p\u003e \u003cp\u003e2\u003c/p\u003e \u003cp\u003e3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e797(20.94)\u003c/p\u003e \u003cp\u003e1,588(41.72)\u003c/p\u003e \u003cp\u003e1,421(37.33)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e292(8.10)\u003c/p\u003e \u003cp\u003e1,159(32.15)\u003c/p\u003e \u003cp\u003e2,151(59.68)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e\u0026lt;\u0026thinsp;0.001\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eSmoking\u003c/b\u003e\u003c/p\u003e \u003cp\u003eNo\u003c/p\u003e \u003cp\u003eYes\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e2,117(55.62)\u003c/p\u003e \u003cp\u003e1,689(44.38)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e127(3.53)\u003c/p\u003e \u003cp\u003e3,475(96.47)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e\u0026lt;\u0026thinsp;0.001\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eDrug users\u003c/b\u003e\u003c/p\u003e \u003cp\u003eNo\u003c/p\u003e \u003cp\u003eYes\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e2,017(53.00)\u003c/p\u003e 
\u003cp\u003e1,789(47.00)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e3,583(99.47)\u003c/p\u003e \u003cp\u003e19(0.53)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e\u0026lt;\u0026thinsp;0.001\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eMarital status\u003c/b\u003e\u003c/p\u003e \u003cp\u003e1\u003c/p\u003e \u003cp\u003e2\u003c/p\u003e \u003cp\u003e3\u003c/p\u003e \u003cp\u003e4\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e88(2.31)\u003c/p\u003e \u003cp\u003e3,701(97.24)\u003c/p\u003e \u003cp\u003e5(0.13)\u003c/p\u003e \u003cp\u003e12(0.32)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e246(6.83)\u003c/p\u003e \u003cp\u003e2,984 (82.84)\u003c/p\u003e \u003cp\u003e309(8.58)\u003c/p\u003e \u003cp\u003e63(1.75)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e\u0026lt;\u0026thinsp;0.001\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e"},{"header":"5. Evaluation Metrics","content":"\u003cp\u003eIn predictive modeling for conditions like diabetes where dataset imbalance is prevalent, reliance on standard accuracy metrics can be misleading. Therefore, we employ three alternative metrics: the F1 score, AUC, and G-means. 
Each metric provides a distinct perspective on model performance, addressing the issues inherent in imbalanced datasets (\u003cspan citationid=\"CR46\" class=\"CitationRef\"\u003e46\u003c/span\u003e).\u003c/p\u003e \u003cp\u003e \u003cstrong\u003eF1 Score\u003c/strong\u003e \u003cp\u003eThe F1 score, calculated as 2 * (precision * recall) / (precision\u0026thinsp;+\u0026thinsp;recall), reflects the model's balance between precision (the proportion of true positives out of all positive predictions) and recall (the proportion of true positives out of actual positive cases). It's an important metric in medical predictions due to the high cost of false negatives and positives.\u003c/p\u003e \u003c/p\u003e \u003cp\u003e \u003cstrong\u003eAUC\u003c/strong\u003e \u003cp\u003eAUC represents a model's ability to differentiate classes, ranging from 0.5 (no better than random) to 1 (perfect classification). It is summarized from the ROC curve, which plots sensitivity (true positive rate) against 1-specificity (false positive rate). AUC is favored in imbalanced datasets as it is not influenced by the skew in class distribution.\u003c/p\u003e \u003c/p\u003e \u003cp\u003e \u003cstrong\u003eGeometric Mean Score (G-means)\u003c/strong\u003e \u003cp\u003eThe G-means metric takes the square root of the product of sensitivity and specificity, effectively capturing a model's performance on both minority and majority classes. In mathematical terms, G-means\u0026thinsp;=\u0026thinsp;sqrt(sensitivity * specificity). It ensures that the model is not overly biased toward the predominant class, with higher values indicating a balanced classification performance.\u003c/p\u003e \u003c/p\u003e"},{"header":"6. Performance Analysis","content":"\u003cdiv id=\"Sec17\" class=\"Section2\"\u003e \u003ch2\u003e6.1. 
Interpretation of Correlation Matrices\u003c/h2\u003e \u003cp\u003eThe correlation matrices show the relationships between the features used for diabetes prediction in three groups: females, males, and the entire dataset. Below is an analysis of how these relationships manifest and what they reveal about diabetes prediction.\u003c/p\u003e \u003cp\u003eFigure \u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e and Figure 3 show that the correlations between the covariates and the incidence of diabetes are the same in females and males. In both groups, a high correlation was found between diastolic and systolic blood pressure, BMI and BMR, SGPT and SGOT, and LDL and cholesterol.\u003c/p\u003e \u003cp\u003eFigure \u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003e shows the correlation matrix between covariates for the prediction of diabetes. High correlations were found between gender and WHR, smoking and gender, cholesterol and LDL, SGPT and SGOT, and diastolic and systolic blood pressure.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec18\" class=\"Section2\"\u003e \u003ch2\u003e6.2. Interpretation of Feature Importance Analysis\u003c/h2\u003e \u003cp\u003eThe feature importance plots (Figs.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003e,\u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e6\u003c/span\u003e and \u003cspan refid=\"Fig7\" class=\"InternalRef\"\u003e7\u003c/span\u003e) provide critical insights into the most influential factors affecting the occurrence of diabetes among the different groups analyzed. 
Below is an interpretation of the feature importance results for males, females, and the entire dataset.\u003c/p\u003e \u003cp\u003eFigure \u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003e shows all important features for predicting diabetes in females. This highlights that, for females, high levels of triglycerides (TG), BMR, cholesterol, energy intake, HDL, BMI, GGT, LDL, SGPT, SGOT, PR, MET, ALP, assert index, BUN, age, and education are the primary drivers of diabetes risk.\u003c/p\u003e \u003cp\u003eFigure \u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e6\u003c/span\u003e indicates the important features that predict diabetes in males. According to the results, the most important variable is BMI, followed by SGOT, GGT, ALP, socioeconomic status, TG, cholesterol, BMR, LDL, HDL-C, SGPT, MET, energy intake, age, BUN, and the others shown in this figure.\u003c/p\u003e \u003cp\u003eIn the entire dataset, the results show that BMI is the most important variable for the prediction of diabetes. The next most important variables are, in order, SGOT, BMR, energy intake, GGT, TG, LDL, cholesterol, ALP, HDL, socioeconomic status, SGPT, and age. The remaining variables are shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig7\" class=\"InternalRef\"\u003e7\u003c/span\u003e.\u003c/p\u003e \u003cp\u003eThese results highlight that the important variables differ among males, females, and the total population. These insights can help guide targeted prevention and management strategies for diabetes based on gender-specific risk profiles.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e"},{"header":"7. Selection of Optimal Classifier-Sampling Combinations","content":"\u003cp\u003eAll possible combinations of the introduced resampling methods and classifiers, shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e, were implemented on the study data. 
The acceptable outcomes are shown in Table\u0026nbsp;\u003cspan refid=\"Tab3\" class=\"InternalRef\"\u003e3\u003c/span\u003e.\u003c/p\u003e \u003cp\u003eThe analysis reveals that resampling techniques, particularly RandomOverSampling, SMOTE, ADASYN, and KMeansSMOTE, significantly enhance the performance of machine learning models in handling class imbalance. The Multi-Layer Perceptron (MLP) consistently demonstrates superior performance across various resampling methods, indicating its robustness and adaptability in complex data scenarios.\u003c/p\u003e \u003cdiv id=\"Sec20\" class=\"Section2\"\u003e \u003ch2\u003e7.1. Impact of Resampling Techniques\u003c/h2\u003e \u003cp\u003e \u003cstrong\u003eRandomOverSampling\u003c/strong\u003e \u003cp\u003eThis technique, particularly when paired with MLP, achieves the highest F1 score of 82.97\u0026thinsp;\u0026plusmn;\u0026thinsp;2.46, along with a strong AUC of 89.25\u0026thinsp;\u0026plusmn;\u0026thinsp;1.57 and G-Mean of 88.73\u0026thinsp;\u0026plusmn;\u0026thinsp;1.75. The exceptional performance of RandomOverSampling with MLP suggests that simply increasing the representation of minority class examples can significantly improve model training, especially in neural network-based models. This highlights the potential of RandomOverSampling in scenarios where the model architecture can effectively leverage the additional data without overfitting.\u003c/p\u003e \u003c/p\u003e \u003cp\u003e \u003cstrong\u003eSMOTE and ADASYN\u003c/strong\u003e \u003cp\u003eBoth SMOTE and ADASYN show strong performance improvements, particularly with MLP and RandomForest. For instance, SMOTE with MLP results in an F1 score of 79.85\u0026thinsp;\u0026plusmn;\u0026thinsp;3.91, AUC of 89.7\u0026thinsp;\u0026plusmn;\u0026thinsp;2.54, and G-Mean of 89.31\u0026thinsp;\u0026plusmn;\u0026thinsp;2.78. 
ADASYN with MLP achieves the highest F1 score among the ADASYN combinations, with an F1 score of 82.17\u0026thinsp;\u0026plusmn;\u0026thinsp;3.38, AUC of 89.61\u0026thinsp;\u0026plusmn;\u0026thinsp;2.09, and G-Mean of 89.15\u0026thinsp;\u0026plusmn;\u0026thinsp;2.31. These techniques not only address the imbalance by generating synthetic examples but also enhance the model's ability to generalize, as evidenced by the high AUC and G-means scores. The consistent performance gains suggest that the synthetic data generated by these methods provides meaningful and varied examples that help the model better understand the decision boundaries.\u003c/p\u003e \u003c/p\u003e \u003cp\u003e \u003cstrong\u003eKMeansSMOTE\u003c/strong\u003e \u003cp\u003eThe robust results achieved with KMeansSMOTE, especially with MLP, emphasize the importance of intelligently generating synthetic samples. KMeansSMOTE with MLP achieves an F1 score of 78.33\u0026thinsp;\u0026plusmn;\u0026thinsp;6.98, AUC of 88.25\u0026thinsp;\u0026plusmn;\u0026thinsp;2.25, and G-Mean of 87.73\u0026thinsp;\u0026plusmn;\u0026thinsp;2.42. 
By clustering the data before generating synthetic samples, KMeansSMOTE ensures that the new data points are more representative of the underlying distribution, thereby enhancing model performance in terms of both F1 score and AUC.\u003c/p\u003e \u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab3\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 3\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eThe optimal classifier sampling combination results.\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"6\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\"\u0026plusmn;\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\"\u0026plusmn;\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\"\u0026plusmn;\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eMachine Learning Models\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eResampling techniques\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003estep\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003ef1\u003c/p\u003e \u003cp\u003e(mean\u0026thinsp;\u0026plusmn;\u0026thinsp;sd)\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003eAUC\u003c/p\u003e \u003cp\u003e(mean\u0026thinsp;\u0026plusmn;\u0026thinsp;sd)\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" 
colname=\"c6\"\u003e \u003cp\u003eG-means\u003c/p\u003e \u003cp\u003e(mean\u0026thinsp;\u0026plusmn;\u0026thinsp;sd)\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003e\u003cb\u003eRandomForest\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003eSMOTEENN\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003ebefore\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c4\"\u003e \u003cp\u003e70.77\u0026thinsp;\u0026plusmn;\u0026thinsp;3.57\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c5\"\u003e \u003cp\u003e77.44\u0026thinsp;\u0026plusmn;\u0026thinsp;2.15\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c6\"\u003e \u003cp\u003e74.03\u0026thinsp;\u0026plusmn;\u0026thinsp;2.89\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eafter\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c4\"\u003e \u003cp\u003e78.27\u0026thinsp;\u0026plusmn;\u0026thinsp;1.54\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c5\"\u003e \u003cp\u003e87.18\u0026thinsp;\u0026plusmn;\u0026thinsp;1.12\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c6\"\u003e \u003cp\u003e86.47\u0026thinsp;\u0026plusmn;\u0026thinsp;1.28\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003e\u003cb\u003eMLP\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003eSMOTEENN\u003c/p\u003e \u003c/td\u003e \u003ctd 
align=\"left\" colname=\"c3\"\u003e \u003cp\u003ebefore\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c4\"\u003e \u003cp\u003e42.22\u0026thinsp;\u0026plusmn;\u0026thinsp;5.39\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c5\"\u003e \u003cp\u003e64.63\u0026thinsp;\u0026plusmn;\u0026thinsp;2.24\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c6\"\u003e \u003cp\u003e54.88\u0026thinsp;\u0026plusmn;\u0026thinsp;3.94\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eafter\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c4\"\u003e \u003cp\u003e71.33\u0026thinsp;\u0026plusmn;\u0026thinsp;1.99\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c5\"\u003e \u003cp\u003e88.76\u0026thinsp;\u0026plusmn;\u0026thinsp;1.17\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c6\"\u003e \u003cp\u003e88.52\u0026thinsp;\u0026plusmn;\u0026thinsp;1.28\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003e\u003cb\u003eGradient Boosting\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003eSMOTE\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003ebefore\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c4\"\u003e \u003cp\u003e65.77\u0026thinsp;\u0026plusmn;\u0026thinsp;1.77\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c5\"\u003e \u003cp\u003e74.65\u0026thinsp;\u0026plusmn;\u0026thinsp;1.01\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c6\"\u003e 
\u003cp\u003e70.23\u0026thinsp;\u0026plusmn;\u0026thinsp;1.46\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eafter\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c4\"\u003e \u003cp\u003e71.63\u0026thinsp;\u0026plusmn;\u0026thinsp;2.42\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c5\"\u003e \u003cp\u003e85.47\u0026thinsp;\u0026plusmn;\u0026thinsp;1.66\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c6\"\u003e \u003cp\u003e84.71\u0026thinsp;\u0026plusmn;\u0026thinsp;1.88\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003e\u003cb\u003eRandomForest\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003eSMOTE\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003ebefore\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c4\"\u003e \u003cp\u003e70.77\u0026thinsp;\u0026plusmn;\u0026thinsp;3.57\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c5\"\u003e \u003cp\u003e77.44\u0026thinsp;\u0026plusmn;\u0026thinsp;2.15\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c6\"\u003e \u003cp\u003e74.03\u0026thinsp;\u0026plusmn;\u0026thinsp;2.89\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eafter\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c4\"\u003e \u003cp\u003e82.18\u0026thinsp;\u0026plusmn;\u0026thinsp;2.76\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c5\"\u003e 
\u003cp\u003e85.97\u0026thinsp;\u0026plusmn;\u0026thinsp;1.84\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c6\"\u003e \u003cp\u003e84.85\u0026thinsp;\u0026plusmn;\u0026thinsp;2.14\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003e\u003cb\u003eDecisionTreeClassifier\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003eSMOTE\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003ebefore\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c4\"\u003e \u003cp\u003e62.01\u0026thinsp;\u0026plusmn;\u0026thinsp;2.95\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c5\"\u003e \u003cp\u003e80.69\u0026thinsp;\u0026plusmn;\u0026thinsp;1.71\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c6\"\u003e \u003cp\u003e79.33\u0026thinsp;\u0026plusmn;\u0026thinsp;2.01\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eafter\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c4\"\u003e \u003cp\u003e63.88\u0026thinsp;\u0026plusmn;\u0026thinsp;2.84\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c5\"\u003e \u003cp\u003e83.59\u0026thinsp;\u0026plusmn;\u0026thinsp;2.11\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c6\"\u003e \u003cp\u003e82.84\u0026thinsp;\u0026plusmn;\u0026thinsp;2.38\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003e\u003cb\u003eMLP\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\" morerows=\"1\" 
rowspan=\"2\"\u003e \u003cp\u003eSMOTE\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003ebefore\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c4\"\u003e \u003cp\u003e40.60\u0026thinsp;\u0026plusmn;\u0026thinsp;5.43\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c5\"\u003e \u003cp\u003e63.8\u0026thinsp;\u0026plusmn;\u0026thinsp;2.3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c6\"\u003e \u003cp\u003e53.28\u0026thinsp;\u0026plusmn;\u0026thinsp;4.13\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eafter\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c4\"\u003e \u003cp\u003e79.85\u0026thinsp;\u0026plusmn;\u0026thinsp;3.91\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c5\"\u003e \u003cp\u003e89.7\u0026thinsp;\u0026plusmn;\u0026thinsp;2.54\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c6\"\u003e \u003cp\u003e89.31\u0026thinsp;\u0026plusmn;\u0026thinsp;2.78\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003e\u003cb\u003eRandomForest\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003eRandomOverSampling\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003ebefore\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c4\"\u003e \u003cp\u003e70.77\u0026thinsp;\u0026plusmn;\u0026thinsp;3.57\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c5\"\u003e \u003cp\u003e77.44\u0026thinsp;\u0026plusmn;\u0026thinsp;2.15\u003c/p\u003e \u003c/td\u003e \u003ctd 
align=\"char\" char=\"\u0026plusmn;\" colname=\"c6\"\u003e \u003cp\u003e74.03\u0026thinsp;\u0026plusmn;\u0026thinsp;2.89\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eafter\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c4\"\u003e \u003cp\u003e78.56\u0026thinsp;\u0026plusmn;\u0026thinsp;3.60\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c5\"\u003e \u003cp\u003e82.72\u0026thinsp;\u0026plusmn;\u0026thinsp;2.51\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c6\"\u003e \u003cp\u003e80.85\u0026thinsp;\u0026plusmn;\u0026thinsp;3.11\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003e\u003cb\u003eMLP\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003eRandomOverSampling\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003ebefore\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c4\"\u003e \u003cp\u003e42.25\u0026thinsp;\u0026plusmn;\u0026thinsp;2.41\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c5\"\u003e \u003cp\u003e64.64\u0026thinsp;\u0026plusmn;\u0026thinsp;1.17\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c6\"\u003e \u003cp\u003e54.99\u0026thinsp;\u0026plusmn;\u0026thinsp;2.14\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eafter\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c4\"\u003e \u003cp\u003e82.97\u0026thinsp;\u0026plusmn;\u0026thinsp;2.46\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c5\"\u003e 
\u003cp\u003e89.25\u0026thinsp;\u0026plusmn;\u0026thinsp;1.57\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c6\"\u003e \u003cp\u003e88.73\u0026thinsp;\u0026plusmn;\u0026thinsp;1.75\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003e\u003cb\u003eGradient Boosting\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003eADASYN\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003ebefore\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c4\"\u003e \u003cp\u003e65.77\u0026thinsp;\u0026plusmn;\u0026thinsp;1.77\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c5\"\u003e \u003cp\u003e74.65\u0026thinsp;\u0026plusmn;\u0026thinsp;1.01\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c6\"\u003e \u003cp\u003e70.23\u0026thinsp;\u0026plusmn;\u0026thinsp;1.46\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eafter\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c4\"\u003e \u003cp\u003e68.23\u0026thinsp;\u0026plusmn;\u0026thinsp;0.98\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c5\"\u003e \u003cp\u003e85.17\u0026thinsp;\u0026plusmn;\u0026thinsp;1.29\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c6\"\u003e \u003cp\u003e84.52\u0026thinsp;\u0026plusmn;\u0026thinsp;1.49\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003e\u003cb\u003eRandomForest\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\" morerows=\"1\" 
rowspan=\"2\"\u003e \u003cp\u003eADASYN\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003ebefore\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c4\"\u003e \u003cp\u003e70.77\u0026thinsp;\u0026plusmn;\u0026thinsp;3.6\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c5\"\u003e \u003cp\u003e77.44\u0026thinsp;\u0026plusmn;\u0026thinsp;2.15\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c6\"\u003e \u003cp\u003e74.03\u0026thinsp;\u0026plusmn;\u0026thinsp;2.89\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eafter\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c4\"\u003e \u003cp\u003e81.24\u0026thinsp;\u0026plusmn;\u0026thinsp;3.47\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c5\"\u003e \u003cp\u003e86.02\u0026thinsp;\u0026plusmn;\u0026thinsp;2.42\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c6\"\u003e \u003cp\u003e84.93\u0026thinsp;\u0026plusmn;\u0026thinsp;2.84\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003e\u003cb\u003eMLP\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003eADASYN\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003ebefore\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c4\"\u003e \u003cp\u003e40.24\u0026thinsp;\u0026plusmn;\u0026thinsp;4.98\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c5\"\u003e \u003cp\u003e63.52\u0026thinsp;\u0026plusmn;\u0026thinsp;2.03\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" 
char=\"\u0026plusmn;\" colname=\"c6\"\u003e \u003cp\u003e52.7\u0026thinsp;\u0026plusmn;\u0026thinsp;3.6\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eafter\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c4\"\u003e \u003cp\u003e82.17\u0026thinsp;\u0026plusmn;\u0026thinsp;3.38\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c5\"\u003e \u003cp\u003e89.61\u0026thinsp;\u0026plusmn;\u0026thinsp;2.09\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c6\"\u003e \u003cp\u003e89.15\u0026thinsp;\u0026plusmn;\u0026thinsp;2.31\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003e\u003cb\u003eGradient Boosting\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003eKMeansSMOTE\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003ebefore\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c4\"\u003e \u003cp\u003e65.77\u0026thinsp;\u0026plusmn;\u0026thinsp;1.77\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c5\"\u003e \u003cp\u003e74.65\u0026thinsp;\u0026plusmn;\u0026thinsp;1.01\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c6\"\u003e \u003cp\u003e70.23\u0026thinsp;\u0026plusmn;\u0026thinsp;1.46\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eafter\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c4\"\u003e \u003cp\u003e69.08\u0026thinsp;\u0026plusmn;\u0026thinsp;4.15\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c5\"\u003e 
\u003cp\u003e77.04\u0026thinsp;\u0026plusmn;\u0026thinsp;2.66\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c6\"\u003e \u003cp\u003e73.53\u0026thinsp;\u0026plusmn;\u0026thinsp;3.52\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003e\u003cb\u003eRandomForest\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003eKMeansSMOTE\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003ebefore\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c4\"\u003e \u003cp\u003e70.77\u0026thinsp;\u0026plusmn;\u0026thinsp;3.57\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c5\"\u003e \u003cp\u003e77.44\u0026thinsp;\u0026plusmn;\u0026thinsp;2.15\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c6\"\u003e \u003cp\u003e74.03\u0026thinsp;\u0026plusmn;\u0026thinsp;2.89\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eafter\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c4\"\u003e \u003cp\u003e74.66\u0026thinsp;\u0026plusmn;\u0026thinsp;4.36\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c5\"\u003e \u003cp\u003e79.92\u0026thinsp;\u0026plusmn;\u0026thinsp;2.82\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c6\"\u003e \u003cp\u003e77.28\u0026thinsp;\u0026plusmn;\u0026thinsp;3.59\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003e\u003cb\u003eMLP\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\" morerows=\"1\" 
rowspan=\"2\"\u003e \u003cp\u003eKMeansSMOTE\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003ebefore\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c4\"\u003e \u003cp\u003e38.66\u0026thinsp;\u0026plusmn;\u0026thinsp;4.6\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c5\"\u003e \u003cp\u003e62.84\u0026thinsp;\u0026plusmn;\u0026thinsp;1.97\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c6\"\u003e \u003cp\u003e51.32\u0026thinsp;\u0026plusmn;\u0026thinsp;4.03\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eafter\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c4\"\u003e \u003cp\u003e78.33\u0026thinsp;\u0026plusmn;\u0026thinsp;6.98\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c5\"\u003e \u003cp\u003e88.25\u0026thinsp;\u0026plusmn;\u0026thinsp;2.25\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c6\"\u003e \u003cp\u003e87.73\u0026thinsp;\u0026plusmn;\u0026thinsp;2.42\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec21\" class=\"Section2\"\u003e \u003ch2\u003e7.2. Model-Specific Insights\u003c/h2\u003e \u003cp\u003e \u003cstrong\u003eMulti-Layer Perceptron (MLP)\u003c/strong\u003e \u003cp\u003eMLP stands out as the most effective model across various resampling techniques, consistently achieving high F1 scores, AUC, and G-means. These results are illustrated in Fig.\u0026nbsp;\u003cspan refid=\"Fig8\" class=\"InternalRef\"\u003e8\u003c/span\u003e. 
For example, MLP with RandomOverSampling achieves an F1 score of 82.97\u0026thinsp;\u0026plusmn;\u0026thinsp;2.46, AUC of 89.25\u0026thinsp;\u0026plusmn;\u0026thinsp;1.57, and G-Mean of 88.73\u0026thinsp;\u0026plusmn;\u0026thinsp;1.75. This indicates that neural networks, with their capacity to model complex relationships, benefit significantly from balanced datasets. The adaptability of MLP to various resampling methods underscores its potential as a versatile tool in predictive modeling for imbalanced data.\u003c/p\u003e \u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003cstrong\u003eRandomForest\u003c/strong\u003e \u003cp\u003eFigure\u0026nbsp;\u003cspan refid=\"Fig9\" class=\"InternalRef\"\u003e9\u003c/span\u003e indicates that the RandomForest model also demonstrates substantial improvements with resampling techniques, particularly with RandomOverSampling and ADASYN. For instance, RandomForest with ADASYN achieves an F1 score of 81.24\u0026thinsp;\u0026plusmn;\u0026thinsp;3.47, AUC of 86.02\u0026thinsp;\u0026plusmn;\u0026thinsp;2.42, and G-Mean of 84.93\u0026thinsp;\u0026plusmn;\u0026thinsp;2.84. The inherent ability of RandomForest to handle variability and reduce overfitting makes it well-suited to benefit from the additional or synthetic samples provided by resampling methods.\u003c/p\u003e \u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec22\" class=\"Section2\"\u003e \u003ch2\u003e7.3. Broader Implications\u003c/h2\u003e \u003cp\u003e \u003cstrong\u003eImproving Predictive Reliability\u003c/strong\u003e \u003cp\u003eThe substantial improvements in F1 score, AUC, and G-means across most resampling techniques underscore the critical role of data balancing in predictive modeling. By addressing class imbalance, these methods not only improve the accuracy of predictions but also enhance the reliability and robustness of the models. 
This is particularly important in medical applications where predictive accuracy can directly impact patient outcomes.\u003c/p\u003e \u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003cstrong\u003eAlgorithm and Technique Selection\u003c/strong\u003e \u003cp\u003eThe findings suggest a strategic approach to selecting resampling techniques and machine learning models based on the specific characteristics of the dataset and the desired outcomes. For example, in scenarios where neural networks are preferred, RandomOverSampling or SMOTE with MLP might be the optimal choice. Conversely, for tree-based models, ADASYN might offer significant performance benefits.\u003c/p\u003e \u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eTo evaluate the MLP model in combination with the various resampling techniques, the Receiver Operating Characteristic (ROC) curves and loss curves before and after resampling are illustrated in Figs.\u0026nbsp;\u003cspan refid=\"Fig10\" class=\"InternalRef\"\u003e10\u003c/span\u003e and \u003cspan refid=\"Fig11\" class=\"InternalRef\"\u003e11\u003c/span\u003e, respectively.\u003c/p\u003e \u003cp\u003eFigure\u0026nbsp;\u003cspan refid=\"Fig11\" class=\"InternalRef\"\u003e11\u003c/span\u003e shows the loss trends for the different sampling methods used with the MLP classifier, which was trained with a learning rate of 0.0005, 150 epochs, a validation split of 0.2, and a binary cross-entropy loss function. Across all sampling methods, the training and validation loss curves steadily decline and converge towards low loss values, indicating that each approach helped reduce data imbalance and enhanced the classifier's ability to learn from and generalize to the validation set. Minor spikes reflect the normal fluctuations of the training process. 
Overall, the five sampling methods improved the model\u0026rsquo;s predictive accuracy and generalization, as reflected by the consistent loss trends.\u003c/p\u003e \u003c/div\u003e"},{"header":"8. Conclusion","content":"\u003cp\u003eThis study explored the predictive power of machine learning models combined with advanced data balancing techniques to forecast diabetes incidence in an adult cohort over a 5-year period. Resampling methods such as SMOTE, ADASYN, RandomOverSampling, and KMeansSMOTE effectively improved model performance, addressing the challenge of data imbalance.\u003c/p\u003e \u003cp\u003eAfter resampling, most models showed enhanced predictive accuracy, particularly in F1 scores and AUC measures. RandomOverSampling with MLP and ADASYN with MLP were identified as the most effective pairings, achieving significant gains in AUC, F1, and G-means scores. Additionally, the RandomOverSampling with RandomForest combination effectively addressed class imbalance, demonstrating notable improvements in predictive performance.\u003c/p\u003e \u003cp\u003eThese findings underscore the importance of balancing techniques in medical data analysis, providing a clear pathway to develop more reliable predictive models. Future research will focus on feature selection methods, particularly leveraging autoencoders for dimensionality reduction and feature extraction. Refining algorithm-level approaches for handling imbalanced data will also include integrating ensemble learning with specialized cost-sensitive classifiers that prioritize the minority class. Techniques such as hybrid ensemble methods that combine boosting and bagging, or innovative architectures like one-class neural networks, could be explored for better detection of diabetic cases. Furthermore, incorporating reinforcement learning for adaptive resampling strategies may provide a dynamic approach to data balancing. 
These findings can also help guide targeted strategies for the prevention and control of diabetes based on gender-specific risk profiles.\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eData Availability Statement:\u0026nbsp;\u003c/strong\u003eData are available from the corresponding author on request.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAuthors\u0026apos; contributions: MT, ZA\u0026nbsp;\u003c/strong\u003eand\u003cstrong\u003e\u0026nbsp;YJH:\u0026nbsp;\u003c/strong\u003eproviding the main idea of the study and methodology,\u003cstrong\u003e\u0026nbsp;\u003c/strong\u003econducting the final analysis, developing the idea, and revising the final manuscript. \u003cstrong\u003eMKH\u0026nbsp;\u003c/strong\u003eand \u003cstrong\u003eMSH:\u0026nbsp;\u003c/strong\u003edeveloping the idea, contributing to data analysis, and revising the final manuscript. \u003cstrong\u003eADH\u0026nbsp;\u003c/strong\u003eand\u003cstrong\u003e\u0026nbsp;GHN\u003c/strong\u003e revised the final manuscript. All\u0026nbsp;authors approved the submitted final version of the manuscript.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eConflict of interest:\u0026nbsp;\u003c/strong\u003eThe authors declare that there is no conflict of interest.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eEthics approval and consent to participate:\u0026nbsp;\u003c/strong\u003eEthical issues including plagiarism, informed consent, misconduct, data fabrication and/or falsification, double publication and/or submission, redundancy, etc. were fully observed by the authors. This study was performed according to the ethical guidelines expressed in the Declaration of Helsinki and the Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) guideline. 
The study was also approved by the\u0026nbsp;Research Ethics Committee of Fasa University of Medical Sciences (IR.FUMS.REC.1402.172).\u0026nbsp;Informed consent was also waived by the Research Ethics Committee of Fasa University of Medical Sciences (IR.FUMS.REC.1402.172).\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eFunding:\u003c/strong\u003e\u003cstrong\u003e\u0026nbsp;\u003c/strong\u003eFasa University of Medical Sciences.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eConsent for publication:\u0026nbsp;\u003c/strong\u003eNot applicable.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAcknowledgments\u003c/strong\u003e\u003cstrong\u003e:\u0026nbsp;\u003c/strong\u003eWe would also like to thank Fasa University of Medical Sciences for supporting this research.\u003c/p\u003e"}
],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":false,"highlight":"","institution":"","isAcceptedByJournal":true,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"bmc-medical-research-methodology","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"bmrm","sideBox":"Learn more about [BMC Medical Research Methodology](http://bmcmedresmethodol.biomedcentral.com/)","snPcode":"","submissionUrl":"https://www.editorialmanager.com/bmrm/default.aspx","title":"BMC Medical Research Methodology","twitterHandle":"BMC_series","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"em","reportingPortfolio":"BMC 
Series","inReviewEnabled":true,"inReviewRevisionsEnabled":true},"keywords":"Imbalanced datasets, Diabetes prediction, Machine learning, Artificial intelligence, Data-level method, Algorithm-level method","lastPublishedDoi":"10.21203/rs.3.rs-4772777/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-4772777/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003ch2\u003eBackground\u003c/h2\u003e \u003cp\u003eImbalanced datasets pose significant challenges in predictive modeling, leading to biased outcomes and reduced model reliability. This study addresses data imbalance in diabetes prediction using machine learning techniques. Utilizing data from the Fasa Adult Cohort Study (FACS) with a 5-year follow-up of 10,000 participants, we developed predictive models for Type 2 diabetes.\u003c/p\u003e\u003ch2\u003eMethods\u003c/h2\u003e \u003cp\u003eWe employed various data-level and algorithm-level interventions, including SMOTE, ADASYN, SMOTEENN and KMeans SMOTE, paired with Random Forest, Gradient Boosting, and Multi-Layer Perceptron (MLP). Performance was evaluated using F1 score, AUC, and G-means.\u003c/p\u003e\u003ch2\u003eResults\u003c/h2\u003e \u003cp\u003eOur results show that ADASYN with MLP achieved an F1 score of 82.17\u0026thinsp;\u0026plusmn;\u0026thinsp;3.38, AUC of 89.61\u0026thinsp;\u0026plusmn;\u0026thinsp;2.09, and G-means of 89.15\u0026thinsp;\u0026plusmn;\u0026thinsp;2.31. SMOTE with MLP followed closely with an F1 score of 79.85\u0026thinsp;\u0026plusmn;\u0026thinsp;3.91, AUC of 89.7\u0026thinsp;\u0026plusmn;\u0026thinsp;2.54, and G-means of 89.31\u0026thinsp;\u0026plusmn;\u0026thinsp;2.78. 
The SMOTEENN with Random Forest combination achieved an F1 score of 78.27\u0026thinsp;\u0026plusmn;\u0026thinsp;1.54, AUC of 87.18\u0026thinsp;\u0026plusmn;\u0026thinsp;1.12, and G-means of 86.47\u0026thinsp;\u0026plusmn;\u0026thinsp;1.28.\u003c/p\u003e\u003ch2\u003eConclusion\u003c/h2\u003e \u003cp\u003eThese combinations effectively address class imbalance, improving the accuracy and reliability of diabetes predictions. The findings highlight the importance of using appropriate data-balancing techniques in medical data analysis.\u003c/p\u003e","manuscriptTitle":"Predicting Diabetes in Adults: Identifying Important Features in Unbalanced Data Over a 5-Year Cohort Study Using Machine Learning Algorithm","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2024-08-27 11:38:22","doi":"10.21203/rs.3.rs-4772777/v1","editorialEvents":[{"type":"communityComments","content":0},{"type":"decision","content":"Revision requested","date":"2024-08-16T09:14:23+00:00","index":"","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2024-08-12T07:42:48+00:00","index":"hide","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2024-08-05T03:32:41+00:00","index":"hide","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2024-07-30T16:25:05+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"99495903744375761447612699550539420371","date":"2024-07-27T03:54:26+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"63016044496731631223442012518272640488","date":"2024-07-26T13:15:19+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"60696277950743245478191931467579150279","date":"2024-07-25T19:05:22+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"142431661919129419540375005989905046972","date":"2024-07-25T17:39:39+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"69852970193278203352678642459014416767","date":"2024-07-25T05:32:56+00:0
0","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"280685211397260074290836413832332633288","date":"2024-07-25T04:13:23+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"338246986828925671427984401971158549054","date":"2024-07-25T03:53:22+00:00","index":"hide","fulltext":""},{"type":"reviewersInvited","content":"","date":"2024-07-25T03:50:42+00:00","index":"","fulltext":""},{"type":"editorAssigned","content":"","date":"2024-07-24T18:13:33+00:00","index":"","fulltext":""},{"type":"checksComplete","content":"","date":"2024-07-24T18:12:48+00:00","index":"","fulltext":""},{"type":"submitted","content":"BMC Medical Research Methodology","date":"2024-07-20T10:57:13+00:00","index":"","fulltext":""}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"bmc-medical-research-methodology","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"bmrm","sideBox":"Learn more about [BMC Medical Research Methodology](http://bmcmedresmethodol.biomedcentral.com/)","snPcode":"","submissionUrl":"https://www.editorialmanager.com/bmrm/default.aspx","title":"BMC Medical Research Methodology","twitterHandle":"BMC_series","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"em","reportingPortfolio":"BMC Series","inReviewEnabled":true,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"ae610c8a-e935-471e-ac7f-1e026d9e8fbd","owner":[],"postedDate":"August 27th, 2024","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"published-in-journal","subjectAreas":[],"tags":[],"updatedAt":"2024-09-30T16:10:42+00:00","versionOfRecord":{"articleIdentity":"rs-4772777","link":"https://doi.org/10.1186/s12874-024-02341-z","journal":{"identity":"bmc-medical-research-methodology","isVorOnly":false,"title":"BMC Medical Research Methodology"},"publishedOn":"2024-09-27 15:57:50","publishedOnDateReadable":"September 27th, 
2024"},"versionCreatedAt":"2024-08-27 11:38:22","video":"","vorDoi":"10.1186/s12874-024-02341-z","vorDoiUrl":"https://doi.org/10.1186/s12874-024-02341-z","workflowStages":[]},"version":"v1","identity":"rs-4772777","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-4772777","identity":"rs-4772777","version":["v1"]},"buildId":"qtupq5eGEP_6zYnWcrvyt","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}
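
The discussion above reports F1 score and G-mean for classifiers trained before and after resampling. As a minimal sketch, and not the authors' code (which does not appear in this text), the following pure-Python example illustrates random oversampling of the minority class and the two metrics. It assumes G-mean is the geometric mean of sensitivity and specificity, which is the usual definition. The function names `random_oversample` and `f1_and_gmean` and the toy labels are hypothetical.

```python
import math
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate rows of smaller classes at random until every class
    has as many rows as the largest class (hypothetical helper)."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        pool = [row for row, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_out.append(rng.choice(pool))
            y_out.append(label)
    return X_out, y_out


def f1_and_gmean(y_true, y_pred, positive=1):
    """Return (F1, G-mean) for a binary problem.
    G-mean is assumed to be sqrt(sensitivity * specificity)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0       # sensitivity
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return f1, math.sqrt(recall * specificity)


if __name__ == "__main__":
    # Toy data: four non-diabetic rows (0) and one diabetic row (1).
    X = [[0], [1], [2], [3], [4]]
    y = [0, 0, 0, 0, 1]
    X_bal, y_bal = random_oversample(X, y)
    print(Counter(y_bal))
    print(f1_and_gmean([1, 1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 1]))
```

In practice, resampling is applied to the training split only, so that duplicated or synthetic rows never leak into the data used to compute these metrics.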



License: CC-BY-4.0