Predicting Diabetes Mellitus using Conditional Tabular Generative Adversarial Networks combined with MLP based on Body Composition Data

doi:10.21203/rs.3.rs-7344799/v1

2025 · doi:10.21203/rs.3.rs-7344799/v1

preprint OA: closed

Javad Hassannataj Joloudari, Mohammad Maftoun, Mohammad Ali Nematollahi, and 4 more

This is a preprint; it has not been peer reviewed by a journal. This work is licensed under a CC BY 4.0 License. Status: Published. Journal publication published 10 Dec, 2025 in Scientific Reports. Version 1 posted 18

Abstract

Accurately assessing the risk of diabetes is essential for early intervention and effective management. This study explores the potential of Machine Learning (ML) and Deep Learning (DL) models to analyze body composition measurements as predictors for diabetes screening.
We begin by preprocessing the dataset: handling missing values, encoding categorical variables, and classifying features to prepare the data for modeling. To enhance the dataset and improve model generalization, we apply Conditional Tabular Generative Adversarial Networks (CTGAN) for data augmentation. The dataset is then split using stratified five-fold cross-validation to ensure balanced and reliable evaluation. We evaluate ten ML models: Multilayer Perceptron (MLP), Gradient Boosting, Random Forest, Logistic Regression, Decision Tree, LightGBM, TabNet, XGBoost, AdaBoost, and Linear Discriminant Analysis (LDA). The proposed approach, which integrates CTGAN-based augmentation with these models, achieves strong predictive results. Among the models tested, MLP performs best, reaching an accuracy of 93.91%, AUROC of 93.87%, precision of 94.48%, recall of 93.87%, F1 score of 93.89%, Matthews Correlation Coefficient of 88.34%, and geometric mean of 93.71%. These results indicate that the combined methodology captures complex relationships within body composition data and can support clinical decision-making in diabetes risk assessment. Future work may integrate additional clinical parameters to further enhance prediction accuracy and applicability in real-world settings.

Subject areas: Biological sciences/Computational biology and bioinformatics; Health sciences/Diseases; Health sciences/Health care; Physical sciences/Mathematics and computing; Health sciences/Medical research

Keywords: Diabetes prediction; Body composition data; CTGAN; Machine learning; Deep learning; Multilayer perceptron

1. Introduction

Diabetes is a chronic metabolic condition where the body fails to effectively regulate blood sugar levels due to insufficient insulin production or insulin resistance [ 1 ].
Its prevalence continues to increase due to sedentary lifestyles, unhealthy diets, and genetic predispositions. Diabetes is classified mainly into three types: Type 1 (T1D), Type 2 (T2D), and Gestational Diabetes (GD). T1D generally develops from pancreatic beta cell loss, mainly in children [ 2 , 3 ]. T2D arises when the body cannot produce enough insulin to maintain normal blood sugar levels, primarily affecting adults [ 4 ]. GD is a temporary condition during pregnancy that usually resolves after delivery, although it may increase the risk of T2D later in life [ 5 ]. T2D accounts for 90% of all diabetes cases. Symptoms vary according to the type and severity, but common ones include polyuria, polydipsia, polyphagia, weight loss, blurred vision, and slow-healing wounds [ 6 ]. According to the International Diabetes Federation (IDF) [ 7 ] and the World Health Organization (WHO) [ 8 ], 537 million adults (20–79 years) had diabetes in 2021. This figure is projected to increase to 643 million by 2030 and 783 million by 2045 [ 7 ]. In 2021, around 6.7 million deaths were attributed to diabetes, making it a major cause of mortality. The economic burden is vast, with healthcare costs exceeding $ 966 billion—a 316% rise over the past 15 years [ 8 ]. Additionally, 44.7% of diabetic adults remain undiagnosed, resulting in severe complications due to delayed care. Early detection is thus vital to prevent complications like heart disease, kidney failure, nerve damage, and blindness. Traditional diagnostic methods are often time-consuming and prone to subjective errors. Advances in medical research and Machine Learning (ML) have significantly improved prediction models [ 9 ]. These techniques enable deeper analysis of complex healthcare data, revealing patterns that conventional methods might miss [ 10 ]. ML models can enhance diagnostic accuracy and support timely interventions, especially as access to medical data grows. 
As a result, ML is now central to building automated systems capable of identifying individuals at diabetes risk [ 11 ]. Traditional ML techniques like Artificial Neural Network (ANN) and Support Vector Machine (SVM) are relatively simple and interpretable. However, the complex nature of medical data (nonlinearity, missing values, and class imbalance) limits their effectiveness [ 12 ]. Despite the promising results from Nematollahi et al., challenges persist in improving prediction accuracy, particularly when multiple features interact in complex ways [ 13 ]. Furthermore, using a single model may lead to overfitting on noisy or skewed datasets. To overcome this, ensemble learning techniques have gained popularity. By combining the strengths of models like gradient boosting for outlier handling and k-nearest neighbors for detecting local trends, ensemble methods offer a more robust and adaptable diagnostic approach.

Building on this foundation, the present study explores both machine learning and deep learning models for predicting type 2 diabetes using body composition data. By integrating preprocessing techniques with data augmentation via Conditional Tabular Generative Adversarial Networks (CTGAN), our approach aims to overcome common challenges such as missing data and class imbalance. We systematically evaluated a diverse set of models to identify the most effective strategies for capturing complex patterns within the data. This combined methodology aims to improve predictive accuracy as well as the robustness and generalizability of the results, supporting more reliable clinical decision support tools in diabetes risk assessment.
The main contributions of this work are:

1) We used real-world data from a subset of the Fasa Cohort Study in Iran, despite challenges such as outliers and class imbalance.
2) Outliers were managed using min-max feature scaling.
3) To address class imbalance, we applied a CTGAN.
4) Ten classification models were evaluated using 5-fold cross-validation: Multilayer Perceptron (MLP), gradient boosting, random forest, LightGBM, TabNet, Extreme Gradient Boosting (XGBoost), AdaBoost, logistic regression, decision tree, and Linear Discriminant Analysis (LDA).
5) Among all, MLP achieved the highest accuracy of 93.91%, outperforming existing models.

The rest of the paper is organized as follows: Section 2 reviews conventional and advanced diabetes prediction methods. Section 3 details the materials and methodology. Section 4 presents experimental results, and Section 5 concludes with contributions and future research directions.

2. Related work

Several studies have explored diabetes prevalence and prediction using various traditional and advanced methods. Guariguata et al. employed Logistic Regression (LR) on data from 565 sources stored in a MySQL database, highlighting advantages such as simplicity, adaptability, and reproducibility [ 14 ]. However, their method lacked age-specific estimates and did not consider lifestyle or obesity factors. Peña et al. incorporated anthropometric and biochemical measurements, physical activity, and diet to estimate type 2 diabetes prevalence in Mexico City, accounting for crucial lifestyle factors but unable to establish strong diet-exercise relationships due to sample limitations [ 15 ]. Lee and Kim investigated T2D risk in Korean adults using anthropometry, Waist Circumference (WC), and triglycerides (TG) with binary LR and Naive Bayes (NB) and 10-fold cross-validation; however, reliance on raw WC and TG values limited generalizability and causal interpretation [ 16 ]. Zhu et al.
examined abdominal fat distribution using CT imaging and LR but did not analyze fat distribution among healthy individuals [ 17 ]. Elgendy et al. introduced an explainable two-stage ensemble combining Local Outlier Factor (LOF), autoencoders, Synthetic Minority Over-sampling Technique (SMOTE), and shapley additive explanations (SHAP) on the MIMIC-IV dataset, achieving 92.54% accuracy [ 18 ]. Additionally, they proposed a graph-based framework modeling relationships among patients with similar conditions through a patient network, employing centrality measures and demographic features to train multiple classifiers. The Random Forest model performed best, with area under the ROC curve (AUC) values between 0.79 and 0.91, demonstrating the value of latent structural information in prediction. Building on this, in [ 19 ], a classification pipeline was developed using a weighted ensemble of five machine learning models, including Decision Tree (DT), Random Forest (RF), XGBoost (XGB), and Light Gradient Boosting Machines (LightGBM). Key preprocessing steps involved missing value imputation, feature selection, and hyperparameter tuning via grid search. This ensemble achieved an accuracy of 73.5% and an AUC of 0.832, showing substantial performance improvement over individual models. Naseem et al. implemented a patient health monitoring system utilizing six machine learning models, including both traditional and deep learning approaches [ 20 ]. Among them, the Recurrent Neural Network (RNN) achieved the highest accuracy (81%), while the Artificial Neural Network (ANN) delivered the highest recall. This system was designed to assist early diagnosis of chronic diseases by leveraging ML-based decision support. Nematollahi et al. further explored diabetes prediction on Fasa cohort data, applying XGBoost with oversampling via ADASYN to address class imbalance, achieving 89.96% accuracy [ 21 ]. Moreover, a notable recent study by Nematollahi et al. 
examined the association between body fat distribution and diabetes using machine learning and Analysis of Variance (ANOVA) [ 13 ]. By combining individual classifiers and ensemble learning, alongside ADASYN oversampling, they reached an impressive 92.04% accuracy using XGBoost. To improve classification approaches, authors in [ 22 ] proposed a hybrid Support Vector Machine (SVM) kernel based on Radial Basis Function (RBF) and city-block metrics. They addressed class imbalance with SMOTE and data quality via median imputation. The model achieved 87% precision, underscoring its potential for clinical diagnostics of T2D. In [ 23 ], an ensemble-based AI system was developed employing Harmony Search for feature selection and hyperparameter optimization. Tested on both Western and Eastern medical datasets, the system attained 93.09% accuracy on the PIMA dataset, demonstrating efficient model complexity reduction while maintaining strong predictive power. The KE-CNN model introduced in [ 24 ] applied a hybrid deep learning approach combining medical entity recognition with semantic knowledge expansion. Utilizing tools such as Bidirectional Encoder Representations from Transformers-Bidirectional Long Short-Term Memory-Conditional Random Fields (BERT-BiLSTM-CRF) and Word2Vec, this model captured richer feature representations, improving diabetes prediction. Its dual-channel CNN framework further enhanced accuracy by integrating structured and unstructured data inputs. Furthermore, Mushtaq et al. proposed a voting ensemble classifier composed of Naive Bayes (NB), Random Forest (RF), and Gradient Boosting models to mitigate outliers and class imbalance in diabetes datasets [ 25 ]. Data preprocessing included Tomek links for cleaning, SMOTE for balancing, and Interquartile Range (IQR)-based outlier removal. The ensemble achieved up to 82% accuracy, indicating reliability for early-stage diabetes detection. Nurzari et al. 
applied modified SMOTE and Random Forest classifiers, reaching an outstanding 99.7% accuracy [ 26 ]. However, despite the high accuracy reported in studies using SMOTE [ 26 , 27 ], these methods suffer from limited sample diversity and issues near class boundaries. Our model addresses these limitations by employing conditional tabular generative adversarial networks, representing a key contribution. Additionally, while most prior works rely on a single classifier, our approach evaluates different models, enabling a more comprehensive comparison for T2D prediction on body composition data. The proposed approach demonstrated outstanding performance across several evaluation metrics, achieving an accuracy of 93.91%.

3. Material and methods

This section describes the methodological framework adopted in the present study. The workflow of the proposed method is summarized in Fig. 1. As illustrated, the process begins with dataset importation into Google Colab, followed by a series of data preprocessing steps including handling missing values, data cleaning, and categorical encoding. Subsequently, features are normalized or standardized to ensure consistent scaling across predictors. In the next stage, synthetic data are generated using CTGAN to enrich the dataset and address data imbalance issues. The dataset is then partitioned into training and testing subsets, with model evaluation conducted using stratified 5-fold cross-validation. Finally, a range of machine learning and deep learning classifiers are trained and compared, and their performance is assessed through established evaluation metrics to identify the most effective model for diabetes prediction.

To enhance clarity, this section is organized into the following subsections: Data Source and Ethical Considerations (3.1), Data Preprocessing (3.2), Synthetic Data Generation with CTGAN (3.3), Model Training and Validation (3.4), and Classifiers (3.5). The pseudocode of the proposed approach is presented below.
Begin
1. Data Collection & Preprocessing
   - Load dataset (subset of Fasa Cohort Study)
   - Inspect data for missing values
   - Handle missing values (e.g., imputation/removal)
2. Data Cleaning & Encoding
   - Convert categorical features into numerical form (e.g., One-Hot Encoding / Label Encoding)
   - Detect and handle outliers
3. Feature Scaling
   - Apply Min-Max Scaling: X_scaled = (X - X_min) / (X_max - X_min)
4. Data Augmentation
   - Generate synthetic samples using CTGAN
   - Merge synthetic data with original dataset
5. Data Splitting & Cross Validation
   - Apply stratified 5-fold cross-validation
   - Split data into training and validation sets in each fold
6. Model Training & Evaluation
   For each classifier in {MLP, XGBoost, Gradient Boosting, Random Forest, LightGBM, TabNet, Logistic Regression, Linear Discriminant Analysis, AdaBoost, Decision Tree}:
   - Train model on training data
   - Validate model on validation data
   - Record performance metrics (Accuracy, Precision, Recall, F1, AUC, etc.)
7. Result Interpretation
   - Compare performance of all models
   - Identify best performing model based on validation metrics
   - Report test set performance of the selected model
End

3.1. Data Source and Ethical Considerations

The dataset used in this study is a subset of the Fasa Cohort Study conducted in Iran [ 28 ]. The original cohort investigates the association between various risk factors and chronic non-communicable diseases (NCDs) among rural residents of Fasa, a city with a population of approximately 250,000 in Fars Province [ 21 ]. For the purpose of this study, we focused exclusively on participants with diabetes and healthy conditions, along with their body composition measures. Our subset consists of 4,661 samples, including 2,155 males and 2,506 females, with ages ranging from 35 to 70 years. All methods were carried out in accordance with relevant guidelines and regulations.
The study protocol was reviewed and approved by the Medical Ethics Committee of the School of Medicine, Shiraz University of Medical Sciences, Shiraz, Iran (IR.SUMS.MED.REC.1401.167). Informed consent was obtained from all participants prior to their inclusion in the study, ensuring ethical compliance and participant confidentiality.

3.2. Data Preprocessing

Data preprocessing includes handling missing values through imputation techniques such as mean, median, mode, or predictive methods; cleaning and encoding data by removing duplicates and converting categorical features to numeric format; and treating outliers using statistical approaches. The encoding step can be formally expressed as:

$$X_{\text{encoded}} = \text{OneHotEncode}(X) \tag{1}$$

where $X_{\text{encoded}}$ is the transformed data. Next, features are standardized using Z-score normalization, which ensures consistent scaling across features and centers them around zero with unit variance:

$$X_{\text{scaled}} = \frac{X - \mu}{\sigma} \tag{2}$$

where $\mu$ and $\sigma$ are the mean and standard deviation of each feature. These steps facilitated homogeneous feature representation and enhanced model stability during training.

3.3. Synthetic Data Generation with CTGAN

To address limitations in dataset size and class imbalance, synthetic samples were generated using CTGAN [ 29 ]. CTGAN has been shown to effectively generate realistic tabular data, especially in healthcare applications where privacy restrictions limit data availability. The synthetic records were integrated with the original dataset, thereby improving the diversity of training samples and supporting model generalization.

3.4. Model Training and Validation

The dataset was divided into training and test subsets. Stratified 5-fold cross-validation was adopted to ensure robust performance evaluation while preserving class distribution across folds.
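To make the stratification idea concrete, the following is a minimal standard-library sketch, not the authors' implementation. It assumes binary labels held in a plain list; the function name `stratified_kfold` and the toy labels are hypothetical. Each class is shuffled and dealt round-robin across the folds so that every fold keeps roughly the original class ratio.

```python
import random
from collections import defaultdict


def stratified_kfold(labels, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs whose class ratios mirror `labels`."""
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Shuffle each class, then deal its indices round-robin into the k folds.
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)

    # Each fold serves once as the validation set; the rest form the training set.
    for i in range(k):
        val_idx = sorted(folds[i])
        train_idx = sorted(j for f, fold in enumerate(folds) if f != i for j in fold)
        yield train_idx, val_idx


# Hypothetical toy labels: 20 positives (1) and 80 negatives (0).
labels = [1] * 20 + [0] * 80
splits = list(stratified_kfold(labels, k=5))
for train_idx, val_idx in splits:
    positives = sum(labels[i] for i in val_idx)
    print(len(train_idx), len(val_idx), positives)
```

In practice a library routine such as scikit-learn's `StratifiedKFold` would normally be used; the sketch only shows why each validation fold ends up with the same class proportion as the full dataset.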
Models were trained on the training set and validated across folds, and the final evaluation was performed on the independent test set. Performance metrics included accuracy, precision, recall, F1-score, and the AUC.

3.5. Classifiers used

This subsection provides a detailed overview of the classifiers investigated in the study. The choice of algorithms was motivated by their proven applicability to medical data, capacity for handling heterogeneous feature types, and varying degrees of interpretability. Below, each classifier is introduced along with its strengths, limitations, and reported applications in diabetes prediction tasks.

3.5.1. Decision Trees

Decision tree models are intuitive and interpretable algorithms that recursively partition data into subsets according to feature values. Variants such as CART [ 30 ], C4.5 [ 31 ], CHAID [ 32 ], and QUEST [ 33 ] introduce refinements including pruning, multi-way splitting, and unbiased variable selection. Decision trees are suitable both as standalone models and as base learners in ensemble frameworks such as random forests [ 34 ].

3.5.2. Random Forest

Random forest employs bootstrap aggregation and random feature selection to generate diverse decision trees, reducing variance and improving generalization [ 34 ]. In healthcare applications, random forest has demonstrated robustness to noisy and imbalanced data and provides clinically meaningful feature importance rankings [ 35 , 36 ].

3.5.3. Logistic Regression

Logistic regression remains a benchmark classifier in medical research due to its interpretability and statistical grounding. It quantifies the impact of clinical features such as glucose levels, age, and BMI on diabetes risk [ 37 , 38 ]. Extensions such as one-vs-rest allow application to multiclass problems, while coefficients provide insights into odds ratios relevant to clinical practice.

3.5.4. Gradient Boosting Machines

Gradient boosting machines iteratively train weak learners to correct errors from prior iterations, producing highly predictive models. They have been successfully applied to diabetes prediction tasks, capturing nonlinear relationships in clinical and behavioral data [ 39 ]. However, they require careful hyperparameter tuning to avoid overfitting.

3.5.5. LightGBM

LightGBM enhances GBM by employing histogram-based algorithms and leaf-wise growth strategies, improving efficiency and scalability [ 40 ]. It supports categorical features directly and is particularly useful in large-scale medical datasets requiring rapid inference.

3.5.6. XGBoost

XGBoost introduces second-order optimization and regularization, offering strong predictive power and resilience to missing data [ 41 ]. It has consistently outperformed traditional classifiers in healthcare studies, including Type 2 diabetes prediction [ 34 ].

3.5.7. TabNet

TabNet leverages attention mechanisms to dynamically select features at each decision step, enabling interpretability while modeling complex nonlinear interactions [ 42 ]. This is especially valuable for medical prediction tasks where transparency and explainability are crucial.

3.5.8. AdaBoost

Adaptive Boosting sequentially combines weak learners, emphasizing misclassified instances at each iteration. While effective in clean datasets, it is sensitive to noise and often benefits from integration with data cleaning or balancing strategies [ 43 ].

3.5.9. Linear Discriminant Analysis

Linear discriminant analysis projects data onto lower-dimensional spaces that maximize class separability. Its simplicity and computational efficiency make it valuable for early-stage classification in medical datasets [ 44 ]. LDA can also serve as a dimensionality reduction step prior to applying complex classifiers.

4. Results

In this section, we present the experimental results of the proposed diabetes prediction framework.
The simulations were conducted in the Google Colab environment, using a carefully pre-processed dataset derived from the Fasa Cohort Study. We first describe the detailed body composition measures collected from participants in Section 4.1, followed by the evaluation metrics used to assess model performance in Section 4.2. Section 4.3 presents the simulation results, including data preprocessing, handling of imbalances, and feature correlations. The cross-validation strategy employed is detailed in Section 4.4, while Section 4.5 discusses the implementation and hyperparameter settings of various machine learning models. Finally, Section 4.6 compares the performance of our proposed method with existing state-of-the-art approaches.

4.1. Body Composition Measures

All participants were assessed for body composition using the FDA-approved Tanita Segmental Body Composition Analyzer BC-418 MA (Tanita Corp, Japan) [ 21 ]. The data were collected while each subject stood barefoot on the device and gripped the attached handles, allowing bioelectrical impedance measurements through eight polar electrodes at the contact points. These measurements determined total body water, fat mass, fat-free mass, and fat percentage for the entire body and for specific regions on the left and right sides, as well as basal metabolic rate.

4.2. Evaluation metrics

The model's performance is evaluated using accuracy, AUROC, precision, recall, F1 score, Matthews Correlation Coefficient (MCC), and geometric mean (G-Mean). The formulas used to calculate each evaluation measure are presented in Table 1.

Table 1 Evaluation Metrics for Classification Models.

| Metric | Formula |
|---|---|
| Accuracy | (TP + TN) / (TP + TN + FP + FN) |
| True Positive Rate (TPR) | TP / (TP + FN) |
| False Positive Rate (FPR) | FP / (FP + TN) |
| AUROC | $\int_0^1 \text{TPR} \, d(\text{FPR})$ |
| Precision | TP / (TP + FP) |
| Recall | TP / (TP + FN) |
| F1 Score | 2 × Precision × Recall / (Precision + Recall) |
| Matthews Corr. Coefficient (MCC) | $\dfrac{\text{TP} \times \text{TN} - \text{FP} \times \text{FN}}{\sqrt{(\text{TP}+\text{FP})(\text{TP}+\text{FN})(\text{TN}+\text{FP})(\text{TN}+\text{FN})}}$ |
| G-Mean | $\sqrt{\text{TPR} \times (1 - \text{FPR})}$ |

Based on the definitions presented in Table 1, the key evaluation terms used in the performance metrics are explained as follows:

• True Positives (TP): The number of samples that actually belong to the positive class and are correctly predicted as positive by the model.
• True Negatives (TN): The number of samples that actually belong to the negative class and are correctly predicted as negative by the model.
• False Positives (FP): The number of samples that actually belong to the negative class but are incorrectly predicted as positive by the model.
• False Negatives (FN): The number of samples that actually belong to the positive class but are incorrectly predicted as negative by the model.
• True Positive Rate (TPR): The proportion of actual positive samples that are correctly identified as positive.
• False Positive Rate (FPR): The proportion of actual negative samples that are incorrectly identified as positive.

4.3. Simulation Results

As shown in Fig. 1, the collected data is first pre-processed. Real-world data often contain missing entries, and leaving them in place can distort the measured accuracy of the model, so this issue must be addressed. Here, we used three imputation rules. The first is mode imputation on categorical features, which replaces missing values with the most frequent value. The second is mean imputation on numerical features, which replaces a missing value with the mean of the feature column. Finally, we drop the complete sample (row) where the target label is missing.
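The three imputation rules above can be sketched with the Python standard library as follows. This is an illustrative sketch under stated assumptions, not the authors' code: rows are plain dictionaries, missing values are represented as `None`, the target column name `diabetes` and all values are hypothetical, and `GenderID` and `Age` are borrowed from the feature names only as examples.

```python
from statistics import mean, mode


def impute(rows, numeric_cols, categorical_cols, target_col):
    """Mode-fill categorical columns, mean-fill numeric columns, drop unlabeled rows."""
    fills = {}
    # Mode imputation: the most frequent observed value of each categorical feature.
    for col in categorical_cols:
        fills[col] = mode(r[col] for r in rows if r[col] is not None)
    # Mean imputation: the mean of the observed values of each numeric feature.
    for col in numeric_cols:
        fills[col] = mean(r[col] for r in rows if r[col] is not None)

    cleaned = []
    for r in rows:
        # Drop the complete sample when its target label is missing.
        if r[target_col] is None:
            continue
        cleaned.append(
            {c: (fills[c] if c in fills and v is None else v) for c, v in r.items()}
        )
    return cleaned


# Hypothetical toy records; None marks a missing entry.
rows = [
    {"GenderID": 1, "Age": 40.0, "diabetes": 0},
    {"GenderID": 2, "Age": None, "diabetes": 1},
    {"GenderID": None, "Age": 50.0, "diabetes": 0},
    {"GenderID": 2, "Age": 60.0, "diabetes": None},
]
cleaned = impute(rows, numeric_cols=["Age"], categorical_cols=["GenderID"], target_col="diabetes")
print(cleaned)
```

Here the fill values are computed from all rows before the unlabeled row is dropped, following the order in which the rules are listed; computing them after dropping would be an equally plausible reading.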
Next, all features are converted to numerical values using the encoding scheme presented in Section 3 so that they can be processed further. The statistical description of the data after this step is shown in Table 1.

Table 1 Descriptive statistics of dataset features.

| Column | dtype | unique | min | max | median | Standard deviation | outliers | lower bound | upper bound |
|---|---|---|---|---|---|---|---|---|---|
| GenderID | int64 | 2 | 1 | 2 | 2.0 | 0.5 | 0 | -0.5 | 3.5 |
| Age | float64 | 37 | 35.0 | 70.0 | 46.0 | 9.36 | 0 | 15.0 | 79.0 |
| bmr | int64 | 3703 | 1132 | 5837 | 2071.0 | 1019.19 | 87 | 3301.0 | 8557.0 |
| FATP | float64 | 471 | 1.5 | 53.6 | 27.8 | 10.04 | 0 | -2.2 | 57.8 |
| FATM | float64 | 443 | 0.7 | 67.5 | 18.7 | 9.05 | 52 | -5.65 | 42.75 |
| FFM | float64 | 439 | 8.5 | 83.6 | 46.8 | 8.78 | 46 | 23.95 | 71.35 |
| TBW | float64 | 401 | 20.9 | 62.4 | 34.3 | 6.43 | 46 | 17.55 | 52.35 |
| IMP | int64 | 435 | 32 | 949 | 613.0 | 79.63 | 71 | 411.0 | 819.0 |
| RLFATP | float64 | 490 | 1.5 | 55.5 | 32.9 | 12.93 | 0 | -18.3 | 76.9 |
| RLFATM | float64 | 104 | 0.1 | 12.9 | 3.5 | 1.91 | 12 | -2.5 | 9.5 |
| RLFFM | float64 | 101 | 4.7 | 18.4 | 7.9 | 1.65 | 43 | 3.45 | 12.65 |
| LLFATP | float64 | 478 | 1.5 | 55.2 | 33.0 | 12.8 | 0 | -17.65 | 76.35 |
| LLFATM | float64 | 102 | 0.2 | 12.9 | 3.4 | 1.89 | 12 | -2.48 | 9.33 |
| LLFFM | float64 | 97 | 4.7 | 17.5 | 7.7 | 1.61 | 48 | 3.5 | 12.3 |
| RAFATP | float64 | 490 | 2.3 | 63.4 | 25.3 | 11.48 | 0 | -10.6 | 64.6 |
| RAFATM | float64 | 41 | 1.0 | 5.3 | 2.4 | 0.56 | 109 | 0.45 | 4.35 |
| RAFFM | float64 | 41 | 0.1 | 5.6 | 2.4 | 0.65 | 48 | -0.75 | 4.35 |
| LAFATP | float64 | 503 | 2.2 | 64.3 | 26.4 | 11.65 | 0 | -10.65 | 65.75 |
| LAFATM | float64 | 45 | 0.1 | 6.2 | 2.0 | 0.62 | 111 | -0.6 | 4.6 |
| LAFFM | float64 | 41 | 1.3 | 5.3 | 2.4 | 0.65 | 51 | 0.75 | 4.35 |
| TRFATP | float64 | 439 | 3.0 | 51.0 | 26.9 | 9.18 | 2 | -2.05 | 55.95 |
| TRFATM | float64 | 244 | 1.0 | 10.3 | 4.4 | 1.46 | 27 | -1.85 | 10.65 |
| TRFFM | float64 | 242 | 15.7 | 44.9 | 26.2 | 4.37 | 52 | 15.0 | 38.2 |

Table 1 shows that the features span very different ranges and that many outliers are present. These issues are addressed using the preprocessing steps described in Section 3. The accuracy of ML algorithms is also strongly affected by class imbalance. When the data classes are imbalanced or skewed, the results are biased towards the majority class labels.
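A small worked example may help show why this bias matters. The sketch below applies the metric definitions from Section 4.2 to a hypothetical classifier that always predicts the majority class. The class counts (900 negatives, 100 positives) are invented for illustration, and returning 0 for MCC when its denominator is zero is a common convention assumed here rather than something stated in the paper.

```python
from math import sqrt


def metrics(tp, tn, fp, fn):
    """Compute accuracy, recall (TPR), MCC, and G-Mean from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: MCC is reported as 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    g_mean = sqrt(tpr * (1 - fpr))
    return {"accuracy": accuracy, "recall": tpr, "mcc": mcc, "g_mean": g_mean}


# Hypothetical data: 900 healthy, 100 diabetic; the model predicts "healthy" for everyone.
majority_only = metrics(tp=0, tn=900, fp=0, fn=100)
print(majority_only)
# {'accuracy': 0.9, 'recall': 0.0, 'mcc': 0.0, 'g_mean': 0.0}
```

Accuracy looks high at 0.9 even though no diabetic case is detected, while recall, MCC, and G-Mean all drop to zero. This is why imbalance-aware metrics are reported alongside accuracy.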
To counter this bias, we applied synthetic minority sample generation with CTGAN to increase the minority samples (diabetes = 571) to equal the majority samples (healthy = 4661). Finally, the dataset size becomes 9332 (diabetes = 4661, and healthy = 4661). To further understand the relationship between these body composition features and the class labels, we present a correlation matrix as a heat map in Fig. 2.

4.4. Cross-Validation

The next important step before training an ML model is cross-validation, which determines how the data are split for training, validation, and testing. We used stratified five-fold cross-validation. The dataset is first divided into five equal-sized subsets (folds) while ensuring that each fold maintains the same proportion of each class as the original dataset. The model is then trained five times, each time using four folds for training and the remaining fold for validation. This process helps reduce bias and variance, providing a more reliable estimate of model performance, especially for imbalanced datasets.

4.5. ML implementation

Finally, we trained several ML algorithms, as shown in Fig. 1. Table 2 presents the hyperparameters used for all these models. Table 3 shows the final performance measures of all these models on our data. The table shows that almost all models perform well, with MLP outperforming the others on every performance measure. The area under the curve plot for this model is shown in Fig. 3, and a comparative visualization of all models based on evaluation criteria is illustrated in Fig. 4.

Table 2 ML models and parameters.

| Model | Parameters |
|---|---|
| MLP | hidden_layer_sizes=(100,), activation='relu', solver='adam', alpha=0.0001, batch_size='auto', learning_rate='constant', max_iter=200 |
| Gradient Boosting | loss='log_loss', learning_rate=0.1, n_estimators=100, max_depth=3, min_samples_split=2, min_samples_leaf=1, subsample=1.0 |
| LightGBM | boosting_type='gbdt', num_leaves=31, learning_rate=0.1, n_estimators=100, max_depth=-1, min_child_samples=20, reg_alpha=0.0, reg_lambda=0.0 |
| XGBoost | booster='gbtree', learning_rate=0.3, n_estimators=100, max_depth=6, min_child_weight=1, subsample=1.0, colsample_bytree=1.0, reg_alpha=0, reg_lambda=1 |
| Random Forest | n_estimators=100, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, bootstrap=True |
| TabNet | optimizer_fn=torch.optim.Adam, optimizer_params={'lr': 5e-4}, scheduler_fn=torch.optim.lr_scheduler.StepLR, scheduler_params={'step_size': 10, 'gamma': 0.9}, mask_type='entmax', max_epochs=100, patience=100, batch_size=256, drop_last=False |
| Logistic Regression | penalty='l2', C=1.0, solver='lbfgs', max_iter=100, multi_class='auto' |
| AdaBoost | n_estimators=50, learning_rate=1.0, algorithm='SAMME.R' |
| Linear Discriminant Analysis | solver='svd' |
| Decision Tree | criterion='gini', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1 |

Table 3 Performance Metrics of the proposed method using ML Models combined with CTGAN.
| Model | Accuracy | AUROC | Precision | Recall | F1 Score | MCC | Geometric Mean |
|---|---|---|---|---|---|---|---|
| Gradient Boosting | 93.30% | 93.26% | 93.63% | 93.26% | 93.28% | 86.90% | 93.17% |
| Light Gradient Boosting Machine | 93.11% | 93.08% | 93.48% | 93.08% | 93.09% | 86.55% | 92.97% |
| XGBoost | 92.87% | 92.84% | 93.14% | 92.84% | 92.85% | 85.97% | 92.76% |
| Random Forest | 92.56% | 92.53% | 92.82% | 92.53% | 92.54% | 85.35% | 92.45% |
| TabNet | 92.53% | 92.52% | 93.07% | 92.52% | 92.51% | 85.58% | 92.35% |
| Logistic Regression | 91.51% | 91.46% | 92.48% | 91.46% | 91.46% | 83.93% | 91.15% |
| AdaBoost | 90.16% | 90.14% | 90.34% | 90.14% | 90.14% | 80.47% | 90.08% |
| Linear Discriminant Analysis | 89.91% | 89.84% | 91.61% | 89.84% | 89.80% | 81.44% | 89.28% |
| Decision Tree | 87.76% | 87.77% | 87.77% | 87.77% | 87.76% | 75.54% | 87.76% |
| MLP | 93.91% | 93.87% | 94.48% | 93.87% | 93.89% | 88.34% | 93.71% |

The experimental results show a clear ranking of the models for diabetes prediction. The MLP model emerged as the top performer on every metric, with the highest accuracy (93.91%), AUROC (93.87%), and F1 score (93.89%). This suggests that neural networks are effective at capturing complex patterns within the dataset. Gradient Boosting, LightGBM, and XGBoost followed closely, each exceeding 92% on every metric except MCC (85.97% to 86.90%). These boosting-based methods benefit from sequential learning and error correction, which makes them well suited to medical classification tasks with imbalanced or noisy data. Random Forest and TabNet also performed reliably, at roughly 92.5% for accuracy, AUROC, recall, and F1 score. While Random Forest provides robustness and interpretability, TabNet combines deep learning with attention mechanisms tailored to tabular data, maintaining accuracy while offering explainability. Logistic Regression reached 91.51% accuracy, showing that linear models still hold value in medical contexts because of their simplicity and transparency. AdaBoost and LDA delivered lower results, at about 90% accuracy.
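Most of the measures reported in Table 3 can be derived from a binary confusion matrix. The following is a minimal sketch of those formulas, not the authors' evaluation code: the function names `safe_div` and `binary_metrics` and the example counts are hypothetical, and the metrics are computed for the positive (diabetes) class only, since the paper does not state which averaging scheme it used. AUROC is omitted because it requires predicted scores rather than counts.

```python
import math


def safe_div(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero (metric undefined)."""
    return numerator / denominator if denominator else 0.0


def binary_metrics(tp, fp, fn, tn):
    """Compute confusion-matrix metrics for the positive class.

    tp, fp, fn, tn are the counts of true positives, false positives,
    false negatives, and true negatives.
    """
    accuracy = safe_div(tp + tn, tp + fp + fn + tn)
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)          # sensitivity
    specificity = safe_div(tn, tn + fp)
    f1 = safe_div(2 * precision * recall, precision + recall)
    # Matthews correlation coefficient: balanced even when classes differ in size.
    mcc = safe_div(
        tp * tn - fp * fn,
        math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    )
    # Geometric mean of sensitivity and specificity.
    g_mean = math.sqrt(recall * specificity)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": mcc,
        "g_mean": g_mean,
    }


if __name__ == "__main__":
    # Hypothetical counts for one validation fold (not taken from the paper).
    for name, value in binary_metrics(tp=90, fp=5, fn=10, tn=95).items():
        print(f"{name}: {value:.4f}")
```

Under five-fold cross-validation, one would typically compute these values per fold and average them. MCC is lower than the other measures in Table 3 because it ranges from -1 to 1 and penalizes both error types jointly, so it is not directly comparable with accuracy on the same scale.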
While effective in some structured-data scenarios, AdaBoost and LDA were less competitive than the ensemble and deep learning techniques. The Decision Tree, used individually, recorded the lowest performance (87.76% accuracy). Its simplicity and tendency to overfit indicate that it is better used within ensemble methods, which generalize better.

4.6. Comparison with State-of-the-Art

Using body composition data for diabetes prediction is a relatively new field, and only a few published studies address it. To our knowledge, two earlier studies [ 13 , 21 ] have used the same dataset of 4,661 participants to develop machine learning-based diagnostic models for diabetes. Table 4 compares their results with those of our method, and Fig. 5 illustrates the same comparison.

Table 4 Performance comparison of diabetes prediction methods on the same dataset (4,661 samples).

| Study | Method | Accuracy (%) | Precision (%) | Recall (%) | F1-score (%) |
|---|---|---|---|---|---|
| Nematollahi et al. [ 21 ] (2024) | Feature Selection + XGBoost | 89.96 | 90.20 | 89.65 | 89.91 |
| Nematollahi et al. [ 13 ] (2025) | ANOVA + ADASYN + XGBoost | 92.04 | 92.30 | 92.10 | 92.10 |
| Our Method | CTGAN + MLP | 93.91 | 94.48 | 93.87 | 93.89 |

The first study [ 21 ] employed a basic feature selection process followed by the XGBoost algorithm, reporting an accuracy of 89.96%. While effective to some extent, this method lacked comprehensive preprocessing and did not incorporate any oversampling strategy to address class imbalance. The second study, by Nematollahi et al., introduced a more refined hybrid model that combines ANOVA for feature selection with the ADASYN oversampling technique and the XGBoost classifier [ 13 ]. Their method performed better on all evaluation metrics, with an accuracy of 92.04%, precision of 92.30%, recall of 92.10%, and F1-score of 92.10%.
In contrast, our proposed method incorporates a more extensive preprocessing pipeline and uses CTGAN-based augmentation to improve generalization. Furthermore, we evaluate multiple machine learning models, with the hyperparameters listed in Table 2, under stratified five-fold cross-validation. This yields the highest performance on all four reported metrics: accuracy of 93.91%, precision of 94.48%, recall of 93.87%, and F1-score of 93.89%.

5. Conclusion and future work

In conclusion, this research highlights the potential of ML and DL models for assessing diabetes from body composition data. A systematic preprocessing pipeline, including data cleaning, feature transformation, and augmentation, provides high-quality input for the classification models, and data augmentation with CTGAN balances the classes used for training. This preprocessing improved the generalizability and accuracy of the predictive models. MLP outperforms the other ML models in accuracy, precision, recall, and F1-score, demonstrating its effectiveness in capturing intricate relationships in the dataset. The findings underscore the potential of ML approaches for enhancing diabetes prediction.

However, several additions are needed before this model can be used in real-world practice. Integrating clinical parameters such as blood glucose levels, HbA1c, family history, lifestyle factors, and genetic markers alongside the features considered in this work could improve diagnostic accuracy. Applying the models to real-world clinical datasets from diverse populations and geographical locations would help validate and generalize the findings. Deploying the method behind a user interface would make it easier for clinicians to use routinely.

Declarations

Conflict of Interest: The authors declare that they have no conflict of interest.

Ethical Approval: This study is based on a subset of the Fasa Cohort Study conducted in Iran.
The cohort study was approved by the Ethics Committee of Shiraz University of Medical Sciences. All participants provided written informed consent prior to data collection.

Funding: This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.

Author Contribution

Conceptualization: J.H.J, M.M, and M.A.N: Formal Analysis and Methodology, M.M: Original Draft, K.N.R, K.R.R., and S.P.V: Editing. J.H.J, P.K, and M.A.N: Supervised by P.K. All authors approved the final manuscript.

Acknowledgement

Dear Editor, I would like to sincerely thank the editor for their time and effort in handling our manuscript. I would also like to mention that I have been serving as a reviewer for Scientific Reports for several years, and I have previously published multiple articles in this journal. I deeply value the mission and standards of Scientific Reports, and I hope to continue contributing to its success. Best regards, Javad Hassannataj Joloudari, Corresponding author

Data Availability

The dataset analyzed in this study is a subset of the Fasa Cohort Study conducted in Iran. Due to privacy regulations and ethical considerations, the data cannot be publicly shared. However, researchers may request access to the dataset from the corresponding author, subject to appropriate ethical approvals and institutional agreements.

References

1. Mayya, V. et al. Need for an Artificial Intelligence-based Diabetes Care Management System in India and the United States. Health Serv. Res. Managerial Epidemiol. 11, 23333928241275292 (2024).
2. Katsarou, A. et al. Type 1 diabetes mellitus. Nat. Rev. Dis. Primers 3(1), 1–17 (2017).
3. Zhong, Z., Li, J., Clausi, D. A. & Wong, A. Generative adversarial networks and conditional random fields for hyperspectral image classification. IEEE Trans. Cybern. 50(7), 3318–3329 (2019).
4. DeFronzo, R. A. et al. Type 2 diabetes mellitus. Nat. Rev. Dis. Primers 1(1), 1–22 (2015).
5. McIntyre, H. D. et al. Gestational diabetes mellitus. Nat. Rev. Dis. Primers 5(1), 47 (2019).
6. Camacho, M., Atehortúa, A., Wilkinson, T., Gkontra, P. & Lekadir, K. Low-cost predictive models of dementia risk using machine learning and exposome predictors. Health Technol. 15(2), 355–365 (2025).
7. Sun, H. et al. IDF Diabetes Atlas: Global, regional and country-level diabetes prevalence estimates for 2021 and projections for 2045. Diabetes Res. Clin. Pract. 183, 109119 (2022).
8. Roglic, G. WHO Global report on diabetes: A summary. Int. J. Noncommunicable Dis. 1(1), 3–8 (2016).
9. Katiyar, N., Thakur, H. K. & Ghatak, A. Recent advancements using machine learning & deep learning approaches for diabetes detection: a systematic review. e-Prime - Advances in Electrical Engineering, Electronics and Energy, 100661 (2024).
10. Nimmagadda, S. M., Suryanarayana, G., Kumar, G. B., Anudeep, G. & Sai, G. V. A Comprehensive Survey on Diabetes Type-2 (T2D) Forecast Using Machine Learning. Arch. Comput. Methods Eng. 31(5), 2905–2923 (2024).
11. Wee, B. F., Sivakumar, S., Lim, K. H., Wong, W. K. & Juwono, F. H. Diabetes detection based on machine learning and deep learning approaches. Multimedia Tools Appl. 83(8), 24153–24185 (2024).
12. Masson, G., Morais, F., Rocha, E. & Endo, P. T. Evaluation of artificial intelligence models for predicting low birth weight using Brazilian real data. Health Technol. 15(1), 169–184 (2025).
13. Nematollahi, M. A. et al. Evolution of diabetes prediction using the fusion of ANOVA, ADASYN technique and XGBoost based on body composition data. J. Diabetes Metabolic Disorders 24(2), 1–11 (2025).
14. Guariguata, L., Whiting, D., Weil, C. & Unwin, N. The International Diabetes Federation diabetes atlas methodology for estimating global and national prevalence of diabetes in adults. Diabetes Res. Clin. Pract. 94(3), 322–332 (2011).
15. Escobedo-de la Peña, J., Ramírez-Hernández, J. A., Fernández-Ramos, M. T., González-Figueroa, E. & Champagne, B. Body fat percentage rather than body mass index related to the high occurrence of type 2 diabetes. Arch. Med. Res. 51(6), 564–571 (2020).
16. Lee, B. J. & Kim, J. Y. Identification of type 2 diabetes risk factors using phenotypes consisting of anthropometry and triglycerides based on machine learning. IEEE J. Biomed. Health Inf. 20(1), 39–46 (2015).
17. Zhu, A. et al. Correlation of abdominal fat distribution with different types of diabetes in a Chinese population. Journal of Diabetes Research, no. 1, 651462 (2013).
18. Elgendy, I. A., Hosny, M., Albashrawi, M. A. & Alsenan, S. Dual-stage explainable ensemble learning model for diabetes diagnosis. Expert Syst. Appl. 274, 126899 (2025).
19. Dutta, A. et al. Early prediction of diabetes using an ensemble of machine learning models. Int. J. Environ. Res. Public Health 19(19), 12378 (2022).
20. Naseem, A. et al. Novel Internet of Things based approach toward diabetes prediction using deep learning models. Front. Public Health 10, 914106 (2022).
21. Nematollahi, M. A. et al. A cohort study on the predictive capability of body composition for Diabetes Mellitus using machine learning. J. Diabetes Metabolic Disorders 23(1), 773–781 (2024).
22. Reza, M. S., Hafsha, U., Amin, R., Yasmin, R. & Ruhi, S. Improving SVM performance for type II diabetes prediction with an improved non-linear kernel: Insights from the PIMA dataset. Comput. Methods Programs Biomed. Update 4, 100118 (2023).
23. Zhang, Z. et al. A novel evolutionary ensemble prediction model using harmony search and stacking for diabetes diagnosis. J. King Saud University-Computer Inform. Sci. 36(1), 101873 (2024).
24. Cheng, H., Zhu, J., Li, P. & Xu, H. Combining knowledge extension with convolution neural network for diabetes prediction. Eng. Appl. Artif. Intell. 125, 106658 (2023).
25. Mushtaq, Z. et al. Voting Classification-Based Diabetes Mellitus Prediction Using Hypertuned Machine-Learning Techniques. Mobile Information Systems, no. 1, 6521532 (2022).
26. Nurzari, I., Sari, E., Harris, D. I., Priyatno, A. M. & Rusnedy, H. Inter-Cluster Distance-Based SMOTE Modification for Enhanced Diabetes Classification. ITEGAM-JETIA 11(51), 190–196 (2025).
27. Lu, H., Uddin, S., Hajati, F., Moni, M. A. & Khushi, M. A patient network-based machine learning model for disease prediction: The case of type 2 diabetes mellitus. Appl. Intell. 52(3), 2411–2422 (2022).
28. Farjam, M. et al. A cohort study protocol to analyze the predisposing factors to common chronic non-communicable diseases in rural areas: Fasa Cohort Study. BMC Public Health 16, 1–8 (2016).
29. Ravuri, S. & Vinyals, O. Classification accuracy score for conditional generative models. Adv. Neural Inf. Process. Syst. 32 (2019).
30. Steinberg, D. CART: classification and regression trees. In The Top Ten Algorithms in Data Mining, 193–216 (Chapman and Hall/CRC, 2009).
31. Ruggieri, S. Efficient C4.5 [classification algorithm]. IEEE Trans. Knowl. Data Eng. 14(2), 438–444 (2002).
32. Huang, H., Lin, T. K. & Ngui, P. Analysing a mental health survey by chi-squared automatic interaction detection. Annals of the Academy of Medicine, Singapore 22(3), 332–337 (1993).
33. Lee, S. & Park, I. Application of decision tree model for the ground subsidence hazard mapping near abandoned underground coal mines. J. Environ. Manage. 127, 166–176 (2013).
34. Tuysuzoglu, G. et al. Joint Tomek Links (JTL): An Innovative Approach to Noise Reduction for Enhanced Classification Performance. IEEE Access (2025).
35. Lu, M., Sadiq, S., Feaster, D. J. & Ishwaran, H. Estimating individual treatment effect in observational data using random forest methods. J. Comput. Graphical Stat. 27(1), 209–219 (2018).
36. Oshiro, T. M., Perez, P. S. & Baranauskas, J. A. How many trees in a random forest? In Machine Learning and Data Mining in Pattern Recognition: 8th International Conference, MLDM 2012, Berlin, Germany, July 13–20, 2012, Proceedings, 154–168 (Springer, 2012).
37. Bender, R. & Grouven, U. Ordinal logistic regression in medical research. J. R. Coll. Physicians Lond. 31(5), 546 (1997).
38. Devika, S., Jeyaseelan, L. & Sebastian, G. Analysis of sparse data in logistic regression in medical research: A newer approach. J. Postgrad. Med. 62(1), 26–31 (2016).
39. Bentéjac, C., Csörgő, A. & Martínez-Muñoz, G. A comparative analysis of gradient boosting algorithms. Artif. Intell. Rev. 54, 1937–1967 (2021).
40. Sai, M. J. et al. An ensemble of Light Gradient Boosting Machine and adaptive boosting for prediction of type-2 diabetes. Int. J. Comput. Intell. Syst. 16(1), 14 (2023).
41. Dong, W., Huang, Y., Lehane, B. & Ma, G. XGBoost algorithm-based prediction of concrete electrical resistivity for structural health monitoring. Autom. Constr. 114, 103155 (2020).
42. Wang, H. et al. Enhancing predictive accuracy for urinary tract infections post-pediatric pyeloplasty with explainable AI: an ensemble TabNet approach. Sci. Rep. 15(1), 2455 (2025).
43. Wang, C., Xu, S. & Yang, J. Adaboost algorithm in artificial intelligence for optimizing the IRI prediction accuracy of asphalt concrete pavement. Sensors 21(17), 5682 (2021).
44. Zhu, F., Gao, J., Yang, J. & Ye, N. Neighborhood linear discriminant analysis. Pattern Recogn. 123, 108422 (2022).

Additional Declarations

No competing interests reported.
Editorial history at journal:
Editorial decision: Revision requested, 15 Sep, 2025
Reviews received at journal: 13 Sep, 12 Sep, 10 Sep, 09 Sep, and 07 Sep (two), 2025
Reviewers agreed at journal: 07 Sep, 2025 (six)
Reviewers invited by journal: 07 Sep, 2025
Editor assigned by journal: 02 Sep, 2025
Editor invited by journal: 22 Aug, 2025
Submission checks completed at journal: 21 Aug, 2025
First submitted to journal: 19 Aug, 2025
Also discoverable on Platform About Our Team In Review Editorial Policies Advisory Board Help Center Resources Author Services Accessibility API Access RSS feed Manage Cookie Preferences © Research Square 2026 | ISSN 2693-5015 (online) Privacy Policy Terms of Service Do Not Sell My Personal Information {"props":{"pageProps":{"initialData":{"identity":"rs-7344799","acceptedTermsAndConditions":true,"allowDirectSubmit":false,"archivedVersions":[],"articleType":"Article","associatedPublications":[],"authors":[{"id":513259859,"identity":"107a0800-d142-42f5-9683-f3173fca1565","order_by":0,"name":"Javad Hassannataj Joloudari","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAABQ0lEQVRIie2RQUuEQBSAnwjbZbazi5F/YUQwOmz7VxRhT7IsCLFQkBLkRdarsgv9BffSWRnQi909RBALnjawiwQt0bh0qMbqGuQH83jzmI/3HgPQ0fEH4WwaEMQfKq5Hr4hmuyL+VcHAefnPyo5PCgjmex63PgbezeVqA3eS5JNyPT3fSnz4mJDZwclkP+YfKpjeM4N5pjJYQilHxfhICVIsO4uJluTIsAZxTxEAW+wuY01EQLhIAFVEPcw5CxMnNuL1KAaVTqoxil8aL1QZXftZLaJXPHLCvFEuqLJXtyqBkTZddDs2VbF/hXUnQI1CqILauwTr9HiJiREV5qnYnyuG49FdbJTpIUGWoLGK7OuXxWZGhnSwGxHVh8OVe0ueHO9Mn2fuqqq2rGI3EX+tcF7zAfQwAoD0TeWZfdrR0dHxb3kDX+R2gdEkVmkAAAAASUVORK5CYII=","orcid":"","institution":"Red Crescent Society of the Islamic Republic of Iran","correspondingAuthor":true,"prefix":"","firstName":"Javad","middleName":"Hassannataj","lastName":"Joloudari","suffix":""},{"id":513259860,"identity":"11bcb63f-bc0f-4d2a-bb11-4690c1b9613a","order_by":1,"name":"Mohammad Maftoun","email":"","orcid":"","institution":"Islamic Azad University","correspondingAuthor":false,"prefix":"","firstName":"Mohammad","middleName":"","lastName":"Maftoun","suffix":""},{"id":513259861,"identity":"4c1442e9-982a-42ef-b891-72dfc5eccc90","order_by":2,"name":"Mohammad Ali Nematollahi","email":"","orcid":"","institution":"Fasa 
University","correspondingAuthor":false,"prefix":"","firstName":"Mohammad","middleName":"Ali","lastName":"Nematollahi","suffix":""},{"id":513259862,"identity":"34933b6f-d594-42b9-8a3e-1ed47380f6d5","order_by":3,"name":"Kandala N V P S Rajesh","email":"","orcid":"","institution":"VIT-AP University","correspondingAuthor":false,"prefix":"","firstName":"Kandala","middleName":"N V P S","lastName":"Rajesh","suffix":""},{"id":513259863,"identity":"965f561e-71ec-480e-9244-c135aeee68f4","order_by":4,"name":"S Prasanth Vaidya","email":"","orcid":"","institution":"BV RIT HYDERABAD College of Engineering for Women","correspondingAuthor":false,"prefix":"","firstName":"S","middleName":"Prasanth","lastName":"Vaidya","suffix":""},{"id":513259864,"identity":"b0f071fe-9a46-48c4-a0b8-1f3f0a02cc40","order_by":5,"name":"Kamireddy Rasool Reddy","email":"","orcid":"","institution":"NRI Institute of Technology","correspondingAuthor":false,"prefix":"","firstName":"Kamireddy","middleName":"Rasool","lastName":"Reddy","suffix":""},{"id":513259865,"identity":"b36780a9-3a16-447a-a3b0-975bab8ea126","order_by":6,"name":"Pirhossein Kolivand","email":"","orcid":"","institution":"Red Crescent Society of the Islamic Republic of Iran","correspondingAuthor":false,"prefix":"","firstName":"Pirhossein","middleName":"","lastName":"Kolivand","suffix":""}],"badges":[],"createdAt":"2025-08-11 09:38:27","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-7344799/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-7344799/v1","draftVersion":[],"editorialEvents":[{"content":"https://doi.org/10.1038/s41598-025-31928-9","type":"published","date":"2025-12-10T15:58:59+00:00"}],"editorialNote":"","failedWorkflow":false,"files":[{"id":91179761,"identity":"0a96ec48-d32d-419e-ba2b-1bbdb7551d31","added_by":"auto","created_at":"2025-09-12 12:38:51","extension":"png","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":46544,"visible":true,"origin":"","legend":"\u003cp\u003eBlock 
diagram of the proposed method.\u003c/p\u003e","description":"","filename":"floatimage1.png","url":"https://assets-eu.researchsquare.com/files/rs-7344799/v1/957e4dadaa2886f37b01728d.png"},{"id":91179758,"identity":"b04eb0e6-e5fc-4386-aebd-d691d4ffbbf4","added_by":"auto","created_at":"2025-09-12 12:38:50","extension":"png","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":373294,"visible":true,"origin":"","legend":"\u003cp\u003eCorrelation matrix of features with heatmap.\u003c/p\u003e","description":"","filename":"floatimage2.png","url":"https://assets-eu.researchsquare.com/files/rs-7344799/v1/0db60e8b8557e46991bef69a.png"},{"id":91179754,"identity":"f66bd512-2f00-45b1-976a-3497bd4fb920","added_by":"auto","created_at":"2025-09-12 12:38:50","extension":"jpeg","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":213543,"visible":true,"origin":"","legend":"\u003cp\u003eROC graph of the proposed model.\u003c/p\u003e","description":"","filename":"floatimage3.jpeg","url":"https://assets-eu.researchsquare.com/files/rs-7344799/v1/195a3585ecacb6f7c7173e22.jpeg"},{"id":91179755,"identity":"eb886acb-6bab-4c24-8572-2d0351784040","added_by":"auto","created_at":"2025-09-12 12:38:50","extension":"png","order_by":4,"title":"Figure 4","display":"","copyAsset":false,"role":"figure","size":117454,"visible":true,"origin":"","legend":"\u003cp\u003eBar charts of the used models based on evaluation criteria.\u003c/p\u003e","description":"","filename":"floatimage4.png","url":"https://assets-eu.researchsquare.com/files/rs-7344799/v1/a27739e0a2be1bb5aa931aec.png"},{"id":91179756,"identity":"d64c7758-1e63-464c-91a6-df9899cacfe8","added_by":"auto","created_at":"2025-09-12 12:38:50","extension":"png","order_by":5,"title":"Figure 5","display":"","copyAsset":false,"role":"figure","size":54156,"visible":true,"origin":"","legend":"\u003cp\u003ePerformance Comparison of Different 
Methods.\u003c/p\u003e","description":"","filename":"floatimage5.png","url":"https://assets-eu.researchsquare.com/files/rs-7344799/v1/9e9d9ccd72981bad42b290a4.png"},{"id":98245153,"identity":"6a836c42-2903-4f81-ad4b-fde8c917296e","added_by":"auto","created_at":"2025-12-15 16:16:46","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":2045789,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-7344799/v1/90404b63-085c-4c39-a9f0-dd50991157d0.pdf"}],"financialInterests":"No competing interests reported.","formattedTitle":"Predicting Diabetes Mellitus using Conditional Tabular Generative Adversarial Networks combined with MLP based on Body Composition Data","fulltext":[{"header":"1. Introduction","content":"\u003cp\u003eDiabetes is a chronic metabolic condition where the body fails to effectively regulate blood sugar levels due to insufficient insulin production or insulin resistance [\u003cspan class=\"CitationRef\"\u003e1\u003c/span\u003e]. Its prevalence continues to increase due to sedentary lifestyles, unhealthy diets, and genetic predispositions. Diabetes is classified mainly into three types: Type 1 (T1D), Type 2 (T2D), and Gestational Diabetes (GD).\u003c/p\u003e\n\u003cp\u003eT1D generally develops from pancreatic beta cell loss, mainly in children [\u003cspan class=\"CitationRef\"\u003e2\u003c/span\u003e, \u003cspan class=\"CitationRef\"\u003e3\u003c/span\u003e]. T2D arises when the body cannot produce enough insulin to maintain normal blood sugar levels, primarily affecting adults [\u003cspan class=\"CitationRef\"\u003e4\u003c/span\u003e]. GD is a temporary condition during pregnancy that usually resolves after delivery, although it may increase the risk of T2D later in life [\u003cspan class=\"CitationRef\"\u003e5\u003c/span\u003e]. T2D accounts for 90% of all diabetes cases. 
Symptoms vary according to the type and severity, but common ones include polyuria, polydipsia, polyphagia, weight loss, blurred vision, and slow-healing wounds [\u003cspan class=\"CitationRef\"\u003e6\u003c/span\u003e]. According to the International Diabetes Federation (IDF) [\u003cspan class=\"CitationRef\"\u003e7\u003c/span\u003e] and the World Health Organization (WHO) [\u003cspan class=\"CitationRef\"\u003e8\u003c/span\u003e], 537\u0026nbsp;million adults (20\u0026ndash;79 years) had diabetes in 2021.\u003c/p\u003e\n\u003cp\u003eThis figure is projected to increase to 643\u0026nbsp;million by 2030 and 783\u0026nbsp;million by 2045 [\u003cspan class=\"CitationRef\"\u003e7\u003c/span\u003e]. In 2021, around 6.7\u0026nbsp;million deaths were attributed to diabetes, making it a major cause of mortality. The economic burden is vast, with healthcare costs exceeding \u003cspan\u003e$\u003c/span\u003e966 billion\u0026mdash;a 316% rise over the past 15 years [\u003cspan class=\"CitationRef\"\u003e8\u003c/span\u003e]. Additionally, 44.7% of diabetic adults remain undiagnosed, resulting in severe complications due to delayed care.\u003c/p\u003e\n\u003cp\u003eEarly detection is thus vital to prevent complications like heart disease, kidney failure, nerve damage, and blindness.\u003c/p\u003e\n\u003cp\u003eTraditional diagnostic methods are often time-consuming and prone to subjective errors.\u003c/p\u003e\n\u003cp\u003eAdvances in medical research and Machine Learning (ML) have significantly improved prediction models [\u003cspan class=\"CitationRef\"\u003e9\u003c/span\u003e].\u003c/p\u003e\n\u003cp\u003eThese techniques enable deeper analysis of complex healthcare data, revealing patterns that conventional methods might miss [\u003cspan class=\"CitationRef\"\u003e10\u003c/span\u003e].\u003c/p\u003e\n\u003cp\u003eML models can enhance diagnostic accuracy and support timely interventions, especially as access to medical data grows. 
As a result, ML is now central to building automated systems capable of identifying individuals at diabetes risk [\u003cspan class=\"CitationRef\"\u003e11\u003c/span\u003e]. Traditional ML techniques like Artificial Neural Network (ANN) and Support Vector Machine (SVM) are relatively simple and interpretable.\u003c/p\u003e\n\u003cp\u003eHowever, the complex nature of medical data\u0026mdash;featuring nonlinearity, missing values, and class imbalance\u0026mdash;limits their effectiveness [\u003cspan class=\"CitationRef\"\u003e12\u003c/span\u003e]. Despite the promising results from Nematollahi et al., challenges persist in improving prediction accuracy, particularly when complex interactions between multiple features occur [\u003cspan class=\"CitationRef\"\u003e13\u003c/span\u003e]. These limitations can reduce accuracy, especially when multiple features interact in complex ways. Furthermore, using a single model may lead to overfitting on noisy or skewed datasets. To overcome this, ensemble learning techniques have gained popularity. By combining the strengths of models like gradient boosting for outlier handling and k-nearest neighbors for detecting local trends, ensemble methods offer a more robust and adaptable diagnostic approach.\u003c/p\u003e\n\u003cp\u003eBuilding on this foundation, the present study offers a comprehensive exploration of both machine learning and deep learning models for predicting type 2 diabetes using body composition data. By integrating advanced preprocessing techniques and innovative data augmentation via Conditional Tabular Generative Adversarial Networks (CTGAN), our approach aims to overcome common challenges such as missing data and class imbalance. We systematically evaluated a diverse set of models to identify the most effective strategies for capturing complex patterns within the data. 
This combined methodology not only improves predictive accuracy but also enhances the robustness and generalizability of the results, paving the way for more reliable clinical decision support tools in diabetes risk assessment.\u003c/p\u003e\n\u003cp\u003eThe main contributions of this work are:\u003c/p\u003e\n\u003cp\u003e\u003cspan\u003e\u003c/span\u003e\u003c/p\u003e\n\u003cp\u003e1) We used real-world data from a subset of the Fasa Cohort Study in Iran, despite challenges such as outliers and class imbalance.\u003c/p\u003e\n\u003cp\u003e\u003c/p\u003e\n\u003cp\u003e2) Outliers were managed using min-max feature scaling.\u003c/p\u003e\n\u003cp\u003e3) To address class imbalance, we applied a CTGAN.\u003c/p\u003e\n\u003cp\u003e4) Ten classification models were evaluated using 5-fold cross-validation: Multilayer Perceptron (MLP), gradient boosting, random forest, lightGBM, TabNet, Extreme Gradient Boosting (XGBoost), AdaBoost, logistic regression, decision tree, and Linear Discriminant Analysis (LDA).\u003c/p\u003e\n\u003cp\u003e5) Among all, MLP achieved the highest accuracy of 93.91%, outperforming existing models.\u003c/p\u003e\n\u003cp\u003eThe rest of the paper is organized as follows: Section 2 reviews conventional and advanced diabetes prediction methods. Section 3 details the materials and methodology. Section 4 presents experimental results, and Section \u003cspan class=\"InternalRef\"\u003e5\u003c/span\u003e concludes with contributions and future research directions.\u003c/p\u003e"},{"header":"2. Related work","content":"\u003cp\u003eSeveral studies have explored diabetes prevalence and prediction using various traditional and advanced methods. Guariguata et al. employed Logistic Regression (LR) on data from 565 sources stored in a MySQL database, highlighting advantages such as simplicity, adaptability, and reproducibility [\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e]. 
However, their method lacked age-specific estimates and did not consider lifestyle or obesity factors. Pe\u0026ntilde;a et al. incorporated anthropometric and biochemical measurements, physical activity, and diet to estimate type 2 diabetes prevalence in Mexico City, accounting for crucial lifestyle factors but unable to establish strong diet-exercise relationships due to sample limitations [\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e]. Lee and Kim investigated T2D risk in Korean adults using anthropometry, Waist Circumference (WC), and triglycerides (TG) with binary LR and Naive Bayes (NB) and 10-fold cross-validation; however, reliance on raw WC and TG values limited generalizability and causal interpretation [\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e]. Zhu et al. examined abdominal fat distribution using CT imaging and LR but did not analyze fat distribution among healthy individuals [\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e].\u003c/p\u003e\u003cp\u003eElgendy et al. introduced an explainable two-stage ensemble combining Local Outlier Factor (LOF), autoencoders, Synthetic Minority Over-sampling Technique (SMOTE), and shapley additive explanations (SHAP) on the MIMIC-IV dataset, achieving 92.54% accuracy [\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e]. Additionally, they proposed a graph-based framework modeling relationships among patients with similar conditions through a patient network, employing centrality measures and demographic features to train multiple classifiers. 
The Random Forest model performed best, with area under the ROC curve (AUC) values between 0.79 and 0.91, demonstrating the value of latent structural information in prediction.\u003c/p\u003e\u003cp\u003eBuilding on this, in [\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e], a classification pipeline was developed using a weighted ensemble of five machine learning models, including Decision Tree (DT), Random Forest (RF), XGBoost (XGB), and Light Gradient Boosting Machines (LightGBM). Key preprocessing steps involved missing value imputation, feature selection, and hyperparameter tuning via grid search. This ensemble achieved an accuracy of 73.5% and an AUC of 0.832, showing substantial performance improvement over individual models.\u003c/p\u003e\u003cp\u003eNaseem et al. implemented a patient health monitoring system utilizing six machine learning models, including both traditional and deep learning approaches [\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e]. Among them, the Recurrent Neural Network (RNN) achieved the highest accuracy (81%), while the Artificial Neural Network (ANN) delivered the highest recall. This system was designed to assist early diagnosis of chronic diseases by leveraging ML-based decision support.\u003c/p\u003e\u003cp\u003eNematollahi et al. further explored diabetes prediction on Fasa cohort data, applying XGBoost with oversampling via ADASYN to address class imbalance, achieving 89.96% accuracy [\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e]. Moreover, a notable recent study by Nematollahi et al. examined the association between body fat distribution and diabetes using machine learning and Analysis of Variance (ANOVA) [\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e]. 
By combining individual classifiers and ensemble learning, alongside ADASYN oversampling, they reached an impressive 92.04% accuracy using XGBoost.\u003c/p\u003e\u003cp\u003eTo improve classification approaches, authors in [\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e] proposed a hybrid Support Vector Machine (SVM) kernel based on Radial Basis Function (RBF) and city-block metrics. They addressed class imbalance with SMOTE and data quality via median imputation. The model achieved 87% precision, underscoring its potential for clinical diagnostics of T2D.\u003c/p\u003e\u003cp\u003eIn [\u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e23\u003c/span\u003e], an ensemble-based AI system was developed employing Harmony Search for feature selection and hyperparameter optimization. Tested on both Western and Eastern medical datasets, the system attained 93.09% accuracy on the PIMA dataset, demonstrating efficient model complexity reduction while maintaining strong predictive power.\u003c/p\u003e\u003cp\u003eThe KE-CNN model introduced in [\u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e24\u003c/span\u003e] applied a hybrid deep learning approach combining medical entity recognition with semantic knowledge expansion. Utilizing tools such as Bidirectional Encoder Representations from Transformers-Bidirectional Long Short-Term Memory-Conditional Random Fields (BERT-BiLSTM-CRF) and Word2Vec, this model captured richer feature representations, improving diabetes prediction. Its dual-channel CNN framework further enhanced accuracy by integrating structured and unstructured data inputs.\u003c/p\u003e\u003cp\u003eFurthermore, Mushtaq et al. proposed a voting ensemble classifier composed of Naive Bayes (NB), Random Forest (RF), and Gradient Boosting models to mitigate outliers and class imbalance in diabetes datasets [\u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e25\u003c/span\u003e]. 
Data preprocessing included Tomek links for cleaning, SMOTE for balancing, and Interquartile Range (IQR)-based outlier removal. The ensemble achieved up to 82% accuracy, indicating reliability for early-stage diabetes detection.\u003c/p\u003e\u003cp\u003eNurzari et al. applied modified SMOTE and Random Forest classifiers, reaching an outstanding 99.7% accuracy [\u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e26\u003c/span\u003e]. However, despite the high accuracy reported in studies using SMOTE [\u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e26\u003c/span\u003e, \u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e27\u003c/span\u003e], these methods suffer from limited sample diversity and issues near class boundaries.\u003c/p\u003e\u003cp\u003eOur model addresses these limitations by employing conditional tabular generative adversarial networks, representing a key contribution. Additionally, while most prior works rely on a single classifier, our approach evaluates different models, enabling a more comprehensive comparison for T2D prediction on body composition data. The proposed approach demonstrated outstanding performance across several evaluation metrics, achieving an accuracy of 93.91%.\u003c/p\u003e"},{"header":"3. Material and methods","content":"\u003cp\u003eThis section describes the methodological framework adopted in the present study. The workflow of the proposed method is summarized in Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e. As illustrated, the process begins with dataset importation into Google Colab, followed by a series of data preprocessing steps including handling missing values, data cleaning, and categorical encoding. Subsequently, features are normalized or standardized to ensure consistent scaling across predictors. In the next stage, synthetic data are generated using CTGAN to enrich the dataset and address data imbalance issues. 
The dataset is then partitioned into training and testing subsets, with model evaluation conducted using stratified 5-fold cross-validation. Finally, a range of machine learning and deep learning classifiers are trained and compared, and their performance is assessed through established evaluation metrics to identify the most effective model for diabetes prediction.\u003c/p\u003e\u003cp\u003eTo enhance clarity, this section is organized into the following subsections: Data Source and Ethical Considerations (3.1), Data Preprocessing (3.2), Synthetic Data Generation with CTGAN (3.3), Model Training and Validation (3.4), and Classifiers (3.5). The pseudocode of the proposed approach is presented below.\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"No\" id=\"Taba\" border=\"1\"\u003e\u003ccolgroup cols=\"1\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u003cp\u003eBegin\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e1. Data Collection \u0026amp; Preprocessing\u003c/p\u003e\u003cp\u003e- Load dataset (subset of Fasa Cohort Study)\u003c/p\u003e\u003cp\u003e- Inspect data for missing values\u003c/p\u003e\u003cp\u003e- Handle missing values (e.g., imputation/removal)\u003c/p\u003e\u003cp\u003e2. Data Cleaning \u0026amp; Encoding\u003c/p\u003e\u003cp\u003e- Convert categorical features into numerical form (e.g., One-Hot Encoding / Label Encoding)\u003c/p\u003e\u003cp\u003e- Detect and handle outliers\u003c/p\u003e\u003cp\u003e3. Feature Scaling\u003c/p\u003e\u003cp\u003e- Apply Min-Max Scaling:\u003c/p\u003e \u003cp\u003eX_scaled = (X - X_min) / (X_max - X_min)\u003c/p\u003e\u003cp\u003e4. 
Data Augmentation\u003c/p\u003e\u003cp\u003e- Generate synthetic samples using CTGAN\u003c/p\u003e\u003cp\u003e- Merge synthetic data with original dataset\u003c/p\u003e\u003cp\u003e5. Data Splitting \u0026amp; Cross Validation\u003c/p\u003e\u003cp\u003e- Apply stratified 5-fold cross-validation\u003c/p\u003e\u003cp\u003e- Split data into training and validation sets in each fold\u003c/p\u003e\u003cp\u003e6. Model Training \u0026amp; Evaluation\u003c/p\u003e\u003cp\u003eFor each classifier in {MLP, XGBoost, Gradient Boosting, Random Forest,\u003c/p\u003e\u003cp\u003eLightGBM, TabNet, Logistic Regression,\u003c/p\u003e\u003cp\u003eLinear Discriminant Analysis, AdaBoost, Decision Tree}:\u003c/p\u003e\u003cp\u003e- Train model on training data\u003c/p\u003e\u003cp\u003e- Validate model on validation data\u003c/p\u003e\u003cp\u003e- Record performance metrics (Accuracy, Precision, Recall, F1, AUC, etc.)\u003c/p\u003e\u003cp\u003e7. Result Interpretation\u003c/p\u003e\u003cp\u003e- Compare performance of all models\u003c/p\u003e\u003cp\u003e- Identify best performing model based on validation metrics\u003c/p\u003e\u003cp\u003e- Report test set performance of the selected model\u003c/p\u003e\u003cp\u003eEnd\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/colgroup\u003e\u003c/table\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cdiv id=\"Sec7\" class=\"Section2\"\u003e\u003ch2\u003e3.1. Data Source and Ethical Considerations\u003c/h2\u003e\u003cp\u003eThe dataset used in this study is a subset of the Fasa Cohort Study conducted in Iran [\u003cspan citationid=\"CR28\" class=\"CitationRef\"\u003e28\u003c/span\u003e]. The original cohort investigates the association between various risk factors and chronic non-communicable diseases (NCDs) among rural residents of Fasa, a city with a population of approximately 250,000 in Fars Province [\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e]. 
For the purpose of this study, we focused exclusively on participants with diabetes and healthy participants, along with their body composition measures. Our subset consists of 4,661 samples, including 2,155 males and 2,506 females, with ages ranging from 35 to 70 years.\u003c/p\u003e\u003cp\u003eAll methods were carried out in accordance with relevant guidelines and regulations. The study protocol was reviewed and approved by the Medical Ethics Committee of the School of Medicine, Shiraz University of Medical Sciences, Shiraz, Iran (IR.SUMS.MED.REC.1401.167). Informed consent was obtained from all participants prior to their inclusion in the study, ensuring ethical compliance and participant confidentiality.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec8\" class=\"Section2\"\u003e\u003ch2\u003e3.2. Data Preprocessing\u003c/h2\u003e\u003cp\u003eData preprocessing comprises two stages: handling missing values through imputation techniques such as mean, median, mode, or predictive methods; and cleaning and encoding the data by removing duplicates, converting categorical features to numeric format, and treating outliers using statistical approaches. The encoding step can be formally expressed as:\u003cdiv id=\"Equ1\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equ1\" name=\"EquationSource\"\u003e\n$$\\:X\\_encoded\\:=\\:OneHotEncode\\left(X\\right)$$\u003c/div\u003e\u003cdiv class=\"EquationNumber\"\u003e1\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003ewhere \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\$\\:X\\_encoded\$\u003c/span\u003e\u003c/span\u003e is the transformed data. 
Next, features are standardized using Z-score normalization, which ensures consistent scaling across features and centers them around zero with unit variance:\u003cdiv id=\"Equ2\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equ2\" name=\"EquationSource\"\u003e\n$$\\:X\\_scaled\\:=\\:(X\\:-\\:\\mu)\\:/\\:\\sigma$$\u003c/div\u003e\u003cdiv class=\"EquationNumber\"\u003e2\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003eThese steps facilitated homogeneous feature representation and enhanced model stability during training.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec9\" class=\"Section2\"\u003e\u003ch2\u003e3.3. Synthetic Data Generation with CTGAN\u003c/h2\u003e\u003cp\u003eTo address limitations in dataset size and class imbalance, synthetic samples were generated using CTGAN [\u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e29\u003c/span\u003e]. CTGAN has been shown to effectively generate realistic tabular data, especially in healthcare applications where privacy restrictions limit data availability. The synthetic records were integrated with the original dataset, thereby improving the diversity of training samples and supporting model generalization.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec10\" class=\"Section2\"\u003e\u003ch2\u003e3.4. Model Training and Validation\u003c/h2\u003e\u003cp\u003eThe dataset was divided into training and test subsets. Stratified 5-fold cross-validation was adopted to ensure robust performance evaluation while preserving class distribution across folds. Models were trained on the training set and validated across folds, and the final evaluation was performed on the independent test set. Performance metrics included accuracy, precision, recall, F1-score, and the AUC.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec11\" class=\"Section2\"\u003e\u003ch2\u003e3.5. 
Classifiers used\u003c/h2\u003e\u003cp\u003eThis subsection provides a detailed overview of the classifiers investigated in the study. The choice of algorithms was motivated by their proven applicability to medical data, capacity for handling heterogeneous feature types, and varying degrees of interpretability. Below, each classifier is introduced along with its strengths, limitations, and reported applications in diabetes prediction tasks.\u003c/p\u003e\u003cdiv id=\"Sec12\" class=\"Section3\"\u003e\u003ch2\u003e3.5.1. Decision Trees\u003c/h2\u003e\u003cp\u003eDecision tree models are intuitive and interpretable algorithms that recursively partition data into subsets according to feature values. Variants such as CART [\u003cspan citationid=\"CR30\" class=\"CitationRef\"\u003e30\u003c/span\u003e], C4.5 [\u003cspan citationid=\"CR31\" class=\"CitationRef\"\u003e31\u003c/span\u003e], CHAID [\u003cspan citationid=\"CR32\" class=\"CitationRef\"\u003e32\u003c/span\u003e], and QUEST [\u003cspan citationid=\"CR33\" class=\"CitationRef\"\u003e33\u003c/span\u003e] introduce refinements including pruning, multi-way splitting, and unbiased variable selection. Decision trees are suitable both as standalone models and as base learners in ensemble frameworks such as random forests [\u003cspan citationid=\"CR34\" class=\"CitationRef\"\u003e34\u003c/span\u003e].\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec13\" class=\"Section3\"\u003e\u003ch2\u003e3.5.2. Random Forest\u003c/h2\u003e\u003cp\u003eRandom forest employs bootstrap aggregation and random feature selection to generate diverse decision trees, reducing variance and improving generalization [\u003cspan citationid=\"CR34\" class=\"CitationRef\"\u003e34\u003c/span\u003e]. 
In healthcare applications, random forest has demonstrated robustness to noisy and imbalanced data and provides clinically meaningful feature importance rankings [\u003cspan citationid=\"CR35\" class=\"CitationRef\"\u003e35\u003c/span\u003e, \u003cspan citationid=\"CR36\" class=\"CitationRef\"\u003e36\u003c/span\u003e].\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec14\" class=\"Section3\"\u003e\u003ch2\u003e3.5.3. Logistic Regression\u003c/h2\u003e\u003cp\u003eLogistic regression remains a benchmark classifier in medical research due to its interpretability and statistical grounding. It quantifies the impact of clinical features such as glucose levels, age, and BMI on diabetes risk [\u003cspan citationid=\"CR37\" class=\"CitationRef\"\u003e37\u003c/span\u003e, \u003cspan citationid=\"CR38\" class=\"CitationRef\"\u003e38\u003c/span\u003e]. Extensions such as one-vs-rest allow application to multiclass problems, while coefficients provide insights into odds ratios relevant to clinical practice.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec15\" class=\"Section3\"\u003e\u003ch2\u003e3.5.4. Gradient Boosting Machines\u003c/h2\u003e\u003cp\u003eGradient boosting machines iteratively train weak learners to correct errors from prior iterations, producing highly predictive models. They have been successfully applied to diabetes prediction tasks, capturing nonlinear relationships in clinical and behavioral data [\u003cspan citationid=\"CR39\" class=\"CitationRef\"\u003e39\u003c/span\u003e]. However, they require careful hyperparameter tuning to avoid overfitting.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec16\" class=\"Section3\"\u003e\u003ch2\u003e3.5.5. LightGBM\u003c/h2\u003e\u003cp\u003eLightGBM enhances GBM by employing histogram-based algorithms and leaf-wise growth strategies, improving efficiency and scalability [\u003cspan citationid=\"CR40\" class=\"CitationRef\"\u003e40\u003c/span\u003e]. 
It supports categorical features directly and is particularly useful in large-scale medical datasets requiring rapid inference.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec17\" class=\"Section3\"\u003e\u003ch2\u003e3.5.6. XGBoost\u003c/h2\u003e\u003cp\u003eXGBoost introduces second-order optimization and regularization, offering strong predictive power and resilience to missing data [\u003cspan citationid=\"CR41\" class=\"CitationRef\"\u003e41\u003c/span\u003e]. It has consistently outperformed traditional classifiers in healthcare studies, including Type 2 diabetes prediction [\u003cspan citationid=\"CR34\" class=\"CitationRef\"\u003e34\u003c/span\u003e].\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec18\" class=\"Section3\"\u003e\u003ch2\u003e3.5.7. TabNet\u003c/h2\u003e\u003cp\u003eTabNet leverages attention mechanisms to dynamically select features at each decision step, enabling interpretability while modeling complex nonlinear interactions [\u003cspan citationid=\"CR42\" class=\"CitationRef\"\u003e42\u003c/span\u003e]. This is especially valuable for medical prediction tasks where transparency and explainability are crucial.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec19\" class=\"Section3\"\u003e\u003ch2\u003e3.5.8. AdaBoost\u003c/h2\u003e\u003cp\u003eAdaptive Boosting sequentially combines weak learners, emphasizing misclassified instances at each iteration. While effective in clean datasets, it is sensitive to noise and often benefits from integration with data cleaning or balancing strategies [\u003cspan citationid=\"CR43\" class=\"CitationRef\"\u003e43\u003c/span\u003e].\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec20\" class=\"Section3\"\u003e\u003ch2\u003e3.5.9. Linear Discriminant Analysis\u003c/h2\u003e\u003cp\u003eLinear discriminant analysis projects data onto lower-dimensional spaces that maximize class separability. 
Its simplicity and computational efficiency make it valuable for early-stage classification in medical datasets [\u003cspan citationid=\"CR44\" class=\"CitationRef\"\u003e44\u003c/span\u003e]. LDA can also serve as a dimensionality reduction step prior to applying complex classifiers.\u003c/p\u003e\u003c/div\u003e\u003c/div\u003e"},{"header":"4. Results","content":"\u003cp\u003eIn this section, we present the experimental results of the proposed diabetes prediction framework. The simulations were conducted in the Google Colab environment, using a carefully pre-processed dataset derived from the Fasa Cohort Study. We first describe the detailed body composition measures collected from participants in Section \u003cspan refid=\"Sec22\" class=\"InternalRef\"\u003e4.1\u003c/span\u003e, followed by the evaluation metrics used to assess model performance in Section \u003cspan refid=\"Sec23\" class=\"InternalRef\"\u003e4.2\u003c/span\u003e. Section \u003cspan refid=\"Sec24\" class=\"InternalRef\"\u003e4.3\u003c/span\u003e presents the simulation results, including data preprocessing, handling of imbalances, and feature correlations. The cross-validation strategy employed is detailed in Section \u003cspan refid=\"Sec25\" class=\"InternalRef\"\u003e4.4\u003c/span\u003e, while Section \u003cspan refid=\"Sec26\" class=\"InternalRef\"\u003e4.5\u003c/span\u003e discusses the implementation and hyperparameter settings of various machine learning models. Finally, Section \u003cspan refid=\"Sec27\" class=\"InternalRef\"\u003e4.6\u003c/span\u003e compares the performance of our proposed method with existing state-of-the-art approaches.\u003c/p\u003e\u003cdiv id=\"Sec22\" class=\"Section2\"\u003e\u003ch2\u003e4.1. 
Body Composition Measures\u003c/h2\u003e\u003cp\u003eAll participants were assessed for body composition using the FDA-approved Tanita Segmental Body Composition Analyzer BC-418 MA (Tanita Corp, Japan) [\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e]. Data were collected while each subject stood barefoot on the device and gripped the attached handles, allowing bioelectrical impedance measurements through eight polar electrodes at the contact points. These measurements determined total body water, fat mass, fat-free mass, and fat percentage for the entire body and for specific regions on the left and right sides, as well as basal metabolic rate.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec23\" class=\"Section2\"\u003e\u003ch2\u003e4.2. Evaluation metrics\u003c/h2\u003e\u003cp\u003eThe model\u0026rsquo;s performance is evaluated using accuracy, precision, recall, and F1-score metrics. The formulas used to calculate each evaluation measure are presented in Table\u0026nbsp;\u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e1\u003c/span\u003e.\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab1\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003eEvaluation Metrics for Classification Models.\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"2\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u003cp\u003eMetric\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u003cp\u003eFormula\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" 
colname=\"c1\"\u003e\u003cp\u003eAccuracy\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e(TP\u0026thinsp;+\u0026thinsp;TN) / (TP\u0026thinsp;+\u0026thinsp;TN\u0026thinsp;+\u0026thinsp;FP\u0026thinsp;+\u0026thinsp;FN)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eTrue Positive Rate (TPR)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eTP / (TP\u0026thinsp;+\u0026thinsp;FN)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eFalse Positive Rate (FPR)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eFP / (FP\u0026thinsp;+\u0026thinsp;TN)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eAUROC\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eAUROC = \u0026int;₀\u0026sup1; TPR d(FPR)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003ePrecision\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eTP / (TP\u0026thinsp;+\u0026thinsp;FP)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eRecall\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eTP / (TP\u0026thinsp;+\u0026thinsp;FN)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eF1 Score\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e2 \u0026times; Precision \u0026times; Recall / (Precision\u0026thinsp;+\u0026thinsp;Recall)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eMatthews Corr. 
Coefficient (MCC)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e((TP \u0026times; TN) \u0026minus; (FP \u0026times; FN)) / \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\$\\sqrt{(\\text{TP}+\\text{FP})(\\text{TP}+\\text{FN})(\\text{TN}+\\text{FP})(\\text{TN}+\\text{FN})}\$\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eG-Mean\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\$\\sqrt{\\text{TPR}\\times(1-\\text{FPR})}\$\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/colgroup\u003e\u003c/table\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003eBased on the definitions presented in Table\u0026nbsp;1, the key evaluation terms used in the performance metrics are explained as follows:\u003c/p\u003e\u003cp\u003e\u0026bull; \u003cb\u003eTrue Positives (TP)\u003c/b\u003e: The number of samples that actually belong to the positive class and are correctly predicted as positive by the model.\u003c/p\u003e\u003cp\u003e\u0026bull; \u003cb\u003eTrue Negatives (TN)\u003c/b\u003e: The number of samples that actually belong to the negative class and are correctly predicted as negative by the model.\u003c/p\u003e\u003cp\u003e\u0026bull; \u003cb\u003eFalse Positives (FP)\u003c/b\u003e: The number of samples that actually belong to the negative class but are incorrectly predicted as positive by the model.\u003c/p\u003e\u003cp\u003e\u0026bull; \u003cb\u003eFalse Negatives 
(FN)\u003c/b\u003e: The number of samples that actually belong to the positive class but are incorrectly predicted as negative by the model.\u003c/p\u003e\u003cp\u003e\u0026bull; \u003cb\u003eTrue Positive Rate (TPR)\u003c/b\u003e: The proportion of actual positive samples that are correctly identified as positive.\u003c/p\u003e\u003cp\u003e\u0026bull; \u003cb\u003eFalse Positive Rate (FPR)\u003c/b\u003e: The proportion of actual negative samples that are incorrectly identified as positive.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec24\" class=\"Section2\"\u003e\u003ch2\u003e4.3. Simulation Results\u003c/h2\u003e\u003cp\u003eAs shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e, the collected data are first pre-processed. The pre-processing involves the following steps.\u003c/p\u003e\u003cp\u003e\u003col\u003e\u003cspan\u003e\u003cli\u003e\u003cp\u003eReal-world data often contain missing entries, and leaving them unaddressed can distort the measured accuracy of the model. We therefore applied three strategies for handling missing values. First, mode imputation replaces missing values in categorical features with the most frequent value. Second, mean imputation replaces missing values in numerical features with the mean of the feature column. Third, any sample (row) whose target label is missing is dropped entirely.\u003c/p\u003e\u003c/li\u003e\u003c/span\u003e\u003cspan\u003e\u003cli\u003e\u003cp\u003eNext, all features are converted to numerical values using the encoding scheme presented in Section 3.2 so that they can be processed further. 
The statistical description of the data after the above step is shown in Table\u0026nbsp;\u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e1\u003c/span\u003e.\u003c/p\u003e\u003c/li\u003e\u003c/span\u003e\u003c/ol\u003e\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab2\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003eThe Descriptive statistics of dataset features.\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"10\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c7\" colnum=\"7\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c8\" colnum=\"8\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c9\" colnum=\"9\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c10\" colnum=\"10\"\u003e\u003c/div\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u003cp\u003eColumn\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u003cp\u003edtype\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c3\"\u003e\u003cp\u003eunique\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" 
colname=\"c4\"\u003e\u003cp\u003emin\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c5\"\u003e\u003cp\u003emax\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c6\"\u003e\u003cp\u003emedian\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c7\"\u003e\u003cp\u003eStandard deviation\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c8\"\u003e\u003cp\u003eoutliers\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c9\"\u003e\u003cp\u003elower bound\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c10\"\u003e\u003cp\u003eupper bound\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eGenderID\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eint64\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e2\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e1\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e2\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e2.0\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e\u003cp\u003e0.5\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e\u003cp\u003e0\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c9\"\u003e\u003cp\u003e-0.5\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c10\"\u003e\u003cp\u003e3.5\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eAge\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003efloat64\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e37\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" 
colname=\"c4\"\u003e\u003cp\u003e35.0\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e70.0\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e46.0\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e\u003cp\u003e9.36\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e\u003cp\u003e0\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c9\"\u003e\u003cp\u003e15.0\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c10\"\u003e\u003cp\u003e79.0\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003ebmr\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eint64\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e3703\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e1132\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e5837\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e2071.0\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e\u003cp\u003e1019.19\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e\u003cp\u003e87\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c9\"\u003e\u003cp\u003e3301.0\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c10\"\u003e\u003cp\u003e8557.0\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eFATP\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003efloat64\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e471\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" 
colname=\"c4\"\u003e\u003cp\u003e1.5\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e53.6\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e27.8\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e\u003cp\u003e10.04\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e\u003cp\u003e0\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c9\"\u003e\u003cp\u003e-2.2\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c10\"\u003e\u003cp\u003e57.8\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eFATM\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003efloat64\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e443\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e0.7\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e67.5\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e18.7\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e\u003cp\u003e9.05\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e\u003cp\u003e52\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c9\"\u003e\u003cp\u003e-5.65\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c10\"\u003e\u003cp\u003e42.75\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eFFM\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003efloat64\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e439\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" 
colname=\"c4\"\u003e\u003cp\u003e8.5\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e83.6\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e46.8\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e\u003cp\u003e8.78\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e\u003cp\u003e46\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c9\"\u003e\u003cp\u003e23.95\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c10\"\u003e\u003cp\u003e71.35\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eTBW\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003efloat64\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e401\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e20.9\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e62.4\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e34.3\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e\u003cp\u003e6.43\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e\u003cp\u003e46\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c9\"\u003e\u003cp\u003e17.55\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c10\"\u003e\u003cp\u003e52.35\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eIMP\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eint64\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e435\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" 
colname=\"c4\"\u003e\u003cp\u003e32\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e949\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e613.0\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e\u003cp\u003e79.63\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e\u003cp\u003e71\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c9\"\u003e\u003cp\u003e411.0\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c10\"\u003e\u003cp\u003e819.0\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eRLFATP\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003efloat64\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e490\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e1.5\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e55.5\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e32.9\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e\u003cp\u003e12.93\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e\u003cp\u003e0\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c9\"\u003e\u003cp\u003e-18.3\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c10\"\u003e\u003cp\u003e76.9\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eRLFATM\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003efloat64\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e104\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" 
colname=\"c4\"\u003e\u003cp\u003e0.1\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e12.9\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e3.5\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e\u003cp\u003e1.91\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e\u003cp\u003e12\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c9\"\u003e\u003cp\u003e-2.5\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c10\"\u003e\u003cp\u003e9.5\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eRLFFM\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003efloat64\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e101\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e4.7\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e18.4\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e7.9\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e\u003cp\u003e1.65\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e\u003cp\u003e43\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c9\"\u003e\u003cp\u003e3.45\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c10\"\u003e\u003cp\u003e12.65\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eLLFATP\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003efloat64\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e478\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" 
colname=\"c4\"\u003e\u003cp\u003e1.5\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e55.2\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e33.0\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e\u003cp\u003e12.8\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e\u003cp\u003e0\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c9\"\u003e\u003cp\u003e-17.65\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c10\"\u003e\u003cp\u003e76.35\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eLLFATM\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003efloat64\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e102\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e0.2\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e12.9\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e3.4\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e\u003cp\u003e1.89\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e\u003cp\u003e12\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c9\"\u003e\u003cp\u003e-2.48\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c10\"\u003e\u003cp\u003e9.33\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eLLFFM\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003efloat64\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e97\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" 
colname=\"c4\"\u003e\u003cp\u003e4.7\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e17.5\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e7.7\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e\u003cp\u003e1.61\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e\u003cp\u003e48\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c9\"\u003e\u003cp\u003e3.5\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c10\"\u003e\u003cp\u003e12.3\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eRAFATP\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003efloat64\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e490\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e2.3\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e63.4\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e25.3\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e\u003cp\u003e11.48\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e\u003cp\u003e0\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c9\"\u003e\u003cp\u003e-10.6\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c10\"\u003e\u003cp\u003e64.6\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eRAFATM\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003efloat64\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e41\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" 
colname=\"c4\"\u003e\u003cp\u003e1.0\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e5.3\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e2.4\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e\u003cp\u003e0.56\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e\u003cp\u003e109\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c9\"\u003e\u003cp\u003e0.45\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c10\"\u003e\u003cp\u003e4.35\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eRAFFM\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003efloat64\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e41\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e0.1\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e5.6\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e2.4\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e\u003cp\u003e0.65\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e\u003cp\u003e48\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c9\"\u003e\u003cp\u003e-0.75\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c10\"\u003e\u003cp\u003e4.35\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eLAFATP\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003efloat64\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e503\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" 
colname=\"c4\"\u003e\u003cp\u003e2.2\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e64.3\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e26.4\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e\u003cp\u003e11.65\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e\u003cp\u003e0\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c9\"\u003e\u003cp\u003e-10.65\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c10\"\u003e\u003cp\u003e65.75\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eLAFATM\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003efloat64\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e45\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e0.1\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e6.2\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e2.0\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e\u003cp\u003e0.62\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e\u003cp\u003e111\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c9\"\u003e\u003cp\u003e-0.6\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c10\"\u003e\u003cp\u003e4.6\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eLAFFM\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003efloat64\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e41\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" 
colname=\"c4\"\u003e\u003cp\u003e1.3\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e5.3\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e2.4\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e\u003cp\u003e0.65\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e\u003cp\u003e51\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c9\"\u003e\u003cp\u003e0.75\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c10\"\u003e\u003cp\u003e4.35\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eTRFATP\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003efloat64\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e439\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e3.0\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e51.0\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e26.9\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e\u003cp\u003e9.18\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e\u003cp\u003e2\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c9\"\u003e\u003cp\u003e-2.05\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c10\"\u003e\u003cp\u003e55.95\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eTRFATM\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003efloat64\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e244\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" 
colname=\"c4\"\u003e\u003cp\u003e1.0\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e10.3\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e4.4\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e\u003cp\u003e1.46\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e\u003cp\u003e27\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c9\"\u003e\u003cp\u003e-1.85\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c10\"\u003e\u003cp\u003e10.65\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eTRFFM\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003efloat64\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e242\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e15.7\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e44.9\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e26.2\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e\u003cp\u003e4.37\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e\u003cp\u003e52\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c9\"\u003e\u003cp\u003e15.0\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c10\"\u003e\u003cp\u003e38.2\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/colgroup\u003e\u003c/table\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003eTable\u0026nbsp;\u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e1\u003c/span\u003e shows that the feature values span widely different ranges and that many outliers are present. 
Therefore, these issues are addressed using the preprocessing steps mentioned in Section 3. The accuracy of ML algorithms depends heavily on class balance. When the data classes are imbalanced or skewed, the results are biased towards the majority class labels. Hence, we applied a synthetic minority sample generator (CTGAN) to increase the minority samples (diabetes\u0026thinsp;=\u0026thinsp;571) to equal the majority samples (healthy\u0026thinsp;=\u0026thinsp;4661). As a result, the dataset size becomes 9322 (diabetes\u0026thinsp;=\u0026thinsp;4661 and healthy\u0026thinsp;=\u0026thinsp;4661). To further understand the relationship between these body composition features and the class labels, we present a correlation matrix as a heat map in Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec25\" class=\"Section2\"\u003e\u003ch2\u003e4.4. Cross-Validation\u003c/h2\u003e\u003cp\u003eThe next important step before training an ML model is cross-validation, which determines how we split our data for training, validation, and testing. We used stratified five-fold cross-validation. The dataset is first divided into five equal-sized subsets (folds) while ensuring that each fold maintains the same proportion of each class as the original dataset. The model is then trained five times, each time using four folds for training and the remaining fold for validation. This process helps reduce bias and variance, providing a more reliable estimate of model performance, especially for imbalanced datasets.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec26\" class=\"Section2\"\u003e\u003ch2\u003e4.5. ML implementation\u003c/h2\u003e\u003cp\u003eFinally, we trained several ML algorithms simultaneously, as shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e. 
Table\u0026nbsp;\u003cspan refid=\"Tab3\" class=\"InternalRef\"\u003e2\u003c/span\u003e presents the hyperparameters used for all these models. Table\u0026nbsp;\u003cspan refid=\"Tab4\" class=\"InternalRef\"\u003e3\u003c/span\u003e shows the final performance measures of all these models on our data. The table shows that almost all the models achieve strong results, but MLP outperforms all the other models on every performance measure. The area under the curve plot for this model is shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003e, and a comparative visualization of all models based on the evaluation criteria is illustrated in Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003e.\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab3\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 2\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003eML models and parameters.\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"2\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u003cp\u003eModel\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u003cp\u003eParameters\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eMLP\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003ehidden_layer_sizes=(100,), activation=\u0026lsquo;relu\u0026rsquo;, solver=\u0026lsquo;adam\u0026rsquo;, alpha\u0026thinsp;=\u0026thinsp;0.0001, batch_size=\u0026lsquo;auto\u0026rsquo;, 
learning_rate=\u0026lsquo;constant\u0026rsquo;, max_iter\u0026thinsp;=\u0026thinsp;200\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eGradient Boosting\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eloss=\u0026lsquo;log_loss\u0026rsquo;, learning_rate\u0026thinsp;=\u0026thinsp;0.1, n_estimators\u0026thinsp;=\u0026thinsp;100, max_depth\u0026thinsp;=\u0026thinsp;3, min_samples_split\u0026thinsp;=\u0026thinsp;2, min_samples_leaf\u0026thinsp;=\u0026thinsp;1, subsample\u0026thinsp;=\u0026thinsp;1.0\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eLightGBM\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eboosting_type=\u0026lsquo;gbdt\u0026rsquo;, num_leaves\u0026thinsp;=\u0026thinsp;31, learning_rate\u0026thinsp;=\u0026thinsp;0.1, n_estimators\u0026thinsp;=\u0026thinsp;100, max_depth=-1, min_child_samples\u0026thinsp;=\u0026thinsp;20, reg_alpha\u0026thinsp;=\u0026thinsp;0.0, reg_lambda\u0026thinsp;=\u0026thinsp;0.0\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eXGBoost\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003ebooster=\u0026lsquo;gbtree\u0026rsquo;, learning_rate\u0026thinsp;=\u0026thinsp;0.3, n_estimators\u0026thinsp;=\u0026thinsp;100, max_depth\u0026thinsp;=\u0026thinsp;6, min_child_weight\u0026thinsp;=\u0026thinsp;1, subsample\u0026thinsp;=\u0026thinsp;1.0, colsample_bytree\u0026thinsp;=\u0026thinsp;1.0, reg_alpha\u0026thinsp;=\u0026thinsp;0, reg_lambda\u0026thinsp;=\u0026thinsp;1\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eRandom Forest\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003en_estimators\u0026thinsp;=\u0026thinsp;100, 
criterion=\u0026lsquo;gini\u0026rsquo;, max_depth\u0026thinsp;=\u0026thinsp;None, min_samples_split\u0026thinsp;=\u0026thinsp;2, min_samples_leaf\u0026thinsp;=\u0026thinsp;1, bootstrap\u0026thinsp;=\u0026thinsp;True\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eTabNet\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eoptimizer_fn\u0026thinsp;=\u0026thinsp;torch.optim.Adam, optimizer_params={\u0026lsquo;lr\u0026rsquo;: 5e-4}, scheduler_fn\u0026thinsp;=\u0026thinsp;torch.optim.lr_scheduler.StepLR, scheduler_params={\u0026lsquo;step_size\u0026rsquo;:10, \u0026lsquo;gamma\u0026rsquo;:0.9}, mask_type=\u0026lsquo;entmax\u0026rsquo;, max_epochs\u0026thinsp;=\u0026thinsp;100, patience\u0026thinsp;=\u0026thinsp;100, batch_size\u0026thinsp;=\u0026thinsp;256, drop_last\u0026thinsp;=\u0026thinsp;False\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eLogistic Regression\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003epenalty=\u0026lsquo;l2\u0026rsquo;, C\u0026thinsp;=\u0026thinsp;1.0, solver=\u0026lsquo;lbfgs\u0026rsquo;, max_iter\u0026thinsp;=\u0026thinsp;100, multi_class=\u0026lsquo;auto\u0026rsquo;\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eAdaBoost\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003en_estimators\u0026thinsp;=\u0026thinsp;50, learning_rate\u0026thinsp;=\u0026thinsp;1.0, algorithm=\u0026lsquo;SAMME.R\u0026rsquo;\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eLinear Discriminant Analysis\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003esolver=\u0026lsquo;svd\u0026rsquo;\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" 
colname=\"c1\"\u003e\u003cp\u003eDecision Tree\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003ecriterion=\u0026lsquo;gini\u0026rsquo;, splitter=\u0026lsquo;best\u0026rsquo;, max_depth\u0026thinsp;=\u0026thinsp;None, min_samples_split\u0026thinsp;=\u0026thinsp;2, min_samples_leaf\u0026thinsp;=\u0026thinsp;1\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/colgroup\u003e\u003c/table\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab4\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 3\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003ePerformance Metrics of the proposed method using ML Models combined with CTGAN.\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"8\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c7\" colnum=\"7\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c8\" colnum=\"8\"\u003e\u003c/div\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u003cp\u003eModel\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u003cp\u003eAccuracy\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" 
colname=\"c3\"\u003e\u003cp\u003eAUROC\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c4\"\u003e\u003cp\u003ePrecision\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c5\"\u003e\u003cp\u003eRecall\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c6\"\u003e\u003cp\u003eF1 Score\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c7\"\u003e\u003cp\u003eMCC\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c8\"\u003e\u003cp\u003eGeometric Mean\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eGradient Boosting\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e93.30%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e93.26%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e93.63%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e93.26%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e93.28%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e\u003cp\u003e86.90%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e\u003cp\u003e93.17%\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eLight Gradient Boosting Machine\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e93.11%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e93.08%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e93.48%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e93.08%\u003c/p\u003e\u003c/td\u003e\u003ctd 
align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e93.09%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e\u003cp\u003e86.55%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e\u003cp\u003e92.97%\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eXGBoost\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e92.87%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e92.84%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e93.14%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e92.84%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e92.85%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e\u003cp\u003e85.97%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e\u003cp\u003e92.76%\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eRandom Forest\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e92.56%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e92.53%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e92.82%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e92.53%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e92.54%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e\u003cp\u003e85.35%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" 
colname=\"c8\"\u003e\u003cp\u003e92.45%\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eTab Net\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e92.53%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e92.52%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e93.07%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e92.52%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e92.51%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e\u003cp\u003e85.58%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e\u003cp\u003e92.35%\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eLogistic Regression\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e91.51%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e91.46%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e92.48%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e91.46%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e91.46%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e\u003cp\u003e83.93%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e\u003cp\u003e91.15%\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eAdaBoost\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" 
colname=\"c2\"\u003e\u003cp\u003e90.16%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e90.14%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e90.34%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e90.14%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e90.14%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e\u003cp\u003e80.47%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e\u003cp\u003e90.08%\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eLinear Discriminant Analysis\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e89.91%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e89.84%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e91.61%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e89.84%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e89.80%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e\u003cp\u003e81.44%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e\u003cp\u003e89.28%\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eDecision Tree\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e87.76%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e87.77%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e87.77%\u003c/p\u003e\u003c/td\u003e\u003ctd 
align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e87.77%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e87.76%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e\u003cp\u003e75.54%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e\u003cp\u003e87.76%\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e\u003cb\u003eMLP\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e\u003cb\u003e93.91%\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e\u003cb\u003e93.87%\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e\u003cb\u003e94.48%\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e\u003cb\u003e93.87%\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e\u003cb\u003e93.89%\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e\u003cp\u003e\u003cb\u003e88.34%\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e\u003cp\u003e\u003cb\u003e93.71%\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/colgroup\u003e\u003c/table\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003eThe experimental results demonstrate a clear ranking in model performance for diabetes prediction. The MLP model emerged as the top performer across all metrics, achieving the highest accuracy (93.91%), AUROC (93.87%), and F1 Score (93.89%). 
This suggests that neural networks are effective at capturing complex patterns within the dataset, although the margin over Gradient Boosting (93.30% accuracy) is below one percentage point.\u003c/p\u003e\u003cp\u003eFollowing closely, Gradient Boosting, LightGBM, and XGBoost maintained competitive results, all exceeding 92% on every metric except MCC, which ranged from 85.97% to 86.90%. These boosting-based methods benefit from sequential learning and error correction, making them well-suited for medical classification tasks with imbalanced or noisy data.\u003c/p\u003e\u003cp\u003eRandom Forest and TabNet also performed reliably, scoring between roughly 92.3% and 93.1% on every metric except MCC (85.35% and 85.58%, respectively). While Random Forest provides robustness and interpretability, TabNet combines deep learning with attention mechanisms tailored for tabular data, maintaining accuracy while offering explainability.\u003c/p\u003e\u003cp\u003eLogistic Regression showed moderate success with 91.51% accuracy, indicating that linear models still hold value in medical contexts due to their simplicity and transparency.\u003c/p\u003e\u003cp\u003eAdaBoost and LDA delivered decent but lower results (90.16% and 89.91% accuracy, respectively). While effective in some structured data scenarios, they were less competitive than the gradient boosting and neural network models.\u003c/p\u003e\u003cp\u003eDecision Tree, when used individually, recorded the lowest performance (87.76% accuracy). Its simplicity and tendency to overfit indicate that it should preferably be used within ensemble methods for better generalization.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec27\" class=\"Section2\"\u003e\u003ch2\u003e4.6. Comparison with State-of-the-Art\u003c/h2\u003e\u003cp\u003eThe use of body composition data for diabetes prediction is a relatively new field, with only a few published studies addressing this approach. To our knowledge, three studies, including the present one, have used the same dataset of 4,661 participants to develop machine learning-based diagnostic models for diabetes. 
Table\u0026nbsp;\u003cspan refid=\"Tab5\" class=\"InternalRef\"\u003e4\u003c/span\u003e summarizes the comparative results of all three approaches. Also, Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003e illustrates the performance comparison of different methods.\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab5\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 4\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003ePerformance comparison of diabetes prediction methods on the same dataset (4,661 samples).\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"6\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u003cp\u003eStudy\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u003cp\u003eMethod\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c3\"\u003e\u003cp\u003eAccuracy (%)\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c4\"\u003e\u003cp\u003ePrecision (%)\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c5\"\u003e\u003cp\u003eRecall (%)\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c6\"\u003e\u003cp\u003eF1-score 
(%)\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eNematollahi et al., [\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e] (2024)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eFeature Selection\u0026thinsp;+\u0026thinsp;XGBoost\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e89.96\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e90.20\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e89.65\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e89.91\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eNematollahi et al., [\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e] (2025)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eANOVA\u0026thinsp;+\u0026thinsp;ADASYN\u0026thinsp;+\u0026thinsp;XGBoost\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e92.04\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e92.30\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e92.10\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e92.10\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e\u003cb\u003eOur Method\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e\u003cb\u003eCTGAN\u0026thinsp;+\u0026thinsp;MLP\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" 
colname=\"c3\"\u003e\u003cp\u003e\u003cb\u003e93.91\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e\u003cb\u003e94.48\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e\u003cp\u003e\u003cb\u003e93.87\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e\u003cp\u003e\u003cb\u003e93.89\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/colgroup\u003e\u003c/table\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003eThe first study [\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e] employed a basic feature selection process followed by the XGBoost algorithm, reporting an accuracy of 89.96%. While effective to some extent, this method lacked comprehensive preprocessing and did not incorporate any oversampling strategies to address class imbalance.\u003c/p\u003e\u003cp\u003eThe second study, by Nematollahi et al., introduced a more refined hybrid model combining ANOVA for feature selection with the ADASYN oversampling technique, alongside the XGBoost classifier [\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e]. Their method achieved better performance across all evaluation metrics, with an accuracy of 92.04%, precision of 92.30%, recall of 92.10%, and F1-score of 92.10%.\u003c/p\u003e\u003cp\u003eIn contrast, our proposed method incorporates a more extensive preprocessing pipeline and leverages augmentation strategies to improve generalization. Furthermore, we evaluate multiple machine learning models with tuned hyperparameters under a rigorous validation framework. This results in the highest performance across all metrics: accuracy of 93.91%, precision of 94.48%, recall of 93.87%, and F1-score of 93.89%.\u003c/p\u003e\u003c/div\u003e"},{"header":"5. 
Conclusion and future work","content":"\u003cp\u003eIn conclusion, this research highlights the potential of ML and DL models for assessing diabetes from body composition data. A systematic preprocessing pipeline, including data cleaning, feature transformation, and CTGAN-based augmentation, supplied the classification models with higher-quality training data and improved their generalizability and accuracy. MLP outperformed the other models on every reported metric, including accuracy (93.91%), precision (94.48%), recall (93.87%), and F1-score (93.89%), demonstrating its effectiveness in capturing intricate relationships in the dataset. The findings underscore the potential of ML approaches in enhancing diabetes prediction. However, deploying the model in practice requires further work. Integrating clinical parameters such as blood glucose levels, HbA1c, family history, lifestyle factors, and genetic markers alongside the body composition features considered here could improve diagnostic accuracy. Applying the models to real-world clinical datasets from diverse populations and geographical locations will help validate and generalize the findings. A software user interface would make the proposed method easier for clinicians to use routinely.\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003ch2\u003eConflict of Interest:\u003c/h2\u003e\u003cp\u003eThe authors declare that they have no conflict of interest.\u003c/p\u003e\u003c/p\u003e\u003cp\u003e\u003cstrong\u003eEthical Approval:\u003c/strong\u003e\u003cp\u003eThis study is based on a subset of the Fasa Cohort Study conducted in Iran. The cohort study was approved by the Ethics Committee of Shiraz University of Medical Sciences. 
All participants provided written informed consent prior to data collection.\u003c/p\u003e\u003c/p\u003e\u003ch2\u003eFunding:\u003c/h2\u003e\u003cp\u003eThis research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.\u003c/p\u003e\u003ch2\u003eAuthor Contribution\u003c/h2\u003e\u003cp\u003eConceptualization: J.H.J, M.M, and M.A.N: Formal Analysis and Methodology, M.M: Original Draft, K.N.R, K.R.R., and S.P.V: Editing. J.H.J, P.K, and M.A.N: Supervised by P.K. All authors approved the final manuscript.\u003c/p\u003e\u003ch2\u003eAcknowledgement\u003c/h2\u003e\u003cp\u003eDear Editor, I would like to sincerely thank the editor for their time and effort in handling our manuscript. I would also like to mention that I have been serving as a reviewer for Scientific Reports for several years, and I have previously published multiple articles in this journal. I deeply value the mission and standards of Scientific Reports, and I hope to continue contributing to its success. Best regards, Javad Hassannataj Joloudari, Corresponding author\u003c/p\u003e\u003ch2\u003eData Availability\u003c/h2\u003e\u003cp\u003eThe dataset analyzed in this study is a subset of the Fasa Cohort Study conducted in Iran. Due to privacy regulations and ethical considerations, the data cannot be publicly shared. However, researchers may request access to the dataset from the corresponding author, subject to appropriate ethical approvals and institutional agreements.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eMayya, V. et al. Need for an Artificial Intelligence-based Diabetes Care Management System in India and the United States. \u003cem\u003eHealth Serv. Res. Managerial Epidemiol.\u003c/em\u003e \u003cb\u003e11\u003c/b\u003e, 23333928241275292 (2024).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eKatsarou, A. et al. Type 1 diabetes mellitus. \u003cem\u003eNat. 
reviews Disease primers\u003c/em\u003e. \u003cb\u003e3\u003c/b\u003e (1), 1\u0026ndash;17 (2017).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eZhong, Z., Li, J., Clausi, D. A. \u0026amp; Wong, A. Generative adversarial networks and conditional random fields for hyperspectral image classification. \u003cem\u003eIEEE Trans. cybernetics\u003c/em\u003e. \u003cb\u003e50\u003c/b\u003e (7), 3318\u0026ndash;3329 (2019).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eDeFronzo, R. A. et al. Type 2 diabetes mellitus. \u003cem\u003eNat. reviews Disease primers\u003c/em\u003e. \u003cb\u003e1\u003c/b\u003e (1), 1\u0026ndash;22 (2015).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eMcIntyre, H. D. et al. Gestational diabetes mellitus. \u003cem\u003eNat. reviews Disease primers\u003c/em\u003e. \u003cb\u003e5\u003c/b\u003e (1), 47 (2019).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eCamacho, M., Atehort\u0026uacute;a, A., Wilkinson, T., Gkontra, P. \u0026amp; Lekadir, K. Low-cost predictive models of dementia risk using machine learning and exposome predictors. \u003cem\u003eHealth Technol.\u003c/em\u003e \u003cb\u003e15\u003c/b\u003e (2), 355\u0026ndash;365 (2025).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eSun, H. et al. IDF Diabetes Atlas: Global, regional and country-level diabetes prevalence estimates for 2021 and projections for 2045. \u003cem\u003eDiabetes Res. Clin. Pract.\u003c/em\u003e \u003cb\u003e183\u003c/b\u003e, 109119 (2022).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eRoglic, G. WHO Global report on diabetes: A summary. \u003cem\u003eInt. J. Noncommunicable Dis.\u003c/em\u003e \u003cb\u003e1\u003c/b\u003e (1), 3\u0026ndash;8 (2016).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eKatiyar, N., Thakur, H. K. \u0026amp; Ghatak, A. 
Recent advancements using machine learning \u0026amp; deep learning approaches for diabetes detection: a systematic review, \u003cem\u003ee-Prime-Advances in Electrical Engineering, Electronics and Energy\u003c/em\u003e, p. 100661, (2024).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eNimmagadda, S. M., Suryanarayana, G., Kumar, G. B., Anudeep, G. \u0026amp; Sai, G. V. A Comprehensive Survey on Diabetes Type-2 (T2D) Forecast Using Machine Learning. \u003cem\u003eArch. Comput. Methods Eng.\u003c/em\u003e \u003cb\u003e31\u003c/b\u003e (5), 2905\u0026ndash;2923 (2024).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eWee, B. F., Sivakumar, S., Lim, K. H., Wong, W. K. \u0026amp; Juwono, F. H. Diabetes detection based on machine learning and deep learning approaches. \u003cem\u003eMultimedia Tools Appl.\u003c/em\u003e \u003cb\u003e83\u003c/b\u003e (8), 24153\u0026ndash;24185 (2024).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eMasson, G., Morais, F., Rocha, E. \u0026amp; Endo, P. T. Evaluation of artificial intelligence models for predicting low birth weight using Brazilian real data. \u003cem\u003eHealth Technol.\u003c/em\u003e \u003cb\u003e15\u003c/b\u003e (1), 169\u0026ndash;184 (2025).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eNematollahi, M. A. et al. Evolution of diabetes prediction using the fusion of ANOVA, ADASYN technique and XGBoost based on body composition data. \u003cem\u003eJ. Diabetes Metabolic Disorders\u003c/em\u003e. \u003cb\u003e24\u003c/b\u003e (2), 1\u0026ndash;11 (2025).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eGuariguata, L., Whiting, D., Weil, C. \u0026amp; Unwin, N. The International Diabetes Federation diabetes atlas methodology for estimating global and national prevalence of diabetes in adults. \u003cem\u003eDiabetes Res. Clin. 
Pract.\u003c/em\u003e \u003cb\u003e94\u003c/b\u003e (3), 322\u0026ndash;332 (2011).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eEscobedo-de la Pe\u0026ntilde;a, J., Ram\u0026iacute;rez-Hern\u0026aacute;ndez, J. A., Fern\u0026aacute;ndez-Ramos, M. T., Gonz\u0026aacute;lez-Figueroa, E. \u0026amp; Champagne, B. Body fat percentage rather than body mass index related to the high occurrence of type 2 diabetes. \u003cem\u003eArch. Med. Res.\u003c/em\u003e \u003cb\u003e51\u003c/b\u003e (6), 564\u0026ndash;571 (2020).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eLee, B. J. \u0026amp; Kim, J. Y. Identification of type 2 diabetes risk factors using phenotypes consisting of anthropometry and triglycerides based on machine learning. \u003cem\u003eIEEE J. biomedical health Inf.\u003c/em\u003e \u003cb\u003e20\u003c/b\u003e (1), 39\u0026ndash;46 (2015).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eZhu, A. et al. Correlation of abdominal fat distribution with different types of diabetes in a Chinese population, \u003cem\u003eJournal of Diabetes Research\u003c/em\u003e, vol. no. 1, p. 651462, 2013. (2013).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eElgendy, I. A., Hosny, M., Albashrawi, M. A. \u0026amp; Alsenan, S. Dual-stage explainable ensemble learning model for diabetes diagnosis. \u003cem\u003eExpert Syst. Appl.\u003c/em\u003e \u003cb\u003e274\u003c/b\u003e, 126899 (2025).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eDutta, A. et al. Early prediction of diabetes using an ensemble of machine learning models. \u003cem\u003eInt. J. Environ. Res. Public Health\u003c/em\u003e. \u003cb\u003e19\u003c/b\u003e (19), 12378 (2022).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eNaseem, A. et al. Novel Internet of Things based approach toward diabetes prediction using deep learning models. \u003cem\u003eFront. Public. Health\u003c/em\u003e. 
\u003cb\u003e10\u003c/b\u003e, 914106 (2022).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eNematollahi, M. A. et al. A cohort study on the predictive capability of body composition for Diabetes Mellitus using machine learning. \u003cem\u003eJ. Diabetes Metabolic Disorders\u003c/em\u003e. \u003cb\u003e23\u003c/b\u003e (1), 773\u0026ndash;781 (2024).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eReza, M. S., Hafsha, U., Amin, R., Yasmin, R. \u0026amp; Ruhi, S. Improving SVM performance for type II diabetes prediction with an improved non-linear kernel: Insights from the PIMA dataset. \u003cem\u003eComput. Methods Programs Biomed. Update\u003c/em\u003e. \u003cb\u003e4\u003c/b\u003e, 100118 (2023).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eZhang, Z. et al. A novel evolutionary ensemble prediction model using harmony search and stacking for diabetes diagnosis. \u003cem\u003eJ. King Saud University-Computer Inform. Sci.\u003c/em\u003e \u003cb\u003e36\u003c/b\u003e (1), 101873 (2024).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eCheng, H., Zhu, J., Li, P. \u0026amp; Xu, H. Combining knowledge extension with convolution neural network for diabetes prediction. \u003cem\u003eEng. Appl. Artif. Intell.\u003c/em\u003e \u003cb\u003e125\u003c/b\u003e, 106658 (2023).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eMushtaq, Z. et al. Voting Classification-Based Diabetes Mellitus Prediction Using Hypertuned Machine‐Learning Techniques, \u003cem\u003eMobile Information Systems\u003c/em\u003e, vol. no. 1, p. 6521532, 2022. (2022).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eNurzari, I., Sari, E., Harris, D. I., Priyatno, A. M. \u0026amp; Rusnedy, H. Inter-Cluster Distance-Based SMOTE Modification for Enhanced Diabetes Classification, \u003cem\u003eITEGAM-JETIA\u003c/em\u003e, vol. 11, no. 51, pp. 
190\u0026ndash;196, (2025).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eLu, H., Uddin, S., Hajati, F., Moni, M. A. \u0026amp; Khushi, M. A patient network-based machine learning model for disease prediction: The case of type 2 diabetes mellitus. \u003cem\u003eAppl. Intell.\u003c/em\u003e \u003cb\u003e52\u003c/b\u003e (3), 2411\u0026ndash;2422 (2022).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eFarjam, M. et al. A cohort study protocol to analyze the predisposing factors to common chronic non-communicable diseases in rural areas: Fasa Cohort Study. \u003cem\u003eBMC Public Health\u003c/em\u003e. \u003cb\u003e16\u003c/b\u003e, 1\u0026ndash;8 (2016).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eRavuri, S. \u0026amp; Vinyals, O. Classification accuracy score for conditional generative models. \u003cem\u003eAdv. Neural Inf. Process. Syst.\u003c/em\u003e \u003cb\u003e32\u003c/b\u003e (2019).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eSteinberg, D. CART: classification and regression trees, in The top ten algorithms in data mining: Chapman and Hall/CRC, 193\u0026ndash;216. (2009).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eRuggieri, S. Efficient C4.5 [classification algorithm]. \u003cem\u003eIEEE Trans. Knowl. Data Eng.\u003c/em\u003e \u003cb\u003e14\u003c/b\u003e (2), 438\u0026ndash;444 (2002).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eHuang, H., Lin, T. K. \u0026amp; Ngui, P. Analysing a mental health survey by chi-squared automatic interaction detection, \u003cem\u003eAnnals of The Academy of Medicine, Singapore\u003c/em\u003e, vol. 22, no. 3, pp. 332\u0026ndash;337, (1993).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eLee, S. \u0026amp; Park, I. Application of decision tree model for the ground subsidence hazard mapping near abandoned underground coal mines. \u003cem\u003eJ. Environ. 
Manage.\u003c/em\u003e \u003cb\u003e127\u003c/b\u003e, 166\u0026ndash;176 (2013).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eTuysuzoglu, G. et al. Joint Tomek Links (JTL): An Innovative Approach to Noise Reduction for Enhanced Classification Performance. \u003cem\u003eIEEE Access\u003c/em\u003e, (2025).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eLu, M., Sadiq, S., Feaster, D. J. \u0026amp; Ishwaran, H. Estimating individual treatment effect in observational data using random forest methods. \u003cem\u003eJ. Comput. Graphical Stat.\u003c/em\u003e \u003cb\u003e27\u003c/b\u003e (1), 209\u0026ndash;219 (2018).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eOshiro, T. M., Perez, P. S. \u0026amp; Baranauskas, J. A. How many trees in a random forest? in \u003cem\u003eMachine Learning and Data Mining in Pattern Recognition: 8th International Conference, MLDM 2012, Berlin, Germany, July 13\u0026ndash;20, 2012. Proceedings 8\u003c/em\u003e, : Springer, pp. 154\u0026ndash;168. (2012).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eBender, R. \u0026amp; Grouven, U. Ordinal logistic regression in medical research. \u003cem\u003eJ. R. Coll. Physicians Lond.\u003c/em\u003e \u003cb\u003e31\u003c/b\u003e (5), 546 (1997).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eDevika, S., Jeyaseelan, L. \u0026amp; Sebastian, G. Analysis of sparse data in logistic regression in medical research: A newer approach. \u003cem\u003eJ. Postgrad. Med.\u003c/em\u003e \u003cb\u003e62\u003c/b\u003e (1), 26\u0026ndash;31 (2016).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eBent\u0026eacute;jac, C., Cs\u0026ouml;rgő, A. \u0026amp; Mart\u0026iacute;nez-Mu\u0026ntilde;oz, G. A comparative analysis of gradient boosting algorithms. \u003cem\u003eArtif. Intell. Rev.\u003c/em\u003e \u003cb\u003e54\u003c/b\u003e, 1937\u0026ndash;1967 (2021).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eSai, M. J. et al. 
An ensemble of Light Gradient Boosting Machine and adaptive boosting for prediction of type-2 diabetes. \u003cem\u003eInt. J. Comput. Intell. Syst.\u003c/em\u003e \u003cb\u003e16\u003c/b\u003e (1), 14 (2023).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eDong, W., Huang, Y., Lehane, B. \u0026amp; Ma, G. XGBoost algorithm-based prediction of concrete electrical resistivity for structural health monitoring. \u003cem\u003eAutom. Constr.\u003c/em\u003e \u003cb\u003e114\u003c/b\u003e, 103155 (2020).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eWang, H. et al. Enhancing predictive accuracy for urinary tract infections post-pediatric pyeloplasty with explainable AI: an ensemble TabNet approach. \u003cem\u003eSci. Rep.\u003c/em\u003e \u003cb\u003e15\u003c/b\u003e (1), 2455 (2025).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eWang, C., Xu, S. \u0026amp; Yang, J. Adaboost algorithm in artificial intelligence for optimizing the IRI prediction accuracy of asphalt concrete pavement, \u003cem\u003eSensors\u003c/em\u003e, vol. 21, no. 17, p. 5682, (2021).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eZhu, F., Gao, J., Yang, J. \u0026amp; Ye, N. Neighborhood linear discriminant analysis. 
\u003cem\u003ePattern Recogn.\u003c/em\u003e \u003cb\u003e123\u003c/b\u003e, 108422 (2022).\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":false,"highlight":"","institution":"","isAcceptedByJournal":true,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"scientific-reports","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"scirep","sideBox":"Learn more about [Scientific Reports](http://www.nature.com/srep/)","snPcode":"","submissionUrl":"","title":"Scientific Reports","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"stoa","reportingPortfolio":"Scientific Reports","inReviewEnabled":true,"inReviewRevisionsEnabled":true},"keywords":"Diabetes prediction, Body composition data, CTGAN, Machine learning, Deep learning, Multilayer perceptron","lastPublishedDoi":"10.21203/rs.3.rs-7344799/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-7344799/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eAccurately assessing the risk of diabetes is essential for early intervention and effective management. This study explores the potential of Machine Learning (ML) and Deep Learning (DL) models to analyze body composition measurements as predictors for diabetes screening. We begin by carefully preprocessing the dataset, handling missing values, encoding categorical variables, and classifying features to prepare the data for modeling. 
To enhance the dataset and improve model generalization, we apply Conditional Tabular Generative Adversarial Networks (CTGAN) for data augmentation. The dataset is then split using stratified five-fold cross-validation to ensure balanced and reliable evaluation. We evaluate ten models, namely Multilayer Perceptron (MLP), Gradient Boosting, Random Forest, Logistic Regression, Decision Tree, LightGBM, TabNet, XGBoost, AdaBoost, and Linear Discriminant Analysis (LDA). The proposed approach, which integrates CTGAN-based augmentation with these diverse models, achieves strong predictive results. Among the models tested, MLP stands out with the best performance, reaching an accuracy of 93.91%. Other metrics also confirm its strength: AUROC at 93.87%, precision at 94.48%, recall at 93.87%, F1 score at 93.89%, Matthews Correlation Coefficient at 88.34%, and geometric mean at 93.71%. These results demonstrate that our combined methodology effectively captures complex relationships within body composition data and offers a reliable tool to support clinical decision-making in diabetes risk assessment. 
Future work may integrate additional clinical parameters to further enhance prediction accuracy and applicability in real-world settings.\u003c/p\u003e","manuscriptTitle":"Predicting Diabetes Mellitus using Conditional Tabular Generative Adversarial Networks combined with MLP based on Body Composition Data","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2025-09-12 12:38:26","doi":"10.21203/rs.3.rs-7344799/v1","editorialEvents":[{"type":"communityComments","content":0},{"type":"decision","content":"Revision requested","date":"2025-09-15T07:00:07+00:00","index":"","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2025-09-13T04:15:48+00:00","index":"hide","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2025-09-12T06:22:46+00:00","index":"hide","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2025-09-10T18:11:11+00:00","index":"hide","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2025-09-09T12:26:36+00:00","index":"hide","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2025-09-08T01:40:58+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"42321657097363651862533613889360560702","date":"2025-09-07T18:51:21+00:00","index":"hide","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2025-09-07T17:05:27+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"96661051694828078445533263153689787605","date":"2025-09-07T13:58:56+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"134528751620857147541866135145930140998","date":"2025-09-07T13:31:43+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"249931536279958236346088344423165678133","date":"2025-09-07T13:28:02+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"259361721627423025307053076481229566089","date":"2025-09-07T12:32:29+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"474931117
39155900201768865026982301038","date":"2025-09-07T12:18:27+00:00","index":"hide","fulltext":""},{"type":"reviewersInvited","content":"","date":"2025-09-07T12:16:29+00:00","index":"","fulltext":""},{"type":"editorAssigned","content":"","date":"2025-09-02T11:30:13+00:00","index":"","fulltext":""},{"type":"editorInvited","content":"","date":"2025-08-22T13:41:10+00:00","index":"","fulltext":""},{"type":"checksComplete","content":"","date":"2025-08-21T05:09:00+00:00","index":"","fulltext":""},{"type":"submitted","content":"Scientific Reports","date":"2025-08-19T22:28:03+00:00","index":"","fulltext":""}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"scientific-reports","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"scirep","sideBox":"Learn more about [Scientific Reports](http://www.nature.com/srep/)","snPcode":"","submissionUrl":"","title":"Scientific Reports","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"stoa","reportingPortfolio":"Scientific Reports","inReviewEnabled":true,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"dcfbad3e-ef10-47c1-8204-54ceeedcce32","owner":[],"postedDate":"September 12th, 2025","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"published-in-journal","subjectAreas":[{"id":54527541,"name":"Biological sciences/Computational biology and bioinformatics"},{"id":54527542,"name":"Health sciences/Diseases"},{"id":54527543,"name":"Health sciences/Health care"},{"id":54527544,"name":"Physical sciences/Mathematics and computing"},{"id":54527545,"name":"Health sciences/Medical research"}],"tags":[],"updatedAt":"2025-12-15T16:12:25+00:00","versionOfRecord":{"articleIdentity":"rs-7344799","link":"https://doi.org/10.1038/s41598-025-31928-9","journal":{"identity":"scientific-reports","isVorOnly":false,"title":"Scientific Reports"},"publishedOn":"2025-12-10 
15:58:59","publishedOnDateReadable":"December 10th, 2025"},"versionCreatedAt":"2025-09-12 12:38:26","video":"","vorDoi":"10.1038/s41598-025-31928-9","vorDoiUrl":"https://doi.org/10.1038/s41598-025-31928-9","workflowStages":[]},"version":"v1","identity":"rs-7344799","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-7344799","identity":"rs-7344799","version":["v1"]},"buildId":"8U1c8b4HqxoKbykW_rLl7","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}

