HybGANN: A Hybrid GAN-GA-ANN Framework for Predicting Diabetes from Imbalanced Medical Data

Nora PireciSejdiu, Blagoj Ristevski

Method Article, Research Square. This is a preprint; it has not been peer reviewed by a journal. https://doi.org/10.21203/rs.3.rs-7300855/v1. This work is licensed under a CC BY 4.0 License. Status: Posted. Version 1.

Abstract

The digitization of medical data has enabled large-scale analysis. However, clinical datasets, such as those used for diabetes prediction, often have class imbalances, with disease cases significantly underrepresented. This imbalance poses a major challenge for traditional machine learning models, which tend to favor the majority classes. In addition, many high-performance models operate as black boxes, limiting their adoption in clinical practice due to their lack of interpretability.
In this paper, we present HybGANN, a novel hybrid framework that integrates a Conditional Tabular Generative Adversarial Network (CTGAN) for synthetic minority data generation, a hybrid genetic algorithm (GA) that co-evolves the hyperparameters and internal weights of an artificial neural network (ANN) in a Lamarckian fashion, and SHapley Additive exPlanations (SHAP) for post-hoc model interpretability. To the best of our knowledge, this is the first application of a Lamarckian GA to the joint optimization of node weights and hyperparameters in tabular medical data classification. HybGANN creates a semi-automated workflow that improves predictive performance while providing transparency and adaptability. Applied to a large-scale diabetes dataset, HybGANN outperforms a baseline ANN trained on the same CTGAN-balanced dataset on all key classification metrics. The framework achieves a ROC-AUC value of 0.9184 and a PR-AUC value of 0.9268, demonstrating its effectiveness and potential as a reliable AI solution for clinical decision support in imbalanced medical fields.

Keywords: CTGAN, Lamarckian GA, ANN, SHAP, Medical data mining

1. Introduction

In the medical field, the prediction of chronic diseases poses a major challenge because medical data are often imbalanced and contain a large and diverse set of features. This directly affects the performance of machine learning (ML) algorithms and ANNs, limiting their ability to accurately identify clinical cases, especially those belonging to the minority class, i.e. patients with positive diagnoses. In this context, advanced data balancing techniques are essential to obtain more accurate and reliable results [1].
In the literature, several methods have been proposed to address class imbalance. These include data balancing techniques such as SMOTE, SMOTETomek and ADASYN [1, 7, 8], while more recent developments emphasize generative models, such as Generative Adversarial Networks (GANs), that generate synthetic samples realistically reflecting the original data distribution. The best-known GAN models for tabular data include WGAN-GP, TabFairGAN and especially CTGAN (Conditional Tabular GAN), which is designed to handle tabular data that mix categorical and numeric features while preserving the complex relationships between them [9, 10, 11, 12, 13, 14].

In parallel, the optimization of ANNs using evolutionary algorithms, especially genetic algorithms (GAs) [3], has demonstrated high efficiency in automatically determining hyperparameters and initial weight configurations, improving the convergence process and the final performance of the models. However, most previous studies apply GAs to hyperparameter selection or feature selection, while the full optimization of weights and hyperparameters within the same process is still an emerging field [4, 5, 6, 27].

In this paper, we present a novel model termed HybGANN (Hybrid GA with ANN using GAN), a hybrid approach in which a GA optimises the weights and tunes the hyperparameters of an ANN in parallel during training. Combined with advanced data augmentation for imbalanced medical datasets, this improves the prediction of diabetes in the early stages of the disease. This Lamarckian approach, in which the weight updates and hyperparameter changes obtained during training are passed on to future generations, has not previously been used in the literature for diabetes prediction with imbalanced data.
The model was trained on a large-scale real-world diabetes prediction dataset from the UCI Machine Learning Repository [35], which contains over 250,000 samples with an imbalance ratio of approximately 6:1 in favor of the negative (non-diabetic) class. To address this imbalance, CTGAN is used to generate synthetic samples only for the minority class, creating a balanced dataset. The quality of the synthetic data is assessed visually using Principal Component Analysis (PCA) and statistically using the Jensen-Shannon Divergence (JSD), which confirm a close approximation to the original data [22].

Empirical results show that HybGANN significantly outperforms a baseline neural network trained on the same CTGAN-balanced data without evolutionary optimization. Improvements are observed in all key metrics, especially in recall and F1 score, which are crucial in medical applications to minimize errors in the classification of patients with diabetes [2]. Furthermore, interpretability analysis with SHapley Additive exPlanations (SHAP) demonstrates that the model is not only accurate but also transparent, identifying the most important features that affect its decisions.

The main contributions of our work are as follows:

- We propose HybGANN, a novel hybrid framework that integrates CTGAN for class balancing and a GA for simultaneous optimization of neural network weights and hyperparameter tuning. To the best of our knowledge, this two-level GA optimization approach has not been previously applied to diabetes prediction tasks.
- We demonstrate that integrating GA at both levels - optimizing internal network weights and tuning critical hyperparameters such as learning rate and batch size - results in significantly better predictive performance compared to a traditional ANN trained on CTGAN-balanced data.
- We undertake an extensive evaluation on a large and highly imbalanced diabetes dataset, achieving improved performance in multiple metrics, especially in terms of recall and F1 score, which are essential in medical diagnostics where accurate identification of diabetic patients is crucial.
- A detailed comparison with the ANN baseline highlights the effectiveness of the HybGANN approach in reducing false negatives while maintaining a high true positive rate, making it more reliable for early diabetes detection.
- We support our findings through comprehensive visualization techniques, such as confusion matrices, ROC and PR curves, and SHAP summary plots, ensuring that our model is not only robust but also interpretable, transparent, and reusable for future research on imbalanced medical classification.

The remaining sections of this work are organized as follows. Section 2 discusses existing methods for handling imbalanced data and the use of GAs and neural networks in disease diagnosis. Section 3 explains the HybGANN model in detail, including the neural network architecture, the Lamarckian weight optimization mechanism and the hyperparameter selection process. Section 4 presents empirical results on the diabetes prediction dataset, comparing the performance of HybGANN with benchmark models and alternative data balancing methods. Section 5 addresses the practical implications of the findings and the limitations of the study, and provides suggestions for future research.

2. Related Work

Recent research has explored the integration of advanced ML, deep learning (DL) and generative models, showing improvements in disease diagnosis while addressing common challenges such as class imbalance, noise and a large number of features. In [28], the authors created a deep neural network for the detection of chronic kidney disease (CKD), using Recursive Feature Elimination (RFE) to select critical features such as Haemoglobin and serum level.
The 12-layer neural network was shown to outperform conventional classifiers such as Support Vector Machine, K-Nearest Neighbors, Logistic Regression, Random Forest and Naïve Bayes, achieving 100% accuracy.

Similarly, several studies have focused on diabetes prediction using hybrid models and data augmentation strategies. In [29], the authors proposed a new framework for diabetes classification. It uses SMOTEENN to balance the dataset and introduces a DCSGAN classifier, based on the GAN model for synthetic data generation, achieving an accuracy of 96.27%. This study also performed feature analysis using logistic regression and identified critical biomarkers such as glucose, body mass index (BMI), and Diabetes Pedigree Function.

Another contribution to diabetes prediction is the GLSTM model [30], which combines GAN-based tabular data augmentation with an LSTM network for classification. The study addressed the challenges associated with sensitive and spatial data by generating different synthetic sets using multiple GAN architectures (CTGAN, Vanilla GAN, Gaussian Copula GAN, etc.). The synthetic data showed a strong correlation (0.93) with the original data. Trained on both real and synthetic data, the GLSTM model achieved 97% accuracy, outperforming models trained solely on real data. This demonstrated the value of synthetic data in protecting patient privacy, keeping care activities private, and improving prediction performance.

In [31], the authors used an enhanced adaptive genetic algorithm (EAGA) to remove irrelevant features from medical datasets before passing the data to a Multilayer Perceptron (MLP) classifier. This GA adjusted the crossover and mutation probabilities and refined the fitness function for better feature selection. The EAGA-MLP model achieved a high accuracy of 97.76%, indicating the effectiveness of the model in different clinical scenarios.
The study in [32] proposed two hybrid approaches, ANN-GA and CART-GA, which integrate a GA into an ANN and into a Classification and Regression Tree (CART), respectively, to optimise their parameters. Tested on the Pima Indian Diabetes dataset, the CART-GA variant achieved superior performance, recording an accuracy of 96.05% in expectation validation and 93.42% in 10-fold validation. The use of the GA improved performance, showing the impact of optimisation on a predictive model.

GAN models have also been used effectively in other areas such as intrusion detection and cancer prediction. In [53], the challenge of imbalanced datasets in Intrusion Detection Systems (IDS) was addressed by using GANs to synthesize sparse attack data. ACGAN and ACGAN-SVM models were used to generate realistic attack samples, which were then combined with real datasets (NSL-KDD, CICIDS2017, etc.) to train classifiers such as Decision Trees, Random Forests, and SVM. The results showed that GAN-augmented datasets improved classifier performance more than traditional resampling techniques such as SMOTE-SVM or Tomek Links. ACGAN-SVM was particularly successful in filtering noisy synthetic data, thereby increasing detection accuracy.

In the field of cancer diagnosis, [34] proposed an improved Wasserstein GAN (WGAN) for generating synthetic minority samples in cancer gene expression datasets, which often suffer from severe class imbalance. The study replaced convolutional layers with deep fully connected networks to better fit numerical non-imaging gene expression data. Feature selection was performed through differential gene expression analysis before training the WGAN. Results on three RNA-seq datasets (breast, lung, and gastric cancer) showed that WGAN outperformed traditional methods such as random oversampling and SMOTE in all metrics, including precision, recall, F1 score, and AUC value.
This confirmed the ability of WGAN to generate meaningful synthetic samples that improve classification tasks.

3. Methodology

3.1. Data Preparation and GAN-Based Balancing

In this article, we used the "Diabetes Health Indicators" dataset from the UCI Machine Learning Repository (data ID: 891) [35]. This dataset contains 253,680 cases and 21 features, including demographic variables such as age, gender and educational level, behavioral attributes such as smoking, physical activity and alcohol consumption, and health indicators such as physical health, mental health, and BMI. The feature "Diabetes_binary" is the target: 1 indicates a positive case of diabetes and 0 a negative case.

The biggest challenge in this dataset is the class imbalance, with a ratio of approximately 6:1 in favour of the majority class. This imbalance can affect the performance of ML models, especially in accurately identifying minority-class cases, i.e. those that are diabetes positive. In medicine, detecting these cases matters more than detecting negative ones: it is generally considered better to classify a healthy patient as positive than to classify a sick patient as negative, leaving them without prompt and proper treatment.

3.1.1 Data Collection and Preprocessing

Data analysis is used to understand the distribution of features, identify correlations between variables and detect possible outliers. Features are then normalized to a common scale, which facilitates efficient convergence during model training and prevents features with larger scales from disproportionately affecting the learning process. Finally, the pre-processed data are divided into training (70%), validation (15%) and testing (15%) subsets. The split preserves the original class distribution in all subsets.
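The splitting and scaling steps above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the text does not name the normalization method, so min-max scaling is assumed here, and the helper names (`stratified_split`, `min_max_fit`, `min_max_apply`) and the toy labels are hypothetical rather than taken from the authors' code.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.70, 0.15, 0.15), seed=0):
    """Return (train, val, test) index lists that keep per-class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test subset
    return train, val, test


def min_max_fit(column):
    """Learn scaling bounds from the training values of one feature."""
    return min(column), max(column)


def min_max_apply(column, bounds):
    """Scale values with bounds learned on the training subset (assumed method)."""
    lo, hi = bounds
    span = (hi - lo) or 1.0  # guard against constant features
    return [(v - lo) / span for v in column]


# Toy labels with a 6:1 imbalance, mirroring the ratio described in the text.
labels = [0] * 600 + [1] * 100
train_idx, val_idx, test_idx = stratified_split(labels)
scaled = min_max_apply([0, 5, 10], min_max_fit([0, 5, 10]))
```

Fitting the scaling bounds on the training subset only, and reusing them for validation and test data, avoids leaking information from the held-out subsets into training.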
3.1.2 Tools and Technologies

The implementation was carried out using the Python programming language. Several libraries and frameworks were used to support the different stages of the process. TensorFlow and Keras were used to build and train Artificial Neural Network (ANN) models. Scikit-learn was used for data preprocessing and model evaluation through various metrics. NumPy and Pandas were used for efficient data manipulation and analysis. Matplotlib and Seaborn were used for results visualization. The DEAP library was used to implement genetic algorithms (GA), while SDV (Synthetic Data Vault) was used to generate synthetic data using GAN (Generative Adversarial Network) models.

3.1.3 Addressing Class Imbalance through Data Augmentation

Given the significant class imbalance, Conditional Tabular GAN (CTGAN) is implemented on the training set to generate synthetic samples for the minority class, as shown in Fig. 1. CTGAN is an extension of the classical GAN architecture for synthetic data generation, specifically designed for tabular data with a combination of categorical and numeric attributes. The model consists of two submodules: a generator that endeavours to create new synthetic examples that mimic the real data, and a discriminator that aims to distinguish between the real and generated examples [12]. CTGAN trains the generator conditionally, considering the distribution of categorical attributes.

Initially, categorical features such as smoker, physical activity, sex, education and income are specified and presented to the model for processing during training. This is a distinctive feature of CTGAN, as it handles categorical features using a conditional approach that preserves their original distribution and avoids creating unrealistic combinations. First, the number of samples to be generated to equalize the classes is determined, and then synthetic samples are generated only for the minority class.
The model is trained for 1000 epochs, during which it learns the distribution of the existing data. The synthetic data are selected and combined with the real data to create a new balanced data set.

To assess the quality of the synthetic data generated by CTGAN, a PCA projection onto the first two principal components was examined visually. In the PCA plot in Fig. 1 (a), a significant overlap between the real (blue) and synthetic (orange) data points is observed, suggesting that CTGAN correctly captured the common structure of the original feature space, including the complex relationships between categorical and continuous attributes [21]. To complement the visual analysis, the JSD between the real and synthetic data distributions was also calculated for each numerical variable. Most of the features yielded JSD scores below 0.1, indicating a high degree of similarity and confirming that the synthetic samples are statistically consistent with the real data. These findings validate the use of CTGAN-generated samples as reliable minority class complements for addressing class imbalance in subsequent classification tasks [12].

3.2 Development of the HybGANN Model

The main novelty of this research lies in the development of the HybGANN model, which is trained on GAN-balanced data and integrates a GA into ANN optimization. The GA optimizes ANN hyperparameters, such as learning rate, number of hidden layers, and activation functions, to improve model performance [18]. GAs are also used to determine the optimal initial connection weights for the ANN, which facilitates better convergence and predictive accuracy [19].

3.2.1 ANN Architecture

The predictive model is a feedforward ANN with the following structure:

- Input layer matching the number of features.
- Two hidden layers with 128 and 64 neurons, each followed by a ReLU (Rectified Linear Unit) activation function.
- Output layer with a single sigmoid neuron for binary classification.
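The layer structure listed above can be illustrated with a small dependency-free forward pass. This is a sketch, not the authors' TensorFlow/Keras model: it assumes 21 inputs (one per dataset feature, with no extra encoding), uses an arbitrary Gaussian initialization, and the helper names are hypothetical. It also counts the trainable parameters, which gives a feel for how long a flattened weight vector for this architecture would be under those assumptions.

```python
import math
import random

# Assumption: 21 inputs, one per dataset feature; hidden sizes are from the text.
LAYER_SIZES = [21, 128, 64, 1]


def init_layers(sizes, seed=0):
    """Create (weights, biases) per layer with small random weights (arbitrary choice)."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        weights = [[rng.gauss(0.0, 0.05) for _ in range(n_out)] for _ in range(n_in)]
        layers.append((weights, [0.0] * n_out))
    return layers


def forward(layers, x):
    """ReLU on hidden layers, sigmoid on the single output neuron."""
    activation = list(x)
    for depth, (weights, biases) in enumerate(layers):
        z = [
            sum(a * weights[i][j] for i, a in enumerate(activation)) + biases[j]
            for j in range(len(biases))
        ]
        if depth == len(layers) - 1:
            activation = [1.0 / (1.0 + math.exp(-v)) for v in z]
        else:
            activation = [max(0.0, v) for v in z]
    return activation[0]


def count_parameters(sizes):
    """Weights plus biases of every dense layer."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))


layers = init_layers(LAYER_SIZES)
probability = forward(layers, [0.5] * 21)   # predicted probability of the positive class
n_params = count_parameters(LAYER_SIZES)    # 2816 + 8256 + 65 = 11137
```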
Regularization techniques include:

- Dropout applied during final training to avoid overfitting.
- L2 regularization in dense layers.

ANN weights and hyperparameters are optimized using the GA described below.

3.2.2 GA for joint hyperparameter and weight optimization

A novelty of the HybGANN framework is the use of a hybrid GA to evolve the training hyperparameters and the internal weights of the ANN simultaneously. While previous studies typically apply GAs to tune hyperparameters or perform feature selection, our approach encodes the full set of ANN weights along with the key training parameters on a single chromosome, a design rarely addressed in the current literature for medical or tabular datasets. This allows the GA to act not only as a black-box hyperparameter tuner but as a true evolutionary learner, optimizing the model from initialization to architecture-level learning, a novel contribution in the field of neuroevolution for imbalanced tabular data [18, 19, 20].

Each individual in the GA population is a composite chromosome encoding:

- The flattened ANN weight vector (spanning all layers).
- Tensor shapes and sizes needed to reconstruct the original weight matrices.
- Learning rate ∈ [1e-5, 1e-2] (sampled log-uniformly).
- Batch size ∈ {15, 16, 17, 32, 64}.

This compact encoding allows the seamless integration of low-level neural parameters and high-level training dynamics into a single evolutionary process.

Another distinctive feature is the use of Lamarckian learning, in which each chromosome, after its fitness evaluation, is updated with the weights learned during its local training session [27]. This allows the learned representations to be inherited by offspring, accelerating convergence and improving the fitness of the population over time.
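The chromosome layout and the Lamarckian write-back described above can be sketched as follows. This is a minimal illustration, not the authors' DEAP implementation: the dictionary layout and the function names (`flatten`, `unflatten`, `make_chromosome`, `lamarckian_update`) are hypothetical, bias vectors can be treated as 1 x n matrices, and the toy matrices stand in for real layer weights. The learning-rate range and batch-size set are the ones listed in the text.

```python
import math
import random

BATCH_SIZES = [15, 16, 17, 32, 64]   # candidate set as listed in the text
LR_RANGE = (1e-5, 1e-2)              # learning-rate bounds from the text


def sample_learning_rate(rng):
    """Log-uniform sample: draw the exponent uniformly, then exponentiate."""
    lo, hi = (math.log10(v) for v in LR_RANGE)
    return 10 ** rng.uniform(lo, hi)


def flatten(matrices):
    """Concatenate 2-D weight matrices into one vector and record their shapes."""
    shapes = [(len(m), len(m[0])) for m in matrices]
    vector = [value for m in matrices for row in m for value in row]
    return vector, shapes


def unflatten(vector, shapes):
    """Rebuild the original matrices from the flat vector and stored shapes."""
    matrices, pos = [], 0
    for rows, cols in shapes:
        matrices.append(
            [vector[pos + r * cols: pos + (r + 1) * cols] for r in range(rows)]
        )
        pos += rows * cols
    return matrices


def make_chromosome(matrices, rng):
    """Composite chromosome: weights, shapes, learning rate and batch size."""
    vector, shapes = flatten(matrices)
    return {
        "weights": vector,
        "shapes": shapes,
        "learning_rate": sample_learning_rate(rng),
        "batch_size": rng.choice(BATCH_SIZES),
    }


def lamarckian_update(chromosome, trained_matrices):
    """Write locally trained weights back so offspring inherit them."""
    chromosome["weights"], _ = flatten(trained_matrices)
    return chromosome


rng = random.Random(42)
toy = [[[0.1, -0.2], [0.3, 0.4], [0.0, 0.5]], [[1.0], [-1.0]]]  # 3x2 and 2x1
chrom = make_chromosome(toy, rng)
restored = unflatten(chrom["weights"], chrom["shapes"])

# Pretend local training nudged every weight, then store the result back.
trained = [[[w + 0.01 for w in row] for row in m] for m in restored]
chrom = lamarckian_update(chrom, trained)
```

Storing the shapes next to the flat vector is what makes the round trip lossless, so crossover and mutation can operate on a plain list while the network is still reconstructed exactly for each fitness evaluation.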
The hybrid optimization strategy has several important implications for convergence and generalization [4, 5]. The hybrid weight and hyperparameter GA:

- Explores fine-grained local minima via weight mutation.
- Adjusts macro-level learning dynamics via learning rate and batch size evolution.
- Maintains generalization and adaptability between datasets without the need for manual adjustments.

Each chromosome is initialized with random weights and randomly sampled hyperparameters. The ANN architecture remains fixed. The fitness of a chromosome is evaluated as follows:

- The encoded weights are unflattened and assigned to the ANN.
- The model is compiled using the learning rate of the chromosome and trained for a small number of local epochs on a stratified split of the training set.
- Validation accuracy is used as the fitness score.
- After evaluation, the updated model weights are flattened and stored back in the chromosome (a Lamarckian approach), allowing the learned representations to be passed on to the next generation.

The genetic operators used in the GA are as follows:

- Selection: Tournament selection is used to choose parents according to their fitness.
- Crossover: Combines the learning rates of the parents (geometric mean), mixes their batch sizes and averages their weights to create offspring.
- Mutation: Randomly perturbs the learning rate or the batch size, or injects Gaussian noise into the weights.

The best-performing individuals (elites) are preserved across generations to avoid regression. The population evolves iteratively over G generations (default value: 50), which improves validation performance.

After evolution, the best-performing chromosome is selected. The corresponding ANN is reconstructed with:

- Evolved weights
- Evolved hyperparameters
- Additional dropout and L2 regularization

The model is trained with early stopping on the combined training and validation sets, and then evaluated on the held-out test set.

3.5. Model Evaluation

The evaluation of the model is based on well-known metrics: accuracy, precision, recall, F1 score, AUC-ROC value and AUC-PR value. In addition to numerical scores, PR curves, ROC curves and prediction distributions are generated to visualize performance [25, 26]. To improve transparency, SHAP is used to analyze the contribution of each feature to the model predictions [23, 24]. Evaluation also includes the following aspects:

- Baseline model comparison: The performance of the proposed HybGANN model is compared to a baseline ANN trained on the same CTGAN-balanced dataset, but without GA optimization.
- Confusion matrix analysis: Confusion matrices are used to examine the distribution of true positives, false positives, true negatives, and false negatives, providing a detailed insight into the types of errors made by the model.
- Early stopping and generalization monitoring: To prevent overfitting, early stopping criteria are applied based on validation loss during training.
- Comparison of distribution between real and synthetic data: The quality of synthetic samples generated by CTGAN is evaluated using JSD, to verify whether the synthetic data closely resemble real samples of the minority class.
- Importance of PR AUC value: Given the imbalanced data, the PR AUC value is given more weight than the ROC AUC value in the final model ranking, as it better reflects the model's ability to identify the minority class correctly.

4. Experimental Setup and Results

4.1 Dataset Balancing with GANs

The proposed model is developed and tested on a large-scale real-world diabetes dataset comprising 253,680 cases and 21 features, with a significant disparity between the majority and minority classes that hinders the proper training of classification models.
To address this problem, the CTGAN (Conditional Tabular GAN) model, which specializes in generating synthetic tabular data with interdependent features and heterogeneous distributions, was used. Visual analysis using PCA showed significant overlap between the real (blue) and synthetic (orange) data, especially in the denser areas of the principal components, suggesting that CTGAN effectively learned the underlying structure of the real data, as shown in Fig. 2 (a). In addition, the wider distribution of synthetic points at the edges of PC1 and PC2 indicates that the model generated greater diversity, including underrepresented regions of the data space, a desirable feature for reducing overfitting and improving the generalization of models such as ANNs.

Quantitative assessment was performed using the JSD metric. As presented in Fig. 2 (b), 17 of 21 numerical features resulted in JSD < 0.1, indicating high statistical similarity between the real and synthetic distributions. In particular, critical health indicators such as High Blood Pressure (HighBP), Body Mass Index (BMI), and History of Coronary Heart Disease or Heart Attack (HeartDiseaseorAttack) were generated with high accuracy. However, three features, Age (JSD = 0.3523), MentHlth (JSD = 0.2613), and PhysHlth (JSD = 0.2294), showed high divergence, which may be a consequence of their complex and extreme distributions in the real dataset. These results suggest that, although CTGAN generated synthetic data in close agreement with the real data on most features, further improvements could be achieved with techniques such as conditional sampling or discretisation for the most divergent features.

4.2 Baseline ANN Performance

To set a reference point for classification performance without the aid of evolutionary optimization, a baseline ANN model was built using a fixed architecture with two hidden layers: 128 and 64 neurons, respectively.
The activation function used was ReLU, while the output layer contained a single neuron with sigmoid activation for binary classification. The training process included early stopping with validation loss monitoring to avoid overfitting. The model was trained on a balanced dataset with CTGAN-generated synthetic data.

The baseline ANN model achieved an accuracy of 0.7804, a precision of 0.6809, a recall of 0.5383, an F1 score of 0.6012, a ROC AUC value of 0.8294, and a PR AUC value of 0.6877. As seen in Fig. 3 (a), the results show moderate performance, with good discriminative ability (ROC AUC > 0.82) and moderate recall (0.5383), a metric that is crucial in medical tasks where the omission of positive cases must be minimized. As shown in Fig. 3 (b), the PR AUC value greater than 0.68 suggests that the model performs consistently even in the presence of class disparities.

Fig. 3 (d) shows the confusion matrix of the baseline ANN model. The true positives (TP = 7833) indicate that the model succeeds in identifying a significant number of samples with diabetes, but the false negatives (FN = 6719) and false positives (FP = 3671) remain high. As seen in Fig. 3 (c), the baseline ANN model has significant overlap between the positive and negative classes, with many positive cases (Diabetes) having a predicted probability ≤ 0.5, resulting in low recall and many false negatives.

To explore the decision-making basis of the model and to understand the relative importance of the features, SHAP analysis was used. Fig. 4 shows the impact of each feature on model decisions for each patient:

- GenHlth was the most influential feature. High values increase the predicted probability of diabetes (positive SHAP), while lower values push the prediction toward the non-diabetic class.
- HighChol and HighBP also contribute significantly to skewing the model toward the positive class, especially when patients present high values for these features.
- High BMI has a significant positive effect on the probability of diabetes, while high PhysActivity acts as a protective factor (reflected by negative SHAP values).
- Age and Income show that older and low-income individuals are more likely to be classified as diabetic, which is consistent with the epidemiological literature.
- Other factors such as HeartDiseaseorAttack, Stroke, DiffWalk and PhysHlth also increase the probability of being diagnosed with diabetes, reflecting the coexistence of health conditions.
- Features such as CholCheck, Smoker, HvyAlcoholConsump and NoDocbcCost had limited impact on the model's decisions, ranking last on the importance list according to SHAP.

In summary, GenHlth, HighChol, HighBP, BMI, and Age were the main contributors to the baseline ANN's decisions, with the greatest positive impact on the likelihood of the positive class (i.e., people with diabetes). For example, high values of GenHlth (implying poor general health) drove the model's decision toward a diagnosis of diabetes, whereas low values had a protective effect. PhysActivity and Income had the opposite impact on classification: higher physical activity or higher income contributed to a decrease in the probability of being classified as diabetic.

To better understand the effect of data balancing with CTGAN, the JSD between the real and synthetic distributions was analyzed for several features and compared with each feature's impact on the ANN model according to SHAP.

- Age (JSD = 0.3523). One of the features with the largest divergence between real and synthetic data. SHAP also confirmed a large impact on decision making, indicating that CTGAN added useful variation to this feature and that the ANN model used this information to improve classification.
- PhysHlth (JSD = 0.2294).
This feature also showed high agreement between JSD and SHAP, with high values contributing significantly to classification. This supports the idea that CTGAN provided useful data that improved class distinction.
- MentHlth (JSD = 0.2613). While this feature exhibited significant JSD divergence, its impact on SHAP was more moderate. This may indicate that although CTGAN created new variation in this feature, the model did not identify it as important for decision making, perhaps due to a lack of direct correlation with the class or to complex interactions with other features.

4.3 Proposed Hybrid Model HybGANN

To improve the performance of the classification model, we propose HybGANN, a novel hybrid approach. Its Lamarckian inheritance preserves what each network has learned during local training and significantly improves convergence to the best solutions. Once the evolutionary process was completed, the best configuration found by the algorithm was used for the final training of the ANN model with the combined training and validation data.

The model achieved an accuracy of 0.8319, a precision of 0.8960, a recall of 0.7510, an F1 score of 0.8171, a ROC AUC value of 0.9184, and a PR AUC value of 0.9268. The confusion matrix of the proposed HybGANN model, shown in Fig. 5 (d), demonstrates a significant improvement in the classification of people with diabetes. The model achieves a better balance between recall and precision due to the hybrid optimization approach, which combines CTGAN for data balancing and a GA for the optimization of neural network weights and hyperparameters. Compared to the baseline ANN model, HybGANN significantly reduces the number of classification errors and increases the ability to detect the minority class.

In all major evaluation metrics, HybGANN outperforms the baseline ANN model described in Section 4.2. The significant increase in recall and F1 score suggests that the proposed model is more effective in identifying diabetes-positive cases, while reducing minority class classification errors.
The improvement in PR AUC, shown in Fig. 5 (b), confirms that HybGANN better manages the trade-off between precision and recall in a class-imbalanced context. As shown in Fig. 5 (a), the ROC AUC shows a more modest improvement over the baseline model (from 0.8294 to 0.9184), while the histograms of predicted probabilities in Fig. 5 (c) show a clearer separation between classes for HybGANN than for the baseline ANN model.

These improvements can be attributed to the evolutionary optimization mechanism, which not only finds better configurations but also helps to avoid both stalling in local minima and overfitting. The integration of the Lamarckian approach allows the model to retain the skills acquired during partial training, providing an effective combination of global exploration and local adaptation.

In addition to its high performance, HybGANN also shows a high level of interpretability, as demonstrated by the SHAP analysis. The most influential feature turned out to be MentHlth, with a wide distribution of SHAP values, from approximately -0.2 to +0.8. This shows that a high MentHlth value (many days of poor mental health) pushes the prediction strongly toward the positive class, while few or no such days lower this probability. Next, GenHlth had a strong influence, especially when the general condition was poor. Clinical features such as BMI, HighChol and HighBP still contributed significantly, while Age ranked sixth in importance.

Based on SHAP values, HybGANN incorporates average variations, especially for features such as MentHlth and GenHlth. For HighChol and HighBP, the impact is mainly positive, implying that the presence of these conditions significantly increases the probability of a positive diagnosis.

The combined JSD-SHAP analysis shows that features whose distributions changed substantially under CTGAN (such as MentHlth and Age) also have high SHAP importance, indicating that the synthetic data have enriched the model's understanding of these features.
In contrast, PhysHlth presented a high JSD but a low SHAP importance, suggesting that CTGAN may have added uninformative examples or outliers that the model does not use effectively. On the other hand, stable features such as GenHlth, BMI and HighChol maintain a high impact, although they were not greatly affected by the CTGAN synthetic data.

5. Discussion and interpretation

5.1 Comparative Performance Analysis

To evaluate the impact of the proposed HybGANN architecture, a direct comparison was performed with the baseline ANN model, trained on the same CTGAN-balanced dataset but without the GA. Table 1 presents the final test metrics of both models. HybGANN achieves improvements in all key performance metrics.

Table 1 Performance Comparison Between Baseline ANN and HybGANN Models

| Metric | Baseline ANN | HybGANN | Improvement |
|---|---|---|---|
| Accuracy | 0.7804 | 0.8319 | +0.0515 |
| Precision | 0.6809 | 0.8960 | +0.2151 |
| Recall | 0.5383 | 0.7510 | +0.2127 |
| F1 Score | 0.6012 | 0.8171 | +0.2159 |
| ROC AUC | 0.8294 | 0.9184 | +0.0890 |
| PR AUC | 0.6877 | 0.9268 | +0.2391 |

The most significant improvements are seen in precision, recall, F1 score and PR AUC, the metrics that matter most in class-imbalanced scenarios such as diabetes prediction, where the positive class is in the minority. In particular, the PR AUC increased from 0.6877 to 0.9268, a significant change demonstrating that HybGANN produces significantly fewer false positives and false negatives and achieves a better balance between recall and precision. In addition, the ROC AUC improves by almost 9 percentage points, indicating a higher ability of the HybGANN model to distinguish between classes. The high F1 score of HybGANN (0.8171 compared to 0.6012 for the baseline ANN) shows that this model does not sacrifice recall for precision but balances both objectives. This is especially important in medical applications, where accurate identification of positive cases is crucial.
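The F1 scores in Table 1 can be cross-checked against the reported precision and recall, since F1 is their harmonic mean. The short Python sketch below recomputes F1 for both models and derives the per-metric gains from the table values. The names `TABLE_1`, `f1_from` and `improvements` are illustrative and do not come from the authors' code.

```python
# Values copied from Table 1 as (baseline ANN, HybGANN).
# All names here are illustrative, not the authors' code.
TABLE_1 = {
    "Accuracy": (0.7804, 0.8319),
    "Precision": (0.6809, 0.8960),
    "Recall": (0.5383, 0.7510),
    "F1 Score": (0.6012, 0.8171),
    "ROC AUC": (0.8294, 0.9184),
    "PR AUC": (0.6877, 0.9268),
}


def f1_from(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def improvements(table):
    """Absolute gain of HybGANN over the baseline for each metric."""
    return {name: round(hyb - base, 4) for name, (base, hyb) in table.items()}


if __name__ == "__main__":
    for col, label in ((0, "Baseline ANN"), (1, "HybGANN")):
        p = TABLE_1["Precision"][col]
        r = TABLE_1["Recall"][col]
        reported = TABLE_1["F1 Score"][col]
        print(f"{label}: recomputed F1 = {f1_from(p, r):.4f}, reported = {reported:.4f}")
    for name, gain in improvements(TABLE_1).items():
        print(f"{name}: +{gain:.4f}")
```

Because the inputs are rounded to four decimals, the recomputed F1 can differ from the reported value in the last digit.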
The results in Table 1 confirm that the use of Lamarckian genetic optimization, together with GAN-based data balancing, significantly improves the classification capability of the model in real settings with imbalanced class distributions.

5.2 GA

A key element of the HybGANN architecture is the evolutionary optimization of the ANN using a GA with a Lamarckian approach. This component contributes significantly to the improvement of the overall model performance, as demonstrated in the previous comparative analysis. Unlike traditional ANN training approaches, which rely on gradient-based algorithms and local optimization, evolutionary optimization explores the space of possible configurations globally, which helps to avoid stagnation at local minima and to find good model configurations more efficiently [3]. This process involves encoding the network structure, initial weights, learning rate and batch size, which then evolve through selection and mutation mechanisms. In particular, the following parameters have a direct and significant impact on training quality and final network performance: a tuned learning rate, which is better adapted to the data structure and increases the stability of training; an optimal batch size, which affects convergence and the fluctuations of the loss function; and optimized initial weights in place of random initialization.

In our approach, the use of Lamarckism offers an additional advantage: during evolution, individuals are partially trained and their improved weights are incorporated into the following generations. This form of acquired inheritance speeds convergence to superior solutions, effectively combining global exploration with local improvement.

The contribution of this mechanism is reflected in the experimental results: Recall increases significantly (from 0.5383 to 0.7510), improving sensitivity in minority class classification.
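The Lamarckian write-back described in this subsection can be illustrated with a small toy example. The sketch below is not the authors' implementation: a two-parameter linear model stands in for the ANN, and the genome holds only a learning rate and the weights (architecture and batch size are omitted). Truncation selection, uniform crossover and Gaussian mutation are assumptions chosen for brevity. The Lamarckian part is the line that stores the partially trained weights back into each individual before selection.

```python
import random

# Toy data for y = 2x + 1; a stand-in for a real training set (assumption).
DATA = [(k / 10.0, 2.0 * (k / 10.0) + 1.0) for k in range(-10, 11)]


def loss(weights):
    """Mean squared error of the linear model w*x + b on DATA."""
    w, b = weights
    return sum((w * x + b - y) ** 2 for x, y in DATA) / len(DATA)


def partial_train(weights, lr, steps=5):
    """A few gradient-descent steps: the 'lifetime learning' phase."""
    w, b = weights
    n = len(DATA)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in DATA) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in DATA) / n
        w, b = w - lr * grad_w, b - lr * grad_b
    return [w, b]


def random_individual(rng):
    """Genome: a learning rate in [1e-3, 1e-1] plus initial weights."""
    return {"lr": 10 ** rng.uniform(-3, -1),
            "weights": [rng.uniform(-1, 1), rng.uniform(-1, 1)]}


def crossover(a, b, rng):
    """Uniform crossover on the weights; learning rate from either parent."""
    child_w = [rng.choice(pair) for pair in zip(a["weights"], b["weights"])]
    return {"lr": rng.choice([a["lr"], b["lr"]]), "weights": child_w}


def mutate(ind, rng, sigma=0.1):
    """Gaussian noise on weights; multiplicative noise on lr, clamped."""
    lr = min(0.3, max(1e-4, ind["lr"] * 2 ** rng.gauss(0, 0.5)))
    return {"lr": lr, "weights": [w + rng.gauss(0, sigma) for w in ind["weights"]]}


def lamarckian_ga(pop_size=12, generations=15, seed=0):
    rng = random.Random(seed)
    population = [random_individual(rng) for _ in range(pop_size)]
    history = []  # best loss per generation, measured after partial training
    for _ in range(generations):
        # Lamarckian step: the trained weights overwrite the genome itself,
        # so offspring inherit what their parents learned.
        for ind in population:
            ind["weights"] = partial_train(ind["weights"], ind["lr"])
        population.sort(key=lambda ind: loss(ind["weights"]))
        history.append(loss(population[0]["weights"]))
        parents = population[: pop_size // 2]  # truncation selection (elitist)
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(pop_size - len(parents))
        ]
        population = parents + children
    best = min(population, key=lambda ind: loss(ind["weights"]))
    return best, history


if __name__ == "__main__":
    best, history = lamarckian_ga()
    print("best loss per generation:", [f"{h:.2e}" for h in history])
    print("best individual:", best)
```

In a Darwinian variant, `partial_train` would be used only to score an individual and the original weights would be kept in the genome. Writing the trained weights back is what lets later generations start from already improved solutions.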
Alongside the gain in recall, the PR AUC increased from 0.6877 to 0.9268, demonstrating better management of the balance between precision and recall. The F1 score improved from 0.6012 to 0.8171, reflecting a better balance between type I and type II errors.

Besides the increased performance, another advantage is the reduction of manual intervention. Instead of determining optimal configurations through repeated and costly hyperparameter tuning trials, the algorithm automatically evolves the best solutions, which increases the overall efficiency of model development.

5.3 Interpretability with SHAP

To evaluate how our model classifies and which features have the greatest impact on the predicted class, the SHAP method was applied to both the baseline ANN and the HybGANN framework. This analysis enables us to understand which features contribute most to the classification of an instance as diabetic and in which direction (positive or negative).

5.3.1 Comparison of Feature Importance

As can be seen in Table 2, the HybGANN model not only retains traditional features (such as HighBP and HighChol), but also identifies new high-impact features, such as MentHlth and PhysHlth, which the baseline ANN model made virtually no use of. This suggests that CTGAN has helped enrich the data with cases with more diverse values, while the GA has increased the sensitivity of the model to these features.

Table 2 Features with the highest impact according to SHAP

| Feature | HybGANN (SHAP) | Baseline ANN (SHAP) |
|---|---|---|
| MentHlth | ~0 to 0.8 | ~-0.05 to 0.05 |
| GenHlth | ~0.4 | ~0.2 |
| BMI | ~0 to 0.3 | ~-0.05 to 0.1 |
| HighChol | ~0.25 | ~0.15 |
| HighBP | ~0.18 | ~0.10 |
| Age | ~0.20 | ~0.05 |

5.3.2 Analysis of features with high JSD and their influence on model decision-making using SHAP

To better understand how the synthetic data generated by CTGAN may differ from the real data, the JSD was calculated for each feature. High JSD values indicate larger discrepancies between the distributions of the real and synthetic data.
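A per-feature JSD of this kind can be computed with a few lines of Python. The sketch below is a minimal version under stated assumptions: the feature is discrete (as integer-coded survey features are), the logarithm is base 2 so the result lies in [0, 1], and the value returned is the divergence rather than its square root (the Jensen-Shannon distance). The source does not state which base or variant was used, and the function name `js_divergence` and the sample values are hypothetical.

```python
import math
from collections import Counter


def js_divergence(real, synthetic, base=2.0):
    """Jensen-Shannon divergence between two samples of a discrete feature.

    Assumption: values are categorical or integer-coded; a continuous
    feature would need to be binned before calling this function.
    """
    support = sorted(set(real) | set(synthetic))
    p_counts, q_counts = Counter(real), Counter(synthetic)
    p = [p_counts[v] / len(real) for v in support]
    q = [q_counts[v] / len(synthetic) for v in support]
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # mixture distribution

    def kl(a, b):
        # Terms with a_i == 0 contribute 0 by convention; b_i > 0 whenever
        # a_i > 0 because b is the mixture of both distributions.
        return sum(ai * math.log(ai / bi, base) for ai, bi in zip(a, b) if ai > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


if __name__ == "__main__":
    # Hypothetical age-category samples, for illustration only.
    real_age = [9, 10, 10, 11, 12, 12, 13]
    synthetic_age = [5, 6, 8, 9, 10, 10, 11]
    print(f"JSD(Age) = {js_divergence(real_age, synthetic_age):.4f}")
```

With base 2, identical samples give 0 and samples with no values in common give 1, which makes scores comparable across features.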
These differences between the real and synthetic distributions can affect model behavior and decision thresholds, especially in sensitive domains such as medical diagnostics. In this subsection, we focus on the features with the greatest divergence and examine their impact on the model's decision-making process using SHAP. By combining statistical analysis of divergence with model interpretability techniques, we seek to better understand how these features contribute to predictions and whether their synthetic representation introduces any bias or modifies their significance.

A combined analysis was performed using:
- JSD: to measure the degree of change in the distribution of each feature between the real data and the synthetic data generated by CTGAN.
- SHAP: to assess the impact of each feature on the decision-making process of the model.

The SHAP values of the proposed model were compared with those of the baseline ANN. This combination seeks to detect features that were not important in the baseline model but which, after optimization with the GA, have acquired a decisive weight in the final model.

Table 3 presents three features with high JSD values, comparing them in terms of their SHAP impact on the HybGANN and baseline ANN models. The interpretations help us understand how synthetic balancing has contributed to strengthening the importance of these features:

Table 3 Features with highest JSD and their SHAP importance across models

| Feature | JSD | HybGANN (SHAP) | Baseline ANN (SHAP) |
|---|---|---|---|
| MentHlth | 0.2613 | ~0.8 (very important) | ~0.05 (unused) |
| PhysHlth | 0.2294 | ~0.10 (medium to high) | ~0.2 (low) |
| Age | 0.3523 | ~0.20 (high) | ~0.5 (medium) |

In all of the above cases, features with high JSD not only shifted in distribution in the generated data, but also had a real impact on the predictions of the HybGANN model. This indicates that CTGAN has helped to incorporate useful new information, while the GA has learned to use it to improve decision-making.
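To make the SHAP attributions used in this analysis more concrete, the sketch below computes exact Shapley values by brute force for a tiny hand-written risk function. It is an illustration only: the coefficients in `toy_risk` are hypothetical, "missing" features are replaced by a single reference instance rather than averaged over a background dataset, and enumerating all coalitions is feasible only for a handful of features. SHAP libraries approximate the same quantity for real models.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, instance, background):
    """Exact Shapley values for one instance.

    Features outside a coalition take their value from `background`
    (a single reference instance; a simplifying assumption).
    """
    names = list(instance)
    n = len(names)

    def value(coalition):
        x = {f: (instance[f] if f in coalition else background[f]) for f in names}
        return predict(x)

    phi = {}
    for f in names:
        others = [g for g in names if g != f]
        total = 0.0
        for k in range(len(others) + 1):
            # Shapley weight for coalitions of size k that exclude f.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = set(subset)
                total += weight * (value(s | {f}) - value(s))
        phi[f] = total
    return phi


def toy_risk(x):
    """Hypothetical risk score with one interaction term (not a fitted model)."""
    return (0.05 + 0.15 * x["HighBP"] + 0.01 * (x["BMI"] - 25)
            + 0.04 * x["GenHlth"] + 0.02 * x["HighBP"] * x["GenHlth"])


if __name__ == "__main__":
    patient = {"HighBP": 1, "BMI": 35, "GenHlth": 4}
    reference = {"HighBP": 0, "BMI": 25, "GenHlth": 2}
    phi = shapley_values(toy_risk, patient, reference)
    for name, contribution in phi.items():
        print(f"{name}: {contribution:+.4f}")
    # Efficiency property: contributions sum to f(patient) - f(reference).
    print("sum of contributions:", round(sum(phi.values()), 6))
    print("prediction difference:", round(toy_risk(patient) - toy_risk(reference), 6))
```

A positive contribution pushes the score above the reference prediction and a negative one pulls it below, which is how the sign of a SHAP value is read in the tables above.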
The combined JSD-SHAP result suggests that a high JSD value accompanied by an increase in SHAP importance for a feature is a positive signal, indicating a direct benefit to model performance beyond a mere change in the feature's frequency distribution. The SHAP analysis confirms that the proposed model not only improves classification performance, but also increases interpretability, making HybGANN a more transparent and reliable model, especially in scenarios where interpretability is essential, such as medical diagnostics.

5.3.3 Limitations and potential improvements

The proposed model offered improved performance and a detailed interpretation of the results, but several limitations remain to be addressed in future work. Testing the model in an actual clinical application requires rigorous validation with new datasets from different sources, or even different populations. Also, the performance of the model is closely tied to the quality of the synthetic data generated by CTGAN; as future work, other GANs for tabular data will be integrated. While the Lamarckian GA hybrid offers advanced optimization, it is also costly in terms of time; Bayesian optimization could provide a better balance between efficiency and performance.

6. Conclusion

This paper proposes a novel hybrid model for diabetes prediction, which integrates advanced AI methods in a way that has not been explored previously in the existing literature. The proposed model, called HybGANN, combines synthetic data generation with CTGAN for class balancing with a hybrid GA that optimizes the initial weights and tunes the hyperparameters of an ANN. This approach seeks to address the difficulties that arise when handling imbalanced classes and to increase prediction accuracy in medical applications. Experiments have shown that the HybGANN model outperforms, on all key classification metrics, a baseline ANN that uses the same CTGAN-balanced dataset but does not have an integrated GA.
The use of SHAP to analyze the decision-making of the model has shown that this model is not only more accurate, but also more interpretable, giving greater importance to features whose distributions changed under synthetic balancing. Features with high JSD, such as MentHlth, PhysHlth, and Age, have shown a significant increase in their importance in the model's decision-making, reflecting the positive impact of the data generated by CTGAN. This combination of improved data quality and an optimized learning mechanism indicates a breakthrough in building smarter systems for medical diagnostics. The results highlight the importance of a comprehensive approach to produce models that are not only more efficient, but also more reliable for practical use. This paper also lays the groundwork for future research that could extend this approach to other medical domains or explore other variants of GANs or evolutionary techniques. Overall, the analysis shows that HybGANN not only improves performance compared to traditional models, but also manages to identify new factors of diagnostic importance.
Abbreviations

- ANN: Artificial Neural Network
- AnyHealthcare: Has Any Form of Healthcare Coverage
- BMI: Body Mass Index
- CholCheck: Cholesterol Check (had cholesterol checked recently)
- CTGAN: Conditional Tabular Generative Adversarial Network
- DiffWalk: Difficulty Walking
- Education: Education Level
- Fruits: Fruit Consumption
- GA: Genetic Algorithm
- GAN: Generative Adversarial Network
- GenHlth: General Health
- HeartDiseaseorAttack: History of Coronary Heart Disease or Heart Attack
- HighBP: High Blood Pressure
- HighChol: High Cholesterol
- HvyAlcoholConsump: Heavy Alcohol Consumption
- HybGANN: Hybrid GA with ANN using GAN
- JSD: Jensen-Shannon Divergence
- MentHlth: Mental Health
- NoDocbcCost: No Doctor Because of Cost
- PCA: Principal Components Analysis
- PhysActivity: Physical Activity
- PhysHlth: Physical Health
- PR AUC: Precision-Recall Area Under the Curve
- ROC AUC: Receiver Operating Characteristic Area Under the Curve
- SHAP: SHapley Additive exPlanations
- Smoker: Smoking Status
- Stroke: History of Stroke
- Veggies: Vegetable Consumption

Declarations

Ethics approval and consent to participate: Not applicable.

Consent for publication: Not applicable.

Availability of data and materials: The datasets generated and/or analysed during the current study are available in the UC Irvine Machine Learning Repository (CDC Diabetes Health Indicators) and on Kaggle: https://www.kaggle.com/datasets/alexteboul/diabetes-health-indicators-dataset

Competing interests: The authors declare that they have no competing interests.

Funding: Not applicable.

Author contributions: All authors contributed equally to the conceptualization, methodology, analysis, and writing of the manuscript. All authors have read and approved the final version.

Acknowledgements: Not applicable.

References

Salmi, Mabrouka, Dalia Atif, Diego Oliva, Ajith Abraham, and Sebastian Ventura. "Handling imbalanced medical datasets: review of a decade of research." Artificial Intelligence Review 57, no. 10 (2024): 273.
Ramesh, Jayroop, Raafat Aburukba, and Assim Sagahyroon. "A remote healthcare monitoring framework for diabetes prediction using machine learning." Healthcare Technology Letters 8, no. 3 (2021): 45-57.
Sampson, Jeffrey R. "Adaptation in natural and artificial systems (John H. Holland)." (1976): 529.
Bergstra, James, and Yoshua Bengio. "Random search for hyper-parameter optimization." The journal of machine learning research 13, no. 1 (2012): 281-305.
Rajendra, Priyanka, and Shahram Latifi. "Prediction of diabetes using logistic regression and ensemble techniques." Computer Methods and Programs in Biomedicine Update 1 (2021): 100032.
Azad, Chandrashekhar, Bharat Bhushan, Rohit Sharma, Achyut Shankar, Krishna Kant Singh, and Aditya Khamparia. "Prediction model using SMOTE, genetic algorithm and decision tree (PMSGD) for classification of diabetes mellitus." Multimedia Systems (2022): 1-19.
Chawla, Nitesh V., Kevin W. Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer. "SMOTE: synthetic minority over-sampling technique." Journal of artificial intelligence research 16 (2002): 321-357.
Figueira, Alvaro, and Bruno Vaz. "Survey on synthetic data generation, evaluation methods and GANs." Mathematics 10, no. 15 (2022): 2733.
Sauber-Cole, Rick, and Taghi M. Khoshgoftaar. "The use of generative adversarial networks to alleviate class imbalance in tabular data: a survey." Journal of Big Data 9, no. 1 (2022): 98.
Bourou, Stavroula, Andreas El Saer, Terpsichori-Helen Velivassaki, Artemis Voulkidis, and Theodore Zahariadis. "A review of tabular data synthesis using GANs on an IDS dataset." Information 12, no. 09 (2021): 375.
Borisov, Vadim, Tobias Leemann, Kathrin Seßler, Johannes Haug, Martin Pawelczyk, and Gjergji Kasneci. "Deep neural networks and tabular data: A survey." IEEE transactions on neural networks and learning systems (2022).
Xu, Lei, M. Skoularidou, A. Cuesta-Infante, and K. Veeramachaneni. "Modeling tabular data using conditional GAN." arXiv preprint arXiv:1907.00503 (2019).
Gulrajani, Ishaan, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C. Courville. "Improved training of wasserstein gans." Advances in neural information processing systems 30 (2017).
Rajabi, Amirarsalan, and Ozlem Ozmen Garibay. "Tabfairgan: Fair tabular data generation with generative adversarial networks." Machine Learning and Knowledge Extraction 4, no. 2 (2022): 488-501.
Iranmanesh, Seyed Mehdi, and Nasser M. Nasrabadi. "HGAN: Hybrid generative adversarial network." Journal of Intelligent & Fuzzy Systems 40, no. 5 (2021): 8927-8938.
Gayathri, R. G., Atul Sajjanhar, and Yong Xiang. "Hybrid deep learning model using spcagan augmentation for insider threat analysis." Expert Systems with Applications 249 (2024): 123533.
Brock, Andrew, Jeff Donahue, and Karen Simonyan. "Large scale GAN training for high fidelity natural image synthesis." arXiv preprint arXiv:1809.11096 (2018).
Tümay Ateş, Kübra, İbrahim Erdem Kalkan, and Cenk Şahin. "Training Artificial Neural Network with a Cultural Algorithm." Neural Processing Letters 56, no. 5 (2024): 225.
Arroyo, Jan Carlo T., and Allemar Jhone P. Delima. "An optimized neural network using genetic algorithm for cardiovascular disease prediction." Journal of Advances in Information Technology 13, no. 1 (2022).
Felizardo, Virginie, Nuno M. Garcia, Nuno Pombo, and Imen Megdiche. "Data-based algorithms and models using diabetics real data for blood glucose and hypoglycaemia prediction-a systematic literature review." Artificial Intelligence in Medicine 118 (2021): 102120.
Jolliffe, Ian. "Principal component analysis." In International encyclopedia of statistical science, pp. 1094-1096. Springer, Berlin, Heidelberg, 2011.
Lin, Jianhua. "Divergence measures based on the Shannon entropy." IEEE Transactions on Information Theory 37, no. 1 (1991): 145-151. doi: 10.1109/18.61115.
Lundberg, Scott M., and Su-In Lee. "A unified approach to interpreting model predictions." Advances in neural information processing systems 30 (2017).
Lundberg, Scott M., Gabriel Erion, Hugh Chen, Alex DeGrave, Jordan M. Prutkin, Bala Nair, Ronit Katz, Jonathan Himmelfarb, Nisha Bansal, and Su-In Lee. "From local explanations to global understanding with explainable AI for trees." Nature machine intelligence 2, no. 1 (2020): 56-67.
Davis, Jesse, and Mark Goadrich. "The relationship between Precision-Recall and ROC curves." In Proceedings of the 23rd international conference on Machine learning, pp. 233-240. 2006.
Fawcett, Tom. "An introduction to ROC analysis." Pattern recognition letters 27, no. 8 (2006): 861-874.
Yao, Xin. "Evolving artificial neural networks." Proceedings of the IEEE 87, no. 9 (1999): 1423-1447.
Singh, Vijendra, Vijayan K. Asari, and Rajkumar Rajasekaran. "A deep neural network for early detection and prediction of chronic kidney disease." Diagnostics 12, no. 1 (2022): 116.
Feng, Xin, Yihuai Cai, and Ruihao Xin. "Optimizing diabetes classification with a machine learning-based framework." BMC bioinformatics 24, no. 1 (2023): 428.
Jaiswal, Sushma, and Priyanka Gupta. "GLSTM: a novel approach for prediction of real & synthetic PID diabetes data using GANs and LSTM classification model." Int J Exp Res Rev 30 (2023): 32-45.
Mishra, Sushruta, Hrudaya Kumar Tripathy, Pradeep Kumar Mallick, Akash Kumar Bhoi, and Paolo Barsocchi. "EAGA-MLP-an enhanced and adaptive hybrid classification model for diabetes diagnosis." Sensors 20, no. 14 (2020): 4036.
Pekel Özmen, Ebru, and Tuncay Özcan. "Diagnosis of diabetes mellitus using artificial neural network and classification and regression tree optimized with genetic algorithm." Journal of Forecasting 39, no. 4 (2020): 661-670.
Vu, Ly, and Quang Uy Nguyen. "Handling imbalanced data in intrusion detection systems using generative adversarial networks." Journal of Research and Development on Information and Communication Technology 2020, no. 1 (2020): 1-13.
Xiao, Yawen, Jun Wu, and Zongli Lin. "Cancer diagnosis using generative adversarial networks based on deep learning from imbalanced data." Computers in Biology and Medicine 135 (2021): 104540.
Teboul, Alex. 2021. Diabetes Health Indicators Dataset. Kaggle. https://www.kaggle.com/datasets/alexteboul/diabetes-health-indicators-dataset
However, most previous studies apply GA to hyperparameter selection or feature selection, while the inclusion of a full optimization of weights and hyperparameters in the same process is still an emerging field [\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e, \u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e, \u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e, \u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e27\u003c/span\u003e].\u003c/p\u003e\u003cp\u003eIn this paper, we present a novel and innovative model termed HybGANN (Hybrid GA with ANN using GAN) that represents a hybrid approach to a GA that in parallel optimises weights and tunes hyperparameters to train an ANN. This improves the prediction of diabetes in the early stages of the disease by integrating advanced data augmentation techniques into imbalanced medical datasets. This Lamarckian approach, which uses weight updates and hyperparameter tuning during training, to reflect changes in future generations, has not previously been used in the literature for diabetes prediction with imbalanced data. The model was trained on a large-scale real-world diabetes prediction dataset from the UCI Machine Learning Repository [\u003cspan citationid=\"CR35\" class=\"CitationRef\"\u003e35\u003c/span\u003e], which contains over 250,000 samples with an imbalance ratio of approximately 6:1 in favor of the negative (non-diabetic) class. To address this imbalance, CTGAN is used to generate synthetic samples only for the minority class, creating a balanced dataset. 
The quality of the synthetic data is assessed by visual analysis such as Principal Components Analysis (PCA) and statistical measures such as Jensen-Shannon Divergence (JSD), which confirm a high approximation to the original data [\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e].\u003c/p\u003e\u003cp\u003eEmpirical results have shown that HybGANN significantly outperforms basic neural network models trained without evolutionary optimization and data balancing. Improvements are observed in all key metrics, especially in recall and F1 score, crucial in medical applications to minimize errors in the classification of patients with diabetes [\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e]. Furthermore, interpretability analysis with SHapley Additive exPlanations (SHAP) demonstrates that the model is not only accurate, but also transparent, identifying the most important features that affect its decisions.\u003c/p\u003e\u003cp\u003eThe main contributions of our work are as follows:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003eWe propose HybGANN, a novel hybrid framework that integrates CTGAN for class balancing and a GA for simultaneous optimization of neural network weights and hyperparameter tuning. 
To the best of our knowledge, this two-level GA optimization approach has not been previously applied to diabetes prediction tasks.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eWe demonstrate that integrating GA at both levels - optimizing internal network weights and tuning critical hyperparameters such as learning rate and batch size - results in significantly better predictive performance compared to a traditional ANN trained on CTGAN-balanced data.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eWe undertake an extensive evaluation on a large and highly imbalanced diabetes dataset, achieving improved performance in multiple metrics, especially in terms of recall and F1 score, which are essential in medical diagnostics where accurate identification of diabetic patients is crucial.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eA detailed comparison with the ANN baseline highlights the effectiveness of the HybGANN approach in reducing false negatives while maintaining a high true positive rate, making it more reliable for early diabetes detection.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eWe support our findings through comprehensive visualization techniques, such as confusion matrices, ROC and PR curves, and SHAP summary plots, ensuring that our model is not only robust but also interpretable, transparent, and reusable for future research on imbalanced medical classification.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003eThe remaining sections of this work are organized as follows. In Section 2, we discuss the existing methods for handling imbalanced data and the use of GA and neural networks in disease diagnosis. Section 3 explains the HybGANN model in detail, including the neural network architecture, the Lamarckian weight optimization mechanism and the hyperparameter selection process. 
Empirical results on the diabetes prediction dataset are presented in section 4, comparing the performance of HybGANN with benchmark models and alternative data balancing methods. Section 5 addresses the practical implications of the findings, limitations of the study and provides suggestions for future research.\u003c/p\u003e"},{"header":"2. Related Work","content":"\u003cp\u003eRecent research has explored the integration of advanced ML, deep learning (DL) and generative models, showing improvement in disease diagnosis, addressing common challenges such as data class imbalance, noise and a large number of features.\u003c/p\u003e\u003cp\u003eIn [\u003cspan citationid=\"CR28\" class=\"CitationRef\"\u003e28\u003c/span\u003e], the authors created a deep neural network for the detection of chronic kidney disease (CKD) using Recursive Feature Elimination (RFE) for all critical features such as Haemoglobin and serum level. The 12-layer neural network was shown to outperform conventional classifiers such as Support Vector Machine, K-Nearest Neighbors, Log Regression, Random Forest and Na\u0026iuml;ve Bayes, achieving 100% accuracy.\u003c/p\u003e\u003cp\u003eSimilarly, several studies have focused on diabetes prediction using hybrid models and data augmentation strategies. In the paper [\u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e29\u003c/span\u003e], the authors proposed a new resource for diabetes classification. The proposed framework uses SMOTEENN to balance the dataset and introduces a DCSGAN classifier that is based on the GAN model for synthetic data generation, achieving an accuracy of 96.27%. 
This study also performed feature analysis using logistic regression and identified critical biomarkers such as glucose, body mass index (BMI), and Diabetes Pedigree Function.\u003c/p\u003e\u003cp\u003eAnother contribution to diabetes prediction is the GLSTM model [\u003cspan citationid=\"CR30\" class=\"CitationRef\"\u003e30\u003c/span\u003e], which combines GAN-based tabular data augmentation with an LSTM network for classification. The study addressed the challenges associated with sensitive and spatial data by generating different synthetic sets using multiple GAN architectures (CTGAN, Vanilla GAN, Gaussian Copula GAN, etc.). The synthetic data showed a strong correlation (0.93) with the original data. Trained on both real and synthetic data, the GLSTM model achieved 97% accuracy, outperforming models trained solely on real data. This demonstrated the value of synthetic data in protecting patient privacy, keeping care activities private, and improving prediction performance.\u003c/p\u003e\u003cp\u003eIn paper [\u003cspan citationid=\"CR31\" class=\"CitationRef\"\u003e31\u003c/span\u003e], the authors used an enhanced adaptive genetic algorithm (EAGA) to remove irrelevant features from medical datasets before feeding them to a Multilayer Perceptron (MLP) classifier. This GA model adjusted crossover and mutation probabilities and refined the fitness function for better feature selection. The EAGA-MLP model achieved a high accuracy of 97.76%, demonstrating the effectiveness of the model in different clinical scenarios.\u003c/p\u003e\u003cp\u003eAlso, the study in paper [\u003cspan citationid=\"CR32\" class=\"CitationRef\"\u003e32\u003c/span\u003e] proposed hybrid approaches - ANN-GA and CART-GA - by integrating GAs into ANN and Classification and Regression Tree (CART) models to optimise their parameters. 
Tested on the Pima Indian Diabetes dataset, the CART-GA variant achieved superior performance, recording an accuracy of 96.05% in hold-out validation and 93.42% in 10-fold validation. The use of GA improved the performance, showing the impact of optimisation on a predictive model.\u003c/p\u003e\u003cp\u003eGAN models have also been effectively used in other areas such as intrusion detection and cancer prediction. In paper [53], the challenge of imbalanced datasets in Intrusion Detection Systems (IDS) was addressed by using GANs to synthesize sparse attack data. ACGAN and ACGAN-SVM models were used to generate realistic attack samples, which were then combined with real datasets (NSL-KDD, CICIDS2017, etc.) to train classifiers such as Decision Trees, Random Forests, and SVM. The results showed that the datasets augmented by GANs improved the classifier performance more than traditional resampling techniques such as SMOTE-SVM or Tomek Links. ACGAN-SVM was particularly successful in filtering noisy synthetic data, thereby increasing detection accuracy.\u003c/p\u003e\u003cp\u003eIn the field of cancer diagnosis, paper [\u003cspan citationid=\"CR34\" class=\"CitationRef\"\u003e34\u003c/span\u003e] has proposed an improved Wasserstein GAN (WGAN) for generating synthetic minority samples in cancer gene expression datasets, which often suffer from severe class imbalance. The study replaced convolutional layers with deep fully connected networks to better fit numerical non-imaging gene expression data. Feature selection was performed through differential gene expression analysis before training the WGAN. Results on three RNA-seq datasets (breast, lung, and gastric cancer) showed that WGAN outperformed traditional methods such as random oversampling and SMOTE in all metrics, including precision, recall, F1 score, and AUC value. This confirmed the ability of WGAN to generate meaningful synthetic samples that improve classification tasks.\u003c/p\u003e"},{"header":"3. 
Methodology","content":"\u003cdiv id=\"Sec4\" class=\"Section2\"\u003e\u003ch2\u003e3.1. Data Preparation and GAN-Based Balancing\u003c/h2\u003e\u003cp\u003eIn this article, we used the \"Diabetes Health Indicators\" dataset from the UCI Machine Learning Repository (data ID: 891) [\u003cspan citationid=\"CR35\" class=\"CitationRef\"\u003e35\u003c/span\u003e]. This dataset contains 253,680 cases and 21 features, including \u003cb\u003edemographic variables\u003c/b\u003e such as age, gender, educational level, etc., \u003cb\u003ebehavioral attributes\u003c/b\u003e such as smoking, physical activity, alcohol consumption, and \u003cb\u003ehealth indicators\u003c/b\u003e such as physical health, mental health, and BMI. The feature \u0026ldquo;Diabetes_binary\u0026rdquo; is the target, with 1 indicating positive cases of diabetes and 0 indicating negative cases. The biggest challenge in this dataset is the class imbalance, with a ratio of approximately 6:1 in favour of the majority class. This imbalance can affect the performance of ML models, especially in accurately identifying cases in the minority class, i.e. diabetes-positive cases. 
Identifying these cases is of particular importance in medicine: detecting positive cases is prioritized over detecting negative ones, since it is generally considered better to flag a healthy patient as positive than to classify a sick patient as negative and delay proper treatment.\u003c/p\u003e\u003cdiv id=\"Sec5\" class=\"Section3\"\u003e\u003ch2\u003e3.1.1 Data Collection and Preprocessing\u003c/h2\u003e\u003cp\u003eData analysis is used to understand the distribution of features, identify correlations between variables and detect possible outliers.\u003c/p\u003e\u003cp\u003eFeatures are normalized to a common scale, which facilitates efficient convergence during model training and prevents features with larger scales from disproportionately affecting the learning process.\u003c/p\u003e\u003cp\u003eThe preprocessed data are divided into training (70%), validation (15%) and testing (15%) subsets. The split preserves the original class distribution in all subsets.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec6\" class=\"Section3\"\u003e\u003ch2\u003e3.1.2 Tools and Technologies\u003c/h2\u003e\u003cp\u003eThe implementation was carried out using the Python programming language. Several libraries and frameworks were used to support the different stages of the process. TensorFlow and Keras were used to build and train Artificial Neural Network (ANN) models. Scikit-learn was used for data preprocessing and model evaluation through various metrics. NumPy and Pandas were used for efficient data manipulation and analysis. Matplotlib and Seaborn were used for results visualization. 
The DEAP library was used to implement genetic algorithms (GA), while SDV (Synthetic Data Vault) was used to generate synthetic data using GAN (Generative Adversarial Networks) models.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec7\" class=\"Section3\"\u003e\u003ch2\u003e3.1.3 Addressing Class Imbalance through Data Augmentation\u003c/h2\u003e\u003cp\u003eGiven the significant class imbalance, Conditional Tabular GAN (CTGAN) is implemented on the training set to generate synthetic samples for the minority class as shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e.\u003c/p\u003e\u003cp\u003eCTGAN is an extension of the classical GAN architecture for synthetic data generation, specifically designed for tabular data with a combination of categorical and numeric attributes. The model consists of two submodules: a generator that endeavours to create new synthetic examples that mimic the real data, and a discriminator that aims to distinguish between the real and generated examples [\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e].\u003c/p\u003e\u003cp\u003eCTGAN trains the generator conditionally, considering the distribution of categorical attributes. Initially, categorical features such as smoker, physical activity, sex, education, income, etc., are specified and presented to the model for processing during training. This is a unique feature of CTGAN, as it handles categorical features using a conditional approach that preserves their original distribution and avoids creating unrealistic combinations. First, the number of samples to be generated to equalize the classes is determined, and then synthetic samples are generated only for the minority class. The model is trained for 1000 epochs, during which it learns the distribution of the existing data. 
The synthetic data are selected and combined with the real data to create a new balanced data set.\u003c/p\u003e\u003cp\u003eTo assess the quality of the synthetic data generated by CTGAN, the visual PCA along the first two principal components was performed. In the PCA plot Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e (a), a significant overlap between the real (blue) and synthetic (orange) data points is observed, suggesting that CTGAN correctly captured the common structure of the original feature space, including the complex relationships between categorical and continuous attributes [\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e].\u003c/p\u003e\u003cp\u003eTo complement the visual analysis, the JSD between the real and synthetic data distributions for each numerical variable was also calculated. Most of the features yielded JSD scores below 0.1, indicating a high degree of similarity and confirming that the synthetic samples are statistically consistent with the real data. These findings validate the use of CTGAN-generated samples as reliable minority class complements for addressing class imbalance in subsequent classification tasks [\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e].\u003c/p\u003e\u003c/div\u003e\u003c/div\u003e\u003cdiv id=\"Sec8\" class=\"Section2\"\u003e\u003ch2\u003e3.2 Development of the HybGANN Model\u003c/h2\u003e\u003cp\u003eThe main novelty of this research lies in the development of the HybGANN model, which is trained on balanced data by GAN. It integrates GA into ANN optimization. GA optimizes ANN hyperparameters, such as learning rate, number of hidden layers, and activation functions, to improve model performance [\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e]. 
GAs are also used to determine the optimal initial connection weights for ANN, which facilitates better convergence and predictive accuracy [\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e].\u003c/p\u003e\u003cdiv id=\"Sec9\" class=\"Section3\"\u003e\u003ch2\u003e3.2.1 ANN Architecture\u003c/h2\u003e\u003cp\u003eThe predictive model is a forward-propagating ANN with the following structure:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003eInput layer matching the number of features.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eTwo hidden layers with 128 and 64 neurons, each followed by ReLU (Rectified Linear Unit) activation function.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eOutput layer with a single sigmoid neuron for binary classification.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003eRegularization techniques include:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003eDropout applied during final training to avoid overfitting.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eL2 regularization in dense layers.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eANN weights and hyperparameters are optimized using a GA described below.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec10\" class=\"Section3\"\u003e\u003ch2\u003e3.2.2 GA for joint hyperparameter and weight optimization\u003c/h2\u003e\u003cp\u003eA novelty of the HybGANN framework is the use of a hybrid GA to simultaneously develop both the training hyperparameters and the internal weights of the ANN. 
While previous studies typically apply GA to tune hyperparameters or perform feature selection, our approach uniquely encodes the full set of ANN weights along with the key training parameters on a single chromosome, a design rarely addressed in the current literature for medical or tabular datasets.\u003c/p\u003e\u003cp\u003eThis allows the GA to act not only as a black-box hyperparameter tuner but as a true evolutionary learner, optimizing the model from initialization to architecture-level learning, a novel contribution in the field of neuroevolution for imbalanced tabular data [\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e, \u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e, \u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e]. Each individual in the GA population is a composite chromosome encoding:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003eFlattened ANN weight vector (spanning all layers).\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eTensor shapes and sizes to reconstruct the original weight matrices.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eLearning rate \u0026isin; [1e-5, 1e-2] (sampled log-uniformly).\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eBatch size \u0026isin; {15, 16, 17, 32, 64}\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003eThis compact coding allows the seamless integration of low-level neural parameters and high-level training dynamics into a single evolutionary process.\u003c/p\u003e\u003cp\u003eAnother distinctive feature is the use of Lamarckian learning, in which each chromosome, after its fitness evaluation, is updated with the weights learned during its local training session [\u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e27\u003c/span\u003e]. 
This allows the learned representations to be inherited by offspring, accelerating convergence and improving the fitness of the population over time. The hybrid optimization strategy has several important implications for convergence and generalization [\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e, \u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e], as detailed below:\u003c/p\u003e\u003cp\u003eThis hybrid weight and hyperparameter GA:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003eExplores fine-grained local minima via weight mutation.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eAdjusts macro-level learning dynamics via learning rate and batch size evolution.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eMaintains generalization and adaptability across datasets without the need for manual adjustments.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003eEach chromosome is initialized with random weights and randomly sampled hyperparameters. The ANN architecture remains fixed. The fitness of a chromosome is evaluated as follows:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003eThe encoded weights are unfolded and assigned to the ANN.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eThe model is compiled using the learning rate of the chromosome and trained for a small number of local epochs on a stratified split of the training set.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003eValidation accuracy is used as the fitness measure. After evaluation, the updated model weights are flattened and stored back in the chromosome (a Lamarckian approach), allowing the learned representations to be passed on to the next generation. 
The genetic operators used in the GA are as follows:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003eSelection: Tournament selection is used to choose parents according to their fitness.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eCrossover: Combines learning rates of parents (geometric mean), mixes batch sizes and averages weights to create offspring.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eMutation: Randomly perturbs the learning rate or batch size, or injects Gaussian noise into the weights.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003eThe best-performing individuals (elites) are preserved across generations to avoid regression. The population iteratively evolves over G generations (default value: 50), progressively improving validation performance.\u003c/p\u003e\u003cp\u003eAfter evolution, the best-performing chromosome is selected. The corresponding ANN is reconstructed with:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003eEvolved weights\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eEvolved hyperparameters\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eAdditional dropout and L2 regularization\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003eThe model is trained with early stopping on the combined training and validation sets, and then evaluated on the retained test set.\u003c/p\u003e\u003c/div\u003e\u003c/div\u003e\u003cdiv id=\"Sec11\" class=\"Section2\"\u003e\u003ch2\u003e3.5. Model Evaluation\u003c/h2\u003e\u003cp\u003eThe evaluation of the model is based on well-known metrics: accuracy, recall, F1 score, AUC-ROC value and AUC-PR value. 
In addition to numerical scores, PR curves, ROC curves and prediction distributions are generated to visualize performance [\u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e25\u003c/span\u003e, \u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e26\u003c/span\u003e].\u003c/p\u003e\u003cp\u003eTo improve transparency, SHAP is used to analyze the contribution of each feature to the model predictions [\u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e23\u003c/span\u003e, \u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e24\u003c/span\u003e].\u003c/p\u003e\u003cp\u003eEvaluation also includes the following aspects:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003eBaseline model comparison: The performance of the proposed HybGANN model is compared to a baseline ANN trained on the same balanced dataset with CTGAN, but without GA optimization.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eConfusion matrix analysis: Confusion matrices are used to examine the distribution of true positives, false positives, true negatives, and false negatives, providing a detailed insight into the types of errors made by the model.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eEarly stopping and generalization monitoring: To prevent overfitting, early stopping criteria are applied based on validation loss during training.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eComparison of distribution between real and synthetic data: The quality of synthetic samples generated by CTGAN is evaluated using JSD, to verify whether the synthetic data closely resembles real samples of the minority class.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eImportance of PR AUC value: Given the imbalanced data, PR AUC value is given more weight than ROC AUC value in the final model ranking, as it better reflects the model's ability to identify the minority class 
correctly.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003c/div\u003e"},{"header":"4. Experimental Setup and Results","content":"\u003cp\u003e\u003cstrong\u003e4.1 Dataset Balancing with GANs\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe proposed model is developed and tested on a large-scale real-world diabetes dataset comprising 253,680 cases and 21 different features, with a significant disparity between the majority and minority classes that hinders the proper training of classification models. To address this problem, the CTGAN (Conditional Tabular GAN) model, which specializes in generating synthetic tabular data with interdependent features and heterogeneous distributions, was used. Visual analysis using PCA showed significant overlap between the real (blue) and synthetic (orange) data, especially in the denser areas of the principal components, suggesting that CTGAN effectively learned the underlying structure of the real data as shown in Fig. 2 (a).\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eIn addition, the wider distribution of synthetic points at the edges of PC1 and PC2 indicates that the model generated greater diversity, including underrepresented regions of the data space, a desirable feature for reducing overfitting and improving generalization of models such as ANNs.\u003c/p\u003e\n\u003cp\u003eQuantitative assessment was performed using the JSD metric. As presented in Fig. 2 (b), 17 of 21 numerical features resulted in JSD \u0026lt; 0.1, indicating high statistical similarity between the real and synthetic distributions. In particular, critical health indicators such as High Blood Pressure (HighBP), Body Mass Index (BMI), and History of Coronary Heart Disease or Heart Attack\u0026nbsp;(HeartDiseaseorAttack) were generated with high accuracy. 
However, three features, Age (JSD = 0.3523), MentHlth (JSD = 0.2613), and PhysHlth (JSD = 0.2294), showed high divergence, which may be a consequence of their complex and extreme distributions in the real dataset. These results suggest that, although CTGAN generated synthetic data in close agreement with the real data on most features, further improvements can be achieved by techniques such as conditional sampling or discretisation for the most divergent features.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e4.2 Baseline ANN Performance\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eTo set a reference point for classification performance without the aid of evolutionary optimization, a baseline ANN model was built using a fixed architecture with two hidden layers: 128 and 64 neurons, respectively. The activation function used was ReLU, while the output layer contained a single neuron with sigmoid activation for binary classification. The training process included early stopping with validation loss monitoring to avoid overfitting.\u003c/p\u003e\n\u003cp\u003eThe model was trained on a balanced dataset with CTGAN-generated synthetic data. The baseline ANN model achieved an\u0026nbsp;accuracy of 0.7804, a precision of\u0026nbsp;0.6809, a\u0026nbsp;recall of 0.5383, an\u0026nbsp;F1 score of 0.6012, an\u0026nbsp;ROC AUC value of 0.8294, and\u0026nbsp;a PR AUC value of 0.6877.\u003c/p\u003e\n\u003cp\u003eAs seen in Fig. 3 (a), results show moderate performance, with good discriminant ability (AUC ROC \u0026gt; 0.82) and modest recall\u0026nbsp;(0.5383), a metric that is crucial in medical tasks where the omission of positive cases must be minimized. As shown in Fig. 3 (b), the PR AUC value above 0.68 suggests that the model performs consistently even in the presence of class disparities. Fig. 3 (d) shows the confusion matrix of the reference ANN model. 
The true positives (TP = 7833) indicate that the model succeeds in identifying a significant number of samples with diabetes, but the false negatives (FN = 6719) and false positives (FP = 3671) remain high. As seen in Fig. 3 (c), the baseline ANN model shows significant overlap between the positive and negative classes, with many positive cases (Diabetes) assigned a probability \u0026le; 0.5, resulting in low recall and many false negatives.\u003c/p\u003e\n\u003cp\u003eTo explore the decision-making basis of the model and to understand the relative importance of the features, SHAP analysis was used. Fig. 4 shows the impact of each feature on model decisions for each patient:\u003c/p\u003e\n\u003cp\u003e- GenHlth was the most influential feature. High values increased the predicted probability of diabetes (positive SHAP), while lower values pushed the prediction toward the non-diabetic class.\u003c/p\u003e\n\u003cp\u003e- HighChol and HighBP also contributed significantly to skewing the model toward the positive class, especially when patients presented high values for these features. 
High BMI has a significant positive effect on the probability of diabetes, while high PhysActivity acts as a protective factor (reflected by negative SHAP values).\u003c/p\u003e\n\u003cp\u003eAge and Income show that older and low-income individuals are more likely to be classified as diabetic, which is consistent with the epidemiological literature.\u003c/p\u003e\n\u003cp\u003eOther factors such as HeartDiseaseorAttack, Stroke, DiffWalk and PhysHlth also increase the probability of being classified as diabetic, reflecting the coexistence of health conditions.\u003c/p\u003e\n\u003cp\u003eOn the other hand, features such as CholCheck, Smoker, HvyAlcoholConsump and NoDocbcCost had limited impact on the model\u0026apos;s decisions, ranking last on the importance list according to SHAP.\u003c/p\u003e\n\u003cp\u003eIn the analysis of the baseline ANN model, several features turned out to be the main contributors to the model\u0026apos;s decision-making. Specifically, features such as GenHlth, HighChol, HighBP, BMI, and Age had the greatest positive impact on the likelihood of the positive class (i.e., people with diabetes). For example, high values of GenHlth (implying poor general health) drove the model\u0026apos;s decision toward a diagnosis of diabetes, whereas low values had a protective effect. On the other hand, features such as PhysActivity and Income had the opposite impact on classification: higher levels of physical activity or higher income contributed to a decrease in the predicted probability of diabetes.\u003c/p\u003e\n\u003cp\u003eTo better understand the effect of data balancing with CTGAN, the JSD between the real and synthetic distributions was analyzed for several features and compared with each feature\u0026apos;s impact on the ANN model as measured by SHAP.\u003c/p\u003e\n\u003cp\u003e- Age (JSD = 0.3523). One of the features with the largest divergence between real and synthetic data. 
SHAP also confirmed a large impact on decision making, suggesting that CTGAN added useful variation to this feature and that the ANN model used this information to improve classification.\u003c/p\u003e\n\u003cp\u003e- PhysHlth (JSD = 0.2294). This feature also showed high agreement between JSD and SHAP, with high values contributing significantly to classification. This supports the idea that CTGAN provided useful data that improved class distinction.\u003c/p\u003e\n\u003cp\u003e- MentHlth (JSD = 0.2613). While this feature exhibited a high JSD, its SHAP impact was more moderate. This may indicate that although CTGAN created new variation in this feature, the model did not identify it as important for decision making, perhaps due to a lack of direct correlation with the class or to complex interactions with other features.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e4.3 Proposed Hybrid Model HybGANN\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eTo improve the performance of the classification model, we propose HybGANN, a novel hybrid approach. This approach preserves acquired skills and significantly improves convergence to the best solutions. Once the evolutionary process was completed, the best configuration found by the algorithm was used for the final training of the ANN model with the combined training and validation data. The model achieved an accuracy of 0.8319, a precision of 0.8960, a recall of 0.7510, an F1 score of 0.8171, a ROC AUC value of 0.9184, and a PR AUC value of 0.9268. The confusion matrix of the proposed HybGANN model, as shown in Fig. 5 (d), demonstrates a significant improvement in the classification of people with diabetes. The model achieves a better balance between recall and precision due to the use of a hybrid optimization approach combining CTGAN for data balancing and GA for the optimization of neural network weights and hyperparameters. 
Compared to the baseline ANN model, HybGANN significantly reduces the number of classification errors and increases the ability to detect the minority class.\u003c/p\u003e\n\u003cp\u003eIn all major evaluation metrics, HybGANN outperforms the baseline ANN model described in Section 4.2. The significant increase in recall and F1 score suggests that the proposed model is more effective in identifying diabetes-positive cases, while reducing minority class classification errors. The improvement in PR AUC value, as shown in Fig. 5 (b), confirms that HybGANN better manages the trade-off between precision and recall in a class-imbalanced context. As shown in Fig. 5 (a), the ROC AUC value improves over the baseline model (from 0.8294 to 0.9184), while the histograms of predicted probabilities in Fig. 5 (c) show a clearer separation between classes for HybGANN than for the baseline ANN model.\u003c/p\u003e\n\u003cp\u003eThese improvements can be attributed to the evolutionary optimization mechanism, which not only finds better configurations but also helps to avoid stalling in local minima and overfitting. The integration of the Lamarckian approach allows the model to retain the skills acquired during partial training, providing an effective combination of global exploration and local adaptation.\u003c/p\u003e\n\u003cp\u003eIn addition to its high performance, HybGANN also shows a high level of interpretability, as demonstrated by the SHAP analysis. The most influential feature turned out to be MentHlth, with a wide distribution of SHAP values, from approximately -0.2 to +0.8. This shows that a high number of MentHlth days strongly pushes the prediction toward the positive class, while few or no such days lower this probability. Next, GenHlth had a strong influence, especially when the general condition was poor. Clinical features such as BMI, HighChol and HighBP still contributed significantly, while Age ranked sixth in importance. 
Based on SHAP values, HybGANN incorporates average variations, especially for features such as MentHlth and GenHlth. For HighChol and HighBP, the impact is mainly positive, implying that the presence of these conditions significantly increases the probability of a positive diagnosis.\u003c/p\u003e\n\u003cp\u003eThe combined JSD-SHAP analysis shows that features whose distributions changed substantially under CTGAN (such as MentHlth and Age) also have high SHAP importance, suggesting that the synthetic data have enriched the model\u0026rsquo;s understanding of these features. In contrast, PhysHlth presented a high JSD but a low SHAP importance, suggesting that uninformative examples or outliers may have been added that the model does not use effectively. On the other hand, stable features such as GenHlth, BMI and HighChol maintain a high impact, although they were not greatly affected by the CTGAN synthetic data.\u003c/p\u003e"},{"header":"5. Discussion and interpretation","content":"\u003cdiv id=\"Sec17\" class=\"Section2\"\u003e\n \u003ch2\u003e5.1 Comparative Performance Analysis\u003c/h2\u003e\n \u003cp\u003eTo evaluate the impact of the proposed HybGANN architecture, a direct comparison was performed with the baseline ANN model, trained on the CTGAN-balanced dataset but without using GA. Table \u003cspan class=\"InternalRef\"\u003e1\u003c/span\u003e presents the final test metrics of both models. 
As can be noticed, HybGANN achieves improvements in all key performance metrics:\u003c/p\u003e\n \u003cp\u003e\u003c/p\u003e\n \u003ctable id=\"Tab1\" border=\"1\"\u003e\n \u003ccaption language=\"En\"\u003e\n \u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e\n \u003cdiv class=\"CaptionContent\"\u003e\n \u003cp\u003ePerformance Comparison Between Baseline ANN and HybGANN Models\u003c/p\u003e\n \u003c/div\u003e\n \u003c/caption\u003e\n \u003cthead\u003e\n \u003ctr\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eMetric\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eBaseline ANN\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eHybGANN\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eImprovement\u003c/p\u003e\n \u003c/th\u003e\n \u003c/tr\u003e\n \u003c/thead\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eAccuracy\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e0.7804\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e0.8319\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e+\u0026thinsp;0.0515\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003ePrecision\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e0.6809\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e0.8960\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e+\u0026thinsp;0.2151\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eRecall\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e0.5383\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e0.7510\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd 
align=\"char\"\u003e\n \u003cp\u003e+\u0026thinsp;0.2127\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eF1 Score\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e0.6012\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e0.8171\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e+\u0026thinsp;0.2159\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eROC AUC\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e0.8294\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e0.9184\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e+\u0026thinsp;0.0890\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003ePR AUC\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e0.6877\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e0.9268\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e+\u0026thinsp;0.2391\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n \u003c/table\u003e\n \u003cp\u003e\u003c/p\u003e\n \u003cp\u003eThe most significant improvements are seen in precision, recall, F1 score and PR AUC value, which are important metrics in class-imbalanced scenarios such as diabetes prediction, where the positive class is in the minority. 
In particular, the PR AUC increased from 0.6877 to 0.9268, a substantial change demonstrating that HybGANN produces markedly fewer false positives and false negatives and achieves a better balance between recall and precision.\u003c/p\u003e\n \u003cp\u003eIn addition, the ROC AUC improves by almost 9 percentage points, indicating a higher ability of the HybGANN model to distinguish between classes.\u003c/p\u003e\n \u003cp\u003eMoreover, the high F1 score of HybGANN (0.8171 compared to 0.6012 for the baseline ANN) highlights that this model does not sacrifice recall for precision but achieves a better balance between both objectives. This is especially important in medical applications, where accurate identification of positive cases is crucial. These results confirm that the use of Lamarckian genetic optimization, together with data balancing using GAN, significantly improves the classification capability of the model in real situations with imbalanced class distributions.\u003c/p\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec18\" class=\"Section2\"\u003e\n \u003ch2\u003e5.2 GA\u003c/h2\u003e\n \u003cp\u003eA key element of the HybGANN architecture is the evolutionary optimization of the ANN using a GA with a Lamarckian approach. This component contributes significantly to the improvement of the overall model performance, as demonstrated in the previous comparative analysis.\u003c/p\u003e\n \u003cp\u003eUnlike traditional ANN training approaches, which rely on gradient-based algorithms and local optimization, evolutionary optimization explores the space of possible configurations globally, avoiding stagnation at local minima and searching more efficiently for good model configurations [\u003cspan class=\"CitationRef\"\u003e3\u003c/span\u003e]. 
This process encodes the network structure, initial weights, learning rate and batch size, which then evolve through selection and mutation mechanisms.\u003c/p\u003e\n \u003cp\u003eIn particular, the selection of parameters such as:\u003c/p\u003e\n \u003cul\u003e\n \u003cli\u003e\n \u003cp\u003ea tailored learning rate, which is better adapted to the data structure and increases the stability of the training,\u003c/p\u003e\n \u003c/li\u003e\n \u003cli\u003e\n \u003cp\u003ean optimal batch size, which affects the convergence and fluctuations of the loss function,\u003c/p\u003e\n \u003c/li\u003e\n \u003cli\u003e\n \u003cp\u003eoptimized initial weights instead of random initializations,\u003c/p\u003e\n \u003c/li\u003e\n \u003c/ul\u003e\n \u003cp\u003ehas a direct and significant impact on training quality and final network performance.\u003c/p\u003e\n \u003cp\u003eIn our approach, the use of Lamarckism offers an additional advantage: during evolution, individuals are partially trained and their improved weights are incorporated into the following generations. 
This form of acquired inheritance speeds up convergence to superior solutions, effectively combining global exploration with local improvement.\u003c/p\u003e\n \u003cp\u003eThe contribution of this mechanism is reflected in the experimental results:\u003c/p\u003e\n \u003cul\u003e\n \u003cli\u003e\n \u003cp\u003eRecall increased significantly (from 0.5383 to 0.7510), improving sensitivity in minority class classification.\u003c/p\u003e\n \u003c/li\u003e\n \u003cli\u003e\n \u003cp\u003eThe PR AUC value increased from 0.6877 to 0.9268, demonstrating better management of the balance between precision and recall.\u003c/p\u003e\n \u003c/li\u003e\n \u003cli\u003e\n \u003cp\u003eThe F1 score improved from 0.6012 to 0.8171, reflecting a better balance between type I and type II errors.\u003c/p\u003e\n \u003c/li\u003e\n \u003c/ul\u003e\n \u003cp\u003eBesides the increased performance, another advantage is the reduction of manual intervention. Instead of determining optimal configurations through repeated and costly hyperparameter tuning trials, the algorithm automatically evolves the best solutions, which increases the overall efficiency of model development.\u003c/p\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec19\" class=\"Section2\"\u003e\n \u003ch2\u003e5.3 Interpretability with SHAP\u003c/h2\u003e\n \u003cp\u003eTo evaluate how our model classifies and which features have the greatest impact on classes, the SHAP method was applied to both the baseline ANN and the HybGANN framework. 
This analysis enables us to understand which features contribute most to the classification of an instance as diabetic and in which direction (positive or negative).\u003c/p\u003e\n \u003cdiv id=\"Sec20\" class=\"Section3\"\u003e\n \u003ch2\u003e5.3.1 Comparison of Feature Importance\u003c/h2\u003e\n \u003cp\u003eAs can be seen in Table \u003cspan class=\"InternalRef\"\u003e2\u003c/span\u003e, the HybGANN model not only retains traditional features (such as HighBP and HighChol), but also identifies new high-impact features, such as MentHlth and PhysHlth, which the baseline ANN model made virtually no use of. This suggests that CTGAN has helped enrich the data with cases with more diverse values, while GA has increased the sensitivity of the model to these features.\u003c/p\u003e\n \u003cp\u003e\u003c/p\u003e\n \u003ctable id=\"Tab2\" border=\"1\"\u003e\n \u003ccaption language=\"En\"\u003e\n \u003cdiv class=\"CaptionNumber\"\u003eTable 2\u003c/div\u003e\n \u003cdiv class=\"CaptionContent\"\u003e\n \u003cp\u003eFeatures with the highest impact according to SHAP\u003c/p\u003e\n \u003c/div\u003e\n \u003c/caption\u003e\n \u003cthead\u003e\n \u003ctr\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eFeature\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eHybGANN (SHAP)\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eBaseline ANN (SHAP)\u003c/p\u003e\n \u003c/th\u003e\n \u003c/tr\u003e\n \u003c/thead\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eMentHlth\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e~\u0026thinsp;0\u0026ndash;0.8\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e~\u0026ndash;0.05\u0026ndash;0.05\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eGenHlth\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n 
\u003cp\u003e~\u0026thinsp;0.4\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e~\u0026thinsp;0.2\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eBMI\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e~\u0026thinsp;0\u0026ndash;0.3\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e~\u0026ndash;0.05\u0026ndash;0.1\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eHighChol\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e~\u0026thinsp;0.25\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e~\u0026thinsp;0.15\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eHighBP\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e~\u0026thinsp;0.18\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e~\u0026thinsp;0.10\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eAge\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e~\u0026thinsp;0.20\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e~\u0026thinsp;0.05\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n \u003c/table\u003e\n \u003cp\u003e\u003c/p\u003e\n \u003c/div\u003e\n \u003cdiv id=\"Sec21\" class=\"Section3\"\u003e\n \u003ch2\u003e5.3.2 Analysis of features with high JSD and their influence on model decision-making using SHAP\u003c/h2\u003e\n \u003cp\u003eTo better understand how the synthetic data generated by CTGAN may differ from the real data, the JSD was calculated for each feature. Features with high JSD values indicate larger discrepancies in the distribution of the real and synthetic data. 
These differences can affect model behavior and classification cut-off points, especially in sensitive domains such as medical diagnostics.\u003c/p\u003e\n \u003cp\u003eIn this subsection, we focus on the features with the greatest divergence and examine their impact on the model decision-making process using SHAP. By combining statistical analysis of divergence with model interpretability techniques, we seek to better understand how these features contribute to predictions and whether their synthetic representation introduces any bias or modifies their significance.\u003c/p\u003e\n \u003cp\u003eA combined analysis was performed using:\u003c/p\u003e\n \u003cul\u003e\n \u003cli\u003e\n \u003cp\u003eJSD: to measure the degree of change in the distribution of each feature between the real and synthetic data generated by CTGAN.\u003c/p\u003e\n \u003c/li\u003e\n \u003cli\u003e\n \u003cp\u003eSHAP: to assess the impact of each feature on the decision-making process of the model. The SHAP values of the proposed model were compared with those of the baseline ANN.\u003c/p\u003e\n \u003c/li\u003e\n \u003c/ul\u003e\n \u003cp\u003eThis combination seeks to detect features that were not important in the baseline model, but which, after optimization with GA, have acquired a decisive weight in the final model.\u003c/p\u003e\n \u003cp\u003eTable \u003cspan class=\"InternalRef\"\u003e3\u003c/span\u003e presents three features with high JSD values, comparing them in terms of their SHAP impact on the HybGANN and baseline ANN models. 
The interpretations help us understand how synthetic balancing has contributed to strengthening the importance of these features:\u003c/p\u003e\n \u003cp\u003e\u003c/p\u003e\n \u003ctable id=\"Tab3\" border=\"1\" class=\"fr-table-selection-hover\"\u003e\n \u003ccaption language=\"En\"\u003e\n \u003cdiv class=\"CaptionNumber\"\u003eTable 3\u003c/div\u003e\n \u003cdiv class=\"CaptionContent\"\u003e\n \u003cp\u003eFeatures with highest JSD and their SHAP importance across models\u003c/p\u003e\n \u003c/div\u003e\n \u003c/caption\u003e\n \u003cthead\u003e\n \u003ctr\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eFeature\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eJSD\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eHybGANN (SHAP)\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eBaseline ANN (SHAP)\u003c/p\u003e\n \u003c/th\u003e\n \u003c/tr\u003e\n \u003c/thead\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eMentHlth\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e0.2613\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e(~\u0026thinsp;0.8) very important\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e(~\u0026thinsp;0.05) unused\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003ePhysHlth\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n \u003cp\u003e0.2294\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e(~\u0026thinsp;0.10) medium\u0026ndash;high\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e(~\u0026thinsp;0.2) low\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eAge\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"char\"\u003e\n 
\u003cp\u003e0.3523\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e(~\u0026thinsp;0.20) high\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e(~\u0026thinsp;0.5) medium\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n \u003c/table\u003e\n \u003cp\u003e\u003c/p\u003e\n \u003cp\u003eIn all of the above cases, features with a high JSD not only became more frequent in the generated data, but also translated into features with a real impact on HybGANN model predictions. This suggests that CTGAN has helped to incorporate useful new information, while the GA has learned to use it to improve decision-making.\u003c/p\u003e\n \u003cp\u003eThis result suggests that a high JSD value accompanied by an increase in SHAP importance for a feature is a positive signal, pointing to a direct benefit in model performance beyond a mere increase in the feature\u0026apos;s numerical frequency.\u003c/p\u003e\n \u003cp\u003eAnalysis with SHAP confirms that the proposed model not only improves classification performance, but also increases model interpretability, making HybGANN a more tractable and reliable model, especially in scenarios where interpretability is essential, such as medical diagnostics.\u003c/p\u003e\n \u003c/div\u003e\n \u003cdiv id=\"Sec22\" class=\"Section3\"\u003e\n \u003ch2\u003e5.3.3 Limitations and potential improvements\u003c/h2\u003e\n \u003cp\u003eThe proposed model offered improved performance and detailed interpretation of the results, but several limitations remain to be addressed in future work. Testing the model in an actual clinical application requires rigorous validation with new datasets from different sources, or even different populations. Also, the performance of the model is closely related to the quality of the synthetic data generated by CTGAN. As future work, other GANs for tabular data will be integrated. 
While the Lamarckian GA hybrid offers advanced optimization, it is also costly in terms of time; Bayesian optimization could provide a balance between efficiency and performance.\u003c/p\u003e\n \u003c/div\u003e\n\u003c/div\u003e"},{"header":"6. Conclusion","content":"\u003cp\u003eThis paper proposes a novel hybrid model for diabetes prediction, which integrates advanced AI methods in a way that, to the best of our knowledge, has not been explored previously in the literature. The proposed model, called HybGANN, combines synthetic data generation with CTGAN for class balancing and a hybrid GA that optimizes the initial weights and tunes the hyperparameters of an ANN. This approach seeks to address the difficulties that arise when managing imbalanced classes and to increase the accuracy of prediction in medical applications.\u003c/p\u003e\u003cp\u003eExperiments have shown that the HybGANN model outperforms, in all key classification metrics, a baseline ANN that uses the same dataset pre-balanced by CTGAN but does not have an integrated GA. The use of SHAP to analyze the decision-making of the model has shown that this model is not only more accurate, but also more interpretable, giving greater importance to features whose distributions were altered by synthetic balancing. Features with high JSD, such as MentHlth, PhysHlth, and Age, have shown a significant increase in their importance in the model\u0026rsquo;s decision-making, reflecting the positive impact of the data generated by CTGAN.\u003c/p\u003e\u003cp\u003eThis combination of improved data quality and an optimized learning mechanism marks a step forward in building smarter systems for medical diagnostics. 
The results highlight the importance of a comprehensive approach to produce models that are not only more efficient, but also more reliable for practical use.\u003c/p\u003e\u003cp\u003eThis paper also prepares the ground for future research that could extend this approach to other medical domains or explore other variants of GANs or evolutionary techniques.\u003c/p\u003e\u003cp\u003eThis comprehensive analysis shows that HybGANN not only improves performance compared to traditional models, but also manages to identify new factors of diagnostic importance.\u003c/p\u003e"},{"header":"Abbreviations","content":"\u003ctable border=\"0\" cellspacing=\"0\" cellpadding=\"0\" width=\"601\"\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 198px;\"\u003e\n \u003cp\u003eANN\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 403px;\"\u003e\n \u003cp\u003eArtificial Neural Network\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\" style=\"width: 198px;\"\u003e\n \u003cp\u003eAnyHealthcare\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 403px;\"\u003e\n \u003cp\u003eHas Any Form of Healthcare Coverage\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\" style=\"width: 198px;\"\u003e\n \u003cp\u003eBMI\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 403px;\"\u003e\n \u003cp\u003eBody Mass Index\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\" style=\"width: 198px;\"\u003e\n \u003cp\u003eCholCheck\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 403px;\"\u003e\n \u003cp\u003eCholesterol Check (had cholesterol checked recently)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 198px;\"\u003e\n \u003cp\u003eCTGAN\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" 
style=\"width: 403px;\"\u003e\n \u003cp\u003eConditional Tabular Generative Adversarial Network\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\" style=\"width: 198px;\"\u003e\n \u003cp\u003eDiffWalk\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 403px;\"\u003e\n \u003cp\u003eDifficulty Walking\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\" style=\"width: 198px;\"\u003e\n \u003cp\u003eEducation\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 403px;\"\u003e\n \u003cp\u003eEducation Level\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\" style=\"width: 198px;\"\u003e\n \u003cp\u003eFruits\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 403px;\"\u003e\n \u003cp\u003eFruit Consumption\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 198px;\"\u003e\n \u003cp\u003eGA\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 403px;\"\u003e\n \u003cp\u003eGenetic Algorithm\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 198px;\"\u003e\n \u003cp\u003eGAN\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 403px;\"\u003e\n \u003cp\u003eGenerative Adversarial Network\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\" style=\"width: 198px;\"\u003e\n \u003cp\u003eGenHlth\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 403px;\"\u003e\n \u003cp\u003eGeneral Health\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\" style=\"width: 198px;\"\u003e\n \u003cp\u003eHeartDiseaseorAttack\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 403px;\"\u003e\n \u003cp\u003eHistory of Coronary Heart 
Disease or Heart Attack\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\" style=\"width: 198px;\"\u003e\n \u003cp\u003eHighBP\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 403px;\"\u003e\n \u003cp\u003eHigh Blood Pressure\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\" style=\"width: 198px;\"\u003e\n \u003cp\u003eHighChol\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 403px;\"\u003e\n \u003cp\u003eHigh Cholesterol\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\" style=\"width: 198px;\"\u003e\n \u003cp\u003eHvyAlcoholConsump\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 403px;\"\u003e\n \u003cp\u003eHeavy Alcohol Consumption\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 198px;\"\u003e\n \u003cp\u003eHybGANN\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 403px;\"\u003e\n \u003cp\u003eHybrid GA with ANN using GAN\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 198px;\"\u003e\n \u003cp\u003eJSD\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 403px;\"\u003e\n \u003cp\u003eJensen-Shannon Divergence\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\" style=\"width: 198px;\"\u003e\n \u003cp\u003eMentHlth\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 403px;\"\u003e\n \u003cp\u003eMental Health\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\" style=\"width: 198px;\"\u003e\n \u003cp\u003eNoDocbcCost\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 403px;\"\u003e\n \u003cp\u003eNo Doctor Because of Cost\u003c/p\u003e\n \u003c/td\u003e\n 
\u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 198px;\"\u003e\n \u003cp\u003ePCA\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 403px;\"\u003e\n \u003cp\u003ePrincipal Components Analysis\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\" style=\"width: 198px;\"\u003e\n \u003cp\u003ePhysActivity\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 403px;\"\u003e\n \u003cp\u003ePhysical Activity\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\" style=\"width: 198px;\"\u003e\n \u003cp\u003ePhysHlth\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 403px;\"\u003e\n \u003cp\u003ePhysical Health\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 198px;\"\u003e\n \u003cp\u003ePR AUC\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 403px;\"\u003e\n \u003cp\u003ePrecision-Recall Area Under the Curve\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 198px;\"\u003e\n \u003cp\u003eROC AUC\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 403px;\"\u003e\n \u003cp\u003eReceiver Operating Characteristic \u0026ndash; Area Under the Curve\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd style=\"width: 198px;\"\u003e\n \u003cp\u003eSHAP\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 403px;\"\u003e\n \u003cp\u003eSHapley Additive exPlanations\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\" style=\"width: 198px;\"\u003e\n \u003cp\u003eSmoker\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 403px;\"\u003e\n \u003cp\u003eSmoking Status\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\" 
style=\"width: 198px;\"\u003e\n \u003cp\u003eStroke\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 403px;\"\u003e\n \u003cp\u003eHistory of Stroke\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\" style=\"width: 198px;\"\u003e\n \u003cp\u003eVeggies\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 403px;\"\u003e\n \u003cp\u003eVegetable Consumption\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n\u003c/table\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eEthics approval and consent to participate\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eNot applicable.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eConsent for publication\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eNot applicable.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAvailability of data and materials\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe datasets generated and/or analysed during the current study are available in the UC Irvine Machine Learning Repository, CDC Diabetes Health Indicators. \u0026nbsp;https://www.kaggle.com/datasets/alexteboul/diabetes-health-indicators-dataset\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCompeting interests\u0026nbsp;\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe authors declare that they have no competing interests.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eFunding\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eNot applicable.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAuthor contributions\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eAll authors contributed equally to the conceptualization, methodology, analysis, and writing of the manuscript. 
All authors have read and approved the final version.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAcknowledgements\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eNot applicable.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\n \u003cli\u003eSalmi, Mabrouka, Dalia Atif, Diego Oliva, Ajith Abraham, and Sebastian Ventura. \u0026quot;Handling imbalanced medical datasets: review of a decade of research.\u0026quot; Artificial Intelligence Review 57, no. 10 (2024): 273.\u003c/li\u003e\n \u003cli\u003eRamesh, Jayroop, Raafat Aburukba, and Assim Sagahyroon. \u0026quot;A remote healthcare monitoring framework for diabetes prediction using machine learning.\u0026quot; Healthcare Technology Letters 8, no. 3 (2021): 45-57.\u003c/li\u003e\n \u003cli\u003eSampson, Jeffrey R. \u0026quot;Adaptation in natural and artificial systems (John H. Holland).\u0026quot; (1976): 529.\u0026nbsp;\u003c/li\u003e\n \u003cli\u003eBergstra, James, and Yoshua Bengio. \u0026quot;Random search for hyper-parameter optimization.\u0026quot; The journal of machine learning research 13, no. 1 (2012): 281-305.\u0026nbsp;\u003c/li\u003e\n \u003cli\u003eRajendra, Priyanka, and Shahram Latifi. \u0026quot;Prediction of diabetes using logistic regression and ensemble techniques.\u0026quot; Computer Methods and Programs in Biomedicine Update 1 (2021): 100032.\u003c/li\u003e\n \u003cli\u003eAzad, Chandrashekhar, Bharat Bhushan, Rohit Sharma, Achyut Shankar, Krishna Kant Singh, and Aditya Khamparia. \u0026quot;Prediction model using SMOTE, genetic algorithm and decision tree (PMSGD) for classification of diabetes mellitus.\u0026quot; Multimedia Systems (2022): 1-19.\u0026nbsp;\u003c/li\u003e\n \u003cli\u003eChawla, Nitesh V., Kevin W. Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer. \u0026quot;SMOTE: synthetic minority over-sampling technique.\u0026quot; Journal of artificial intelligence research 16 (2002): 321-357.\u003c/li\u003e\n \u003cli\u003eFigueira, Alvaro, and Bruno Vaz. 
\u0026quot;Survey on synthetic data generation, evaluation methods and GANs.\u0026quot; Mathematics 10, no. 15 (2022): 2733.\u0026nbsp;\u003c/li\u003e\n \u003cli\u003eSauber-Cole, Rick, and Taghi M. Khoshgoftaar. \u0026quot;The use of generative adversarial networks to alleviate class imbalance in tabular data: a survey.\u0026quot; Journal of Big Data 9, no. 1 (2022): 98.\u003c/li\u003e\n \u003cli\u003eBourou, Stavroula, Andreas El Saer, Terpsichori-Helen Velivassaki, Artemis Voulkidis, and Theodore Zahariadis. \u0026quot;A review of tabular data synthesis using GANs on an IDS dataset.\u0026quot; Information 12, no. 9 (2021): 375.\u003c/li\u003e\n \u003cli\u003eBorisov, Vadim, Tobias Leemann, Kathrin Se\u0026szlig;ler, Johannes Haug, Martin Pawelczyk, and Gjergji Kasneci. \u0026quot;Deep neural networks and tabular data: A survey.\u0026quot; IEEE transactions on neural networks and learning systems (2022).\u003c/li\u003e\n \u003cli\u003eXu, Lei, M. Skoularidou, A. Cuesta-Infante, and K. Veeramachaneni. \u0026quot;Modeling tabular data using conditional GAN.\u0026quot; arXiv preprint arXiv:1907.00503 (2019).\u003c/li\u003e\n \u003cli\u003eGulrajani, Ishaan, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C. Courville. \u0026quot;Improved training of Wasserstein GANs.\u0026quot; Advances in neural information processing systems 30 (2017).\u003c/li\u003e\n \u003cli\u003eRajabi, Amirarsalan, and Ozlem Ozmen Garibay. \u0026quot;TabFairGAN: Fair tabular data generation with generative adversarial networks.\u0026quot; Machine Learning and Knowledge Extraction 4, no. 2 (2022): 488-501.\u003c/li\u003e\n \u003cli\u003eIranmanesh, Seyed Mehdi, and Nasser M. Nasrabadi. \u0026quot;HGAN: Hybrid generative adversarial network.\u0026quot; Journal of Intelligent \u0026amp; Fuzzy Systems 40, no. 5 (2021): 8927-8938.\u003c/li\u003e\n \u003cli\u003eGayathri, R. G., Atul Sajjanhar, and Yong Xiang. 
\u0026quot;Hybrid deep learning model using spcagan augmentation for insider threat analysis.\u0026quot; Expert Systems with Applications 249 (2024): 123533.\u003c/li\u003e\n \u003cli\u003eBrock, Andrew, Jeff Donahue, and Karen Simonyan. \u0026quot;Large scale GAN training for high fidelity natural image synthesis.\u0026quot; arXiv preprint arXiv:1809.11096 (2018).\u0026nbsp;\u003c/li\u003e\n \u003cli\u003eT\u0026uuml;may Ateş, K\u0026uuml;bra, İbrahim Erdem Kalkan, and Cenk Şahin. \u0026quot;Training Artificial Neural Network with a Cultural Algorithm.\u0026quot; Neural Processing Letters 56, no. 5 (2024): 225.\u003c/li\u003e\n \u003cli\u003eArroyo, Jan Carlo T., and Allemar Jhone P. Delima. \u0026quot;An optimized neural network using genetic algorithm for cardiovascular disease prediction.\u0026quot; Journal of Advances in Information Technology 13, no. 1 (2022).\u003c/li\u003e\n \u003cli\u003eFelizardo, Virginie, Nuno M. Garcia, Nuno Pombo, and Imen Megdiche. \u0026quot;Data-based algorithms and models using diabetics real data for blood glucose and hypoglycaemia prediction-a systematic literature review.\u0026quot; Artificial Intelligence in Medicine 118 (2021): 102120.\u003c/li\u003e\n \u003cli\u003eJolliffe, Ian. \u0026quot;Principal component analysis.\u0026quot; In International encyclopedia of statistical science, pp. 1094-1096. Springer, Berlin, Heidelberg, 2011.\u0026nbsp;\u003c/li\u003e\n \u003cli\u003eJ. Lin, \u0026quot;Divergence measures based on the Shannon entropy,\u0026quot; in IEEE Transactions on Information Theory, vol. 37, no. 1, pp. 145-151, Jan. 1991, doi: 10.1109/18.61115.\u003c/li\u003e\n \u003cli\u003eLundberg, Scott M., and Su-In Lee. \u0026quot;A unified approach to interpreting model predictions.\u0026quot; Advances in neural information processing systems 30 (2017).\u003c/li\u003e\n \u003cli\u003eLundberg, Scott M., Gabriel Erion, Hugh Chen, Alex DeGrave, Jordan M. 
Prutkin, Bala Nair, Ronit Katz, Jonathan Himmelfarb, Nisha Bansal, and Su-In Lee. \u0026quot;From local explanations to global understanding with explainable AI for trees.\u0026quot; Nature machine intelligence 2, no. 1 (2020): 56-67.\u003c/li\u003e\n \u003cli\u003eDavis, Jesse, and Mark Goadrich. \u0026quot;The relationship between Precision-Recall and ROC curves.\u0026quot; In Proceedings of the 23rd international conference on Machine learning, pp. 233-240. 2006.\u003c/li\u003e\n \u003cli\u003eFawcett, Tom. \u0026quot;An introduction to ROC analysis.\u0026quot; Pattern recognition letters 27, no. 8 (2006): 861-874.\u003c/li\u003e\n \u003cli\u003eYao, Xin. \u0026quot;Evolving artificial neural networks.\u0026quot; Proceedings of the IEEE 87, no. 9 (1999): 1423-1447.\u003c/li\u003e\n \u003cli\u003eSingh, Vijendra, Vijayan K. Asari, and Rajkumar Rajasekaran. \u0026quot;A deep neural network for early detection and prediction of chronic kidney disease.\u0026quot; Diagnostics 12, no. 1 (2022): 116.\u003c/li\u003e\n \u003cli\u003eFeng, Xin, Yihuai Cai, and Ruihao Xin. \u0026quot;Optimizing diabetes classification with a machine learning-based framework.\u0026quot; BMC bioinformatics 24, no. 1 (2023): 428.\u003c/li\u003e\n \u003cli\u003eJaiswal, Sushma, and Priyanka Gupta. \u0026quot;GLSTM: a novel approach for prediction of real \u0026amp; synthetic PID diabetes data using GANs and LSTM classification model.\u0026quot; Int J Exp Res Rev 30 (2023): 32-45.\u003c/li\u003e\n \u003cli\u003eMishra, Sushruta, Hrudaya Kumar Tripathy, Pradeep Kumar Mallick, Akash Kumar Bhoi, and Paolo Barsocchi. \u0026quot;EAGA-MLP-an enhanced and adaptive hybrid classification model for diabetes diagnosis.\u0026quot; Sensors 20, no. 14 (2020): 4036.\u003c/li\u003e\n \u003cli\u003ePekel \u0026Ouml;zmen, Ebru, and Tuncay \u0026Ouml;zcan. 
\u0026quot;Diagnosis of diabetes mellitus using artificial neural network and classification and regression tree optimized with genetic algorithm.\u0026quot; Journal of Forecasting 39, no. 4 (2020): 661-670.\u003c/li\u003e\n \u003cli\u003eVu, Ly, and Quang Uy Nguyen. \u0026quot;Handling imbalanced data in intrusion detection systems using generative adversarial networks.\u0026quot; Journal of Research and Development on Information and Communication Technology 2020, no. 1 (2020): 1-13.\u003c/li\u003e\n \u003cli\u003eXiao, Yawen, Jun Wu, and Zongli Lin. \u0026quot;Cancer diagnosis using generative adversarial networks based on deep learning from imbalanced data.\u0026quot; Computers in Biology and Medicine 135 (2021): 104540.\u0026nbsp;\u003c/li\u003e\n \u003cli\u003eTeboul, Alex. 2021. Diabetes Health Indicators Dataset. Kaggle. https://www.kaggle.com/datasets/alexteboul/diabetes-health-indicators-dataset\u003c/li\u003e\n\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":true,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"
[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"CTGAN, Lamarckian GA, ANN, SHAP, Medical data mining","lastPublishedDoi":"10.21203/rs.3.rs-7300855/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-7300855/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eThe digitization of medical data has enabled large-scale analysis. However, clinical datasets, such as those used for diabetes prediction, often have class imbalances, with disease cases significantly underrepresented. This imbalance poses a major challenge for traditional machine learning models, which tend to favor the majority classes. In addition, many high-performance models operate as black boxes, limiting their adoption in clinical practice due to their lack of interpretability. In this paper, we present HybGANN, a novel hybrid framework that integrates Conditional Tabular Generative Conditional Networks (CTGAN) for synthetic minority data generation, a unique hybrid genetic algorithm (GA) that co-evolves hyperparameters and internal weights from artificial neural networks (ANNs) in a Lamarckian fashion, and SHapley Additive Explanations (SHAP) for post-hoc model interpretability. In contrast to previous work, to the best of our knowledge, this is the first application of a Lamarckian GA for the optimization of node weights and hyperparameters in tabular medical data classification. HybGANN creates a semi-automated workflow that improves predictive performance while providing transparency and adaptability. 
Applied to a large-scale diabetes dataset, experiments have demonstrated that the HybGANN model outperforms a benchmark ANN network that also uses the same CTGAN pre-balanced dataset on all key classification metrics. The framework achieves a ROC-AUC value of 0.9184 and a PR-AUC value of 0.9268, demonstrating its effectiveness and potential as a reliable AI solution for clinical decision support in imbalanced medical fields.\u003c/p\u003e","manuscriptTitle":"HybGANN: A Hybrid GAN-GA-ANN Framework for Predicting Diabetes from Imbalanced Medical Data","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2025-09-22 10:05:53","doi":"10.21203/rs.3.rs-7300855/v1","editorialEvents":[{"type":"communityComments","content":0}],"status":"published","journal":{"display":true,"email":"
[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"85b177e1-eb41-4d31-a235-f9a143f96508","owner":[],"postedDate":"September 22nd, 2025","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"posted","subjectAreas":[],"tags":[],"updatedAt":"2025-11-10T11:41:35+00:00","versionOfRecord":[],"versionCreatedAt":"2025-09-22 10:05:53","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-7300855","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-7300855","identity":"rs-7300855","version":["v1"]},"buildId":"8U1c8b4HqxoKbykW_rLl7","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}