Comparative Study of Machine Learning Techniques for Diabetes Forecasting

doi:10.21203/rs.3.rs-7145782/v1

2025 · doi:10.21203/rs.3.rs-7145782/v1

preprint OA: closed

Research Article · Research Square preprint, Version 1

Abdul Aamir Khan, Dr. Bk Sharma

This is a preprint; it has not been peer reviewed by a journal.

https://doi.org/10.21203/rs.3.rs-7145782/v1

This work is licensed under a CC BY 4.0 License.

Abstract

The rising global prevalence of diabetes has intensified the need for accurate and early diagnostic systems. As a significant global health concern, diabetes requires effective and precise prediction techniques. This study reviews research that utilizes clinical data and machine learning (ML) approaches for diabetes prediction. Common pre-processing steps include categorical data encoding, handling missing values, and normalization. To enhance model performance, dimensionality reduction techniques such as Principal Component Analysis (PCA) and feature selection are employed.
Performance metrics such as accuracy, precision, recall, F1-score, and AUC-ROC are used to evaluate and compare various supervised learning algorithms, including Random Forest, Support Vector Machines (SVM), k-Nearest Neighbors (k-NN), Logistic Regression, and Decision Trees. Many studies use small datasets, which limits generalizability despite reporting high accuracy. This study underscores the need for diverse datasets and clinically interpretable models, while also highlighting gaps in model interpretability and validation practices.

Keywords: Medical Genetics, Clinical Data, Diabetes Prediction, Healthcare, Logistic Regression, Machine Learning

Introduction

Diabetes mellitus is a chronic metabolic disorder characterized by abnormally high blood glucose levels, which, if left unmanaged, can lead to severe complications. According to the World Health Organization (WHO), approximately 422 million people worldwide were living with diabetes in 2014, highlighting a growing public health concern. This rising prevalence underscores the urgent need for effective prediction and management strategies. Traditional diagnosis and management rely heavily on clinical assessments and laboratory tests, which can be time-consuming and sometimes cumbersome.

Recent advances in machine learning (ML) have introduced innovative methods for healthcare data analysis and predictive modeling. By analyzing large datasets, machine learning algorithms can identify complex patterns and generate highly accurate predictions. This capability is particularly valuable for early detection and intervention in diabetes. Common datasets used in diabetes prediction research include the Pima Indians Diabetes Database and the UCI Diabetes Dataset. In recent years, many researchers have applied a variety of machine learning techniques to diabetes prediction, including logistic regression, decision trees, random forests, support vector machines, and neural networks.
Studies utilizing the Pima Indians Diabetes Database have reported accuracy rates above 85% for models such as Random Forests and Support Vector Machines. Moreover, feature selection methods like recursive feature elimination and correlation analysis have been employed to improve model performance by identifying the most predictive features. Despite promising results, variations in methodologies, evaluation metrics, and datasets across studies make direct comparisons challenging.

The application of machine learning for diabetes prediction remains an evolving field, with emerging computational approaches aimed at improving the accuracy and effectiveness of risk assessment. With over 400 million people affected worldwide and a continuing rise in incidence, there is a critical need for novel approaches that can identify diabetes before clinical symptoms manifest. Researchers are focusing on combining machine learning algorithms with large-scale datasets to uncover key risk factors and enhance early detection techniques, ultimately improving patient outcomes and healthcare management.

Advanced machine learning methods such as ensemble learning, deep learning, and semi-supervised learning are increasingly employed in this domain. Semi-supervised techniques in particular leverage both labeled and unlabeled data to build robust, adaptable models tailored to individual patient characteristics. Key aspects of this approach include feature selection, handling data imbalances, and selecting appropriate evaluation metrics to ensure model reliability and effectiveness. Additionally, integrating lifestyle and demographic data provides a more comprehensive understanding of diabetes risk.

Despite its potential, the field faces significant challenges, particularly regarding data privacy and model interpretability. The reliance on sensitive patient data raises security and ethical concerns, prompting the adoption of privacy-preserving technologies such as federated learning and blockchain.
Furthermore, making machine learning models interpretable is essential to ensure their acceptance and effective use by healthcare professionals in clinical settings. Going forward, advancing diabetes prediction requires interdisciplinary collaboration and the integration of cutting-edge technologies. Ongoing research aims to improve model applicability, interpretability, and accuracy across diverse populations, thereby enhancing diabetes prevention and care. Collaboration among scientists, clinicians, and technology experts is critical to developing solutions that address both the complexities of diabetes and the evolving landscape of machine learning in healthcare. As illustrated in Fig. 1, diabetes prediction faces several challenges, including data imbalance, interpretability, and model generalizability.

Literature Review

The study “Predictive Modelling for Diabetes Using Machine Learning” (2024) by Shriya Aishani Rachakonda et al. explores the use of machine learning algorithms to forecast diabetes development based on diagnostic criteria. The work employs supervised classification methods including Random Forest, k-Nearest Neighbors (k-NN), Decision Trees, Support Vector Machines (SVM), and Logistic Regression. Among these, the Random Forest algorithm performed best, achieving 84% accuracy, 83% precision, 78% recall, an 80% F1-score, and an AUC of 0.86. The model was trained and tested using clinical data from the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK) dataset [9].

In “Diabetes Prediction System Using Machine Learning” (2023), Kundan Kumar et al. reported that machine learning methods such as Support Vector Machines, NumPy-based implementations, and Decision Trees can predict diabetes with 80% accuracy. The Support Vector Machine classifier specifically achieved an accuracy of 80%, demonstrating the effectiveness of machine learning in classifying diabetic and non-diabetic individuals [10].

B. Ahamed et al.
(2022), in their study “Diabetes Mellitus Disease Prediction and Type Classification Involving Predictive Modeling Using Machine Learning Techniques and Classifiers,” proposed several supervised learning techniques to create a reliable diabetes prediction model. These included Random Forest, k-Nearest Neighbors (k-NN), Decision Trees, Support Vector Machines (SVM), and Logistic Regression. The Random Forest algorithm again outperformed other classifiers, reaching 84% accuracy, 83% precision, 78% recall, 80% F1-score, and 0.86 AUC, evaluated on the Pima Indians Diabetes Dataset and survey data [11].

Sony M. et al. (2022), in their work “Prediction of Diabetes Using Machine Learning: Analysis of 70,000 Clinical Database Patient Record,” identified Random Forest as an effective method for predicting diabetes. The study highlighted the growing prevalence of diabetes worldwide, projecting that by 2040 approximately 642 million people will be affected. Their analysis was based on the Pima Indian Diabetes Dataset and the Diabetes 130-US Hospitals dataset from 1999–2008 [12].

R. Krishnamoorthi and colleagues (2022) proposed “A Novel Diabetes Healthcare Disease Prediction Framework Using Machine Learning Techniques,” describing a machine learning framework with a reported accuracy of 83%. The proposed model itself attained an accuracy score of 86%, and the study offers valuable insights for researchers, stakeholders, students, and health professionals engaged in diabetes prediction research [13].

V. Yamuna (2022), in “Diabetes Disease Prediction By Using Machine Learning Algorithms,” reported that machine learning algorithms can predict up to 90% of diabetes cases. The study examined model overfitting and underfitting to understand why some classifiers performed poorly, and achieved a consistent best prediction accuracy of 90% [14].

In 2021, Arwatki Chen Lyngdoh et al. evaluated five machine learning algorithms for diabetes prediction.
The k-NN classifier achieved an accuracy of up to 76%. The study investigated the causes of inconsistent classifier performance by analyzing training and testing accuracy to detect overfitting and underfitting. The dataset included risk factors and outcomes related to diabetes [15].

Harleen Kaur et al. (2021), in their study “Predictive modelling and analytics for diabetes using a machine learning approach,” developed and evaluated five machine learning models, including SVM, k-NN, ANN, and MDR, using the Pima Indian diabetes dataset. Their findings showed that the SVM-linear model achieved the highest accuracy (0.89) and precision (0.88), while the k-NN model had the best recall and F1-score (0.90 and 0.88, respectively). The AUC values for the SVM-linear and k-NN models were 0.90 and 0.92, respectively [16].

Amandeep Sharma et al. (2021) proposed a machine learning model combining logistic regression, ANN, Naïve Bayes, and decision trees for diabetes prediction. On the Pima Indian diabetes dataset, logistic regression achieved an accuracy of 80.43%, Naïve Bayes 76.95%, the decision tree 76.52%, and the artificial neural network classifier 75.21% [17].

Table 1 summarizes existing studies on diabetes prediction models, including datasets used and accuracy metrics.

Table 1. Overview of Existing Studies on Diabetes Prediction Models

| Paper | Abstract summary | Main findings | Accuracy | Dataset |
|---|---|---|---|---|
| Predictive Modelling For Diabetes Using Machine Learning, 2024, Shriya Aishani Rachakonda et al. [9] | The paper applies machine learning algorithms to predict the occurrence of diabetes using diagnostic characteristics. | The study employed supervised learning classification techniques: Random Forest, k-Nearest Neighbours (k-NN), Decision Trees, Support Vector Machines (SVM), and Logistic Regression. | Random Forest performed best, with accuracy 84%, precision 83%, recall 78%, F1-score 80%, and AUC 0.86. | National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK) dataset |
| Diabetes Prediction System Using Machine Learning, 2023, Kundan Kumar et al. [10] | Diabetes may be predicted with 80% accuracy using machine learning techniques like Decision Trees, NumPy, and Support Vector Machine. | The Support Vector Machine classifier predicted diabetes with 80% accuracy. People with and without diabetes can be efficiently categorised using machine learning modelling. | 80% | Not mentioned |
| Diabetes Mellitus Disease Prediction and Type Classification Involving Predictive Modeling Using Machine Learning Techniques and Classifiers, 2022, B. Ahamed et al. [11] | The study offers machine learning methods for accurately predicting and categorising diabetes mellitus. | A predictive model of diabetes was built using several supervised learning classification methods: Logistic Regression, Support Vector Machines (SVM), Decision Trees, k-Nearest Neighbours (k-NN), and Random Forest. | Random Forest performed best, with accuracy 84%, precision 83%, recall 78%, F1-score 80%, and AUC 0.86. | Pima Indians Diabetes Dataset and survey |
| Prediction of Diabetes Using Machine Learning: Analysis of 70,000 Clinical Database Patient Record, 2022, Sony M et al. [12] | Random Forest is one of the machine learning algorithms that can be used to accurately predict diabetes. | The number of patients with diabetes is expected to reach 642 million globally by 2040, indicating the growing prevalence of the disease. | Not specified | Diabetes 130-US hospitals for the years 1999–2008 Data Set and the Pima Indian Diabetes Dataset |
| A Novel Diabetes Healthcare Disease Prediction Framework Using Machine Learning Techniques, 2022, R. Krishnamoorthi et al. [13] | The study proposes a machine learning framework for diabetes prediction with a reported accuracy of 83%. | Health professionals, stakeholders, students, and academics working on diabetes prediction research and development can all benefit from the study's conclusions. | The accuracy score of the proposed machine learning-based architecture was 86%. | Survey |
| Diabetes Disease Prediction By Using Machine Learning Algorithms, 2022, V. Yamuna [14] | Up to 90% of diabetes cases can be predicted using machine learning algorithms. | The study looked at model overfitting and underfitting to determine why some machine learning classifiers performed poorly. | Consistent best accuracy of 90% | Not mentioned |
| Diabetes Disease Prediction Using Machine Learning Algorithms, 2021, Arwatki Chen Lyngdoh et al. [15] | The study assesses five machine learning algorithms for diabetes prediction; the k-NN classifier achieves an accuracy of up to 76%. | By analysing training and testing accuracy for signs of overfitting or underfitting, the study determined why certain classifiers failed to reach consistent and high accuracy. | 76% | Diabetes dataset containing information on risk factors and outcomes related to diabetes |
| Predictive modelling and analytics for diabetes using a machine learning approach, 2021, Harleen Kaur et al. [16] | The research develops and assesses five alternative machine learning models to predict and categorise diabetes in the Pima Indian diabetes dataset. | To categorise patients as diabetic or non-diabetic, the study created and examined five distinct machine learning models, including SVM, k-NN, ANN, and MDR. | SVM-linear gave the best accuracy (0.89) and precision (0.88); k-NN gave the best recall (0.90) and F1-score (0.88); AUC values of the SVM-linear and k-NN models were 0.90 and 0.92. | Pima Indian diabetes dataset |
| Prediction of Diabetes Disease Using Machine Learning Model, 2021, Amandeep Sharma et al. [17] | The research proposes a machine learning model to predict diabetes using decision tree, Naïve Bayes, ANN, and logistic regression. | Using supervised machine learning algorithms (logistic regression, Naïve Bayes, decision trees, and artificial neural networks), the study created a model for diabetes prediction. | Logistic regression 80.43%, Naïve Bayes 76.95%, decision tree 76.52%, artificial neural network 75.21% | Pima Indian diabetes dataset |

Research Gap

Despite significant advances in applying machine learning to diabetes prediction, several gaps remain in the literature. First, the datasets used lack diversity; most studies rely heavily on the Pima Indians Diabetes Database and similar datasets. These datasets represent only a subset of the diabetic population and may not accurately reflect the broader, more diverse global population. Consequently, the generalizability of the developed models is questionable. Additionally, although many algorithms achieve high accuracy, there is limited emphasis on the interpretability of the models and their clinical applicability. Moreover, the validation methods employed in many studies are not sufficiently robust, often relying on small sample sizes or lacking external validation. This can lead to overfitting and reduce the model’s ability to generalize effectively in real-world scenarios.

Methodology

The approaches used for diabetes prediction, and consequently the methodologies employed in the reviewed studies, vary significantly. Common practices typically include data preprocessing steps such as normalization, handling missing values, and encoding categorical variables. Dimensionality reduction techniques, including feature selection and Principal Component Analysis (PCA), are frequently applied to enhance model performance.
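As a minimal sketch of two of the preprocessing steps named above (handling missing values and normalization), the following uses only the Python standard library. It assumes that missing measurements are coded as 0, a convention commonly reported for the Pima data but not stated in this review. The glucose readings and the function names are hypothetical and chosen for illustration only.

```python
from statistics import mean, median, pstdev


def impute_zeros_with_median(column):
    """Replace zero-coded missing entries with the median of the observed values."""
    observed = [v for v in column if v != 0]
    fill = median(observed)
    return [fill if v == 0 else v for v in column]


def min_max_scale(column):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


def standardize(column):
    """Shift and scale values to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(column), pstdev(column)
    if sigma == 0:
        return [0.0 for _ in column]
    return [(v - mu) / sigma for v in column]


# Hypothetical glucose readings; 0 is ASSUMED to mark a missing measurement.
glucose = [148, 85, 0, 89, 137, 0, 116]

cleaned = impute_zeros_with_median(glucose)
scaled = min_max_scale(cleaned)
z_scores = standardize(cleaned)

print(cleaned)
print([round(v, 3) for v in scaled])
print([round(v, 3) for v in z_scores])
```

Median imputation is only one possible choice here; the reviewed studies do not specify a single strategy, and in practice the fill value and scaling parameters would be computed on the training split only.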
Additionally, various statistical tests may be used to assist in feature selection. Studies also differ in their choice of machine learning algorithms, with some comparing multiple algorithms to identify the best model for their dataset. Model performance is generally evaluated using metrics such as accuracy, precision, recall, and the area under the receiver operating characteristic curve (AUC-ROC), providing insight into the strengths and weaknesses of each algorithm.

The main steps of the suggested methodology include:

Data Cleaning: Addressing missing values and ensuring correct data types to enable efficient analysis.

Data Processing: This is a critical step when using datasets like the Pima Indians Diabetes Database, where effective management of missing values is essential. Key features considered in this study include blood pressure, skin thickness, insulin, glucose, and body mass index.

Normalization and Standardization: Scaling the data to ensure that every feature contributes equally to the model’s performance.

Model accuracy varies significantly across studies. For example, one model achieved an accuracy of 93.22% on the Pima Indians dataset and 98.95% on the Mendeley dataset. Generally, an accuracy of around 78% is considered good, while 81% or higher is regarded as exceptional, especially when validated using 10-fold cross-validation.

Figure 2 illustrates the standard methodology, highlighting key steps such as data preprocessing, feature selection, model training, and evaluation.

In Fig. 2, the standard methodology used in studies predicting diabetes with machine learning is depicted as a flowchart. It outlines critical steps leading to the final prediction, including data preprocessing, dimensionality reduction, algorithm selection, model training, and evaluation. This systematic approach helps improve both the accuracy and reliability of the predictive models.
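The 10-fold cross-validation mentioned above can be sketched as follows. This is an illustrative, standard-library-only stratified splitter, not the procedure of any reviewed study. The function name `stratified_k_fold` is hypothetical, and the label vector simply mirrors the class counts reported for the Pima Indians Diabetes Database (500 negative, 268 positive).

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs whose test folds keep roughly the original class ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices round-robin so each fold gets a near-equal share of the class.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)

    for i in range(k):
        test_idx = sorted(folds[i])
        train_idx = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        yield train_idx, test_idx


# Labels mirroring the reported Pima class counts: 500 negative (0), 268 positive (1).
labels = [0] * 500 + [1] * 268
splits = list(stratified_k_fold(labels, k=10))

for fold_number, (train_idx, test_idx) in enumerate(splits, start=1):
    positives = sum(labels[i] for i in test_idx)
    print(fold_number, len(train_idx), len(test_idx), positives)
```

Each sample appears in exactly one test fold, so averaging a metric over the ten folds uses every record for evaluation once. Stratification is a design choice that keeps the minority class represented in every fold.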
The machine learning-based methodology for diabetes prediction involves several algorithms that work with both labeled and unlabeled data. Special emphasis is placed on semi-supervised learning techniques, which are particularly beneficial when obtaining labeled data is costly or limited. By incorporating unlabeled data alongside labeled instances, these methods enhance learning efficiency and enable training on larger datasets. For example, integrating significant amounts of unlabeled data using Laplacian SVM (LapSVM) has been shown to improve both accuracy and generalization capabilities [1].

The literature review, which analyzed a carefully selected dataset of 2,351 publications, was structured following the standards established by Marcus et al. [45]. A systematic approach using Term Frequency-Inverse Document Frequency (TF-IDF) was employed for text analysis [2][3]. To maintain objectivity and reduce bias, a four-point scale was developed for both qualitative and quantitative evaluations.

Lifestyle factors and demographic information emerged as important predictors of diabetes risk. Studies highlighted age and gender as significant variables affecting metabolic health and insulin sensitivity [2]. Additionally, models incorporated modifiable lifestyle characteristics such as physical activity and smoking, emphasizing the need for dynamic prediction models that adapt to changing health behaviors [2]. Increasingly, researchers are combining data from multiple sources, such as lifestyle and genetic information, to create more comprehensive risk assessments [2].

Data imbalance, a common challenge in classification problems, was addressed using various strategies aimed at rebalancing the dataset distribution. Techniques like cost-sensitive learning, ensemble methods, and re-sampling were applied to improve model generalization and performance [1][2].
These approaches are vital to ensuring fair representation of minority groups in predictions and mitigating biases caused by uneven sample sizes across categories [1].

Model performance was assessed using key metrics such as accuracy, precision, recall, F1 score, and the area under the receiver operating characteristic curve (AUC). Together, these metrics provide a comprehensive evaluation of the model’s effectiveness. Precision and recall offer insight into the model’s ability to correctly identify positive cases, while accuracy measures the overall proportion of correct predictions [19][20].

One of the most widely used datasets in diabetes prediction research is the Pima Indians Diabetes Database. It contains medical diagnostic data for 768 female patients of Pima Indian heritage aged 21 or older. The dataset includes nine attributes relevant to diabetes classification: eight predictors and one target variable. Its extensive use in developing and evaluating machine learning algorithms has made it a benchmark dataset in this field [20].

The dataset comprises clinical and biometric parameters necessary for diabetes diagnosis, including blood pressure, age, body mass index (BMI), and glucose levels. The objective is to use these features to diagnose and predict diabetes status [21]. Notably, the dataset is imbalanced, with 500 negative (non-diabetic) cases and 268 positive (diabetic) cases, posing challenges for predictive modeling due to potential biases in the results.

Finally, an overview of the main machine learning techniques frequently employed for diabetes prediction is provided. As shown in Fig. 3, machine learning algorithms are widely used for diabetes prediction because they can identify complex patterns within clinical data.

Logistic Regression (LR) is a popular statistical model for binary classification, estimating the probability that a given input belongs to a particular class, such as diabetic or non-diabetic.
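As a rough illustration of how logistic regression turns patient features into a class probability, the sketch below passes a weighted sum of inputs through the sigmoid function. The feature order, coefficients, bias, and patient values are all made up for illustration and are not fitted to any dataset discussed in this review.

```python
import math


def sigmoid(z):
    """Map any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(features, weights, bias):
    """Logistic regression: probability of the positive class for one input."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


# HYPOTHETICAL coefficients for standardized (glucose, BMI, age); not fitted values.
weights = [1.1, 0.6, 0.3]
bias = -0.8

# HYPOTHETICAL standardized measurements for one patient.
patient = [1.5, 0.4, -0.2]

p = predict_proba(patient, weights, bias)
label = "diabetic" if p >= 0.5 else "non-diabetic"
print(round(p, 3), label)
```

The 0.5 threshold is a convention rather than a requirement; in a screening setting a lower threshold might be preferred to reduce missed cases, at the cost of more false positives.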
k-Nearest Neighbors (k-NN) is a simple yet effective instance-based algorithm that classifies new data points based on the majority class among their closest neighbors. Decision Trees (DT) build a tree structure where internal nodes represent decisions, branches represent possible outcomes, and leaves correspond to class labels. Random Forest (RF) improves accuracy and reduces overfitting by combining multiple decision trees into an ensemble. Support Vector Machines (SVM) find the hyperplane that maximizes the margin separating the classes, an approach that remains effective in high-dimensional spaces. Naive Bayes (NB), based on Bayes’ theorem and assuming feature independence, is simple but often achieves strong performance. Boosting techniques such as Gradient Boosting Machines (GBM) and XGBoost build models sequentially, with each new model correcting errors made by the previous one, making them particularly effective for structured data. Deep Learning (DL), utilizing artificial neural networks with multiple layers, excels at capturing complex relationships in large datasets and has demonstrated high accuracy in diabetes prediction.

To thoroughly evaluate model performance, metrics such as Accuracy, Precision, Recall, and F-measure, derived from the confusion matrix, are commonly used. Additionally, metrics like Sensitivity (equivalent to Recall), Specificity, Area Under the ROC Curve (AUC), and Matthews Correlation Coefficient (MCC) provide deeper insight into model behavior. These diverse metrics help balance trade-offs between different types of errors (e.g., false positives vs. false negatives), which is crucial for selecting the best model for practical and clinical applications in diabetes prediction.

Given the data-driven nature of machine learning, access to diverse and comprehensive datasets is vital to improving model performance.
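The confusion-matrix metrics discussed above can be made concrete with a short sketch. It compares a trivial baseline that labels every patient non-diabetic, using the 500 negative and 268 positive cases reported for the Pima data, against a hypothetical classifier whose counts are invented for illustration. Returning 0.0 when a denominator is zero is a convention assumed here, not something the reviewed studies specify.

```python
import math


def confusion_metrics(tp, fp, tn, fn):
    """Compute common classification metrics from confusion-matrix counts."""

    def ratio(numerator, denominator):
        # ASSUMED convention: report 0.0 when a metric is undefined (zero denominator).
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)  # also called sensitivity
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "specificity": ratio(tn, tn + fp),
        "f1": ratio(2 * precision * recall, precision + recall),
        "mcc": ratio(tp * tn - fp * fn, mcc_denominator),
    }


# Baseline: predict "non-diabetic" for all 768 records (500 negative, 268 positive).
baseline = confusion_metrics(tp=0, fp=0, tn=500, fn=268)

# HYPOTHETICAL classifier on the same class split; counts invented for illustration.
model = confusion_metrics(tp=190, fp=60, tn=440, fn=78)

for name, result in (("baseline", baseline), ("model", model)):
    print(name, {metric: round(value, 3) for metric, value in result.items()})
```

The baseline reaches roughly 65% accuracy while detecting no diabetic patient at all (recall and MCC of zero), which illustrates why accuracy alone can mislead on imbalanced data.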
Future research should focus on collecting extensive data from varied demographics, considering multiple factors such as medical conditions and lifestyle choices that influence diabetes risk. Collaborative sharing of anonymized patient data across institutions will facilitate the development of more generalizable models suitable for a broad range of populations.

Conclusion

Drawing insights from a range of recent studies, this review critically examines the application of various machine learning (ML) algorithms in diagnosing diabetes. ML techniques such as Random Forest, Support Vector Machines (SVM), k-Nearest Neighbors (k-NN), Logistic Regression, and Deep Learning are reported to be effective for prediction. These models, when trained on structured clinical datasets like the Pima Indians Diabetes Database, commonly achieve high accuracy, precision, and AUC scores. Notably, ensemble and deep learning models have consistently outperformed traditional classifiers in predictive ability.

Despite these promising results, several research limitations persist. A key limitation is the lack of generalizability, primarily due to the reliance on benchmark datasets that are rarely diverse. Additionally, interpretability, which is crucial for clinical acceptance, is generally under-addressed. Moreover, relatively few studies use external validation strategies to ensure the models’ robustness across different populations and clinical settings; this vulnerability remains largely unaddressed in most current publications. As shown in Table 2, different ML models report varying accuracy levels across datasets, with Random Forest and k-NN demonstrating particularly promising results.
Table 2. Algorithms, datasets used, and reported accuracy rates

| Algorithm(s) | Dataset | Accuracy (%) |
|---|---|---|
| Random Forest, k-NN, Decision Trees, SVM, Logistic Regression | NIDDK Dataset | 84% |
| SVM, NumPy, Decision Trees | Not specified | 80% |
| Random Forest, k-NN, Decision Trees, SVM, Logistic Regression | Pima Indians Diabetes Dataset + Survey | 84% |
| Random Forest | Pima Indians Dataset, Diabetes 130-US hospitals (1999–2008) | Not specified |
| Not specified (general ML-based architecture) | Not specified | 83%, proposed model: 86% |
| Not specified | Not specified | 90% |
| k-NN and others | Diabetes dataset with risk factors and consequences | 76% (k-NN) |
| SVM (Linear), k-NN, ANN, MDR | Pima Indian Diabetes Dataset | SVM: 89% (accuracy); k-NN: 90% (recall) |
| Logistic Regression, ANN, Naïve Bayes, Decision Trees | Pima Indian Diabetes Dataset | LR: 80.43%, NB: 76.95%, DT: 76.52%, ANN: 75.21% |

The figure illustrates key performance metrics (Accuracy, Precision, Recall, and F1-Score) used to visually compare different machine learning algorithms for diabetes prediction.

Future Scope

To enhance the clinical applicability and relevance of machine learning (ML) models for diabetes prediction, future research must address several key areas. First and foremost, dataset diversification is essential. Future studies should leverage large, multi-center, real-world datasets, such as electronic health records and longitudinal data, to improve model generalizability and reduce biases associated with limited or homogeneous datasets. Interpretability is another critical aspect. The adoption of explainable AI (XAI) techniques will be vital in making ML predictions transparent and understandable, which is crucial for gaining clinicians’ trust and promoting the clinical adoption of these models. Moreover, more rigorous validation methods are necessary to ensure model robustness and generalizability.
This includes employing stringent validation protocols, such as external dataset validation and k-fold cross-validation, which help minimize over fitting and provide a more accurate evaluation of model performance. Finally, integrating these ML platforms seamlessly into clinical systems should be prioritized. Future tools must be designed for smooth incorporation into existing electronic health record systems and clinical decision support systems, enabling real-time risk stratification and diagnosis aligned with current clinical workflows. References Chou C-Y, Hsu D-Y, Chou C-H (2023) Predicting the onset of diabetes with machine learning methods, Journal of Personalized Medicine , vol. 13, no. 3, p. 406, Feb. 10.3390/jpm13030406 Kalla D, Smith N, Samaah F, Polimetla K (Mar. 2022) Enhancing early diagnosis: Machine learning applications in diabetes prediction. J Artif Intell Cloud Comput 1–7. 10.47363/jaicc/2022(1)191 Qin Y et al (2022) Machine Learning Models for Data-Driven Prediction of Diabetes by Lifestyle Type, International Journal of Environmental Research and Public Health , vol. 19, no. 22, p. 15027, Nov. 10.3390/ijerph192215027 Talebi Moghaddam M et al (Sep. 2024) Predicting diabetes in adults: identifying important features in unbalanced data over a 5-year cohort study using machine learning algorithm. BMC Med Res Methodol 24(1). 10.1186/s12874-024-02341-z Abnoosian K, Farnoosh R (2022) Prediction of Diabetes Disease Using an Ensemble of Machine Learning Multi-Classifier Models. SSRN Electron J. 10.2139/ssrn.4179050 Alagumariappan P et al (Mar. 2025) Optimized hybrid machine learning framework for early diabetes prediction using electrogastrograms. Sci Rep 15(1). 10.1038/s41598-025-93495-3 Costea NE, Moisi EV, Popescu DE, Comparison of Machine Learning Algorithms for Prediction of Diabetes, in (2021) 16th International Conference on Engineering of Modern Electric Systems (EMES) , IEEE, Jun. 2021, pp. 1–4. Accessed: Apr. 23, 2025. [Online]. 
Available: https://doi.org/10.1109/emes52337.2021.9484116 Tasin I, Nabil TU, Islam S, Khan R (2022) Diabetes prediction using machine learning and explainable AI techniques, Healthcare Technology Letters , vol. 10, no. 1–2, pp. 1–10, Dec. 10.1049/htl2.12039 Rachakonda SA, Pudipedi S, Angel TSS, PREDICTIVE MODELLING FOR DIABETES, USING MACHINE LEARNING (2024), INTERANTIONAL JOURNAL OF SCIENTIFIC RESEARCH IN ENGINEERING AND MANAGEMENT , vol. 08, no. 008, pp. 1–16, Aug. 10.55041/ijsrem37149 Kumar K, Tomar A (2023) Diabetes Prediction System Using Machine Learning, in 2023 International Conference on Advances in Computation, Communication and Information Technology (ICAICCIT) , IEEE, Nov. pp. 286–291. Accessed: Apr. 23, 2025. [Online]. Available: https://doi.org/10.1109/icaiccit60255.2023.10466034 Ahamed BS, Arya MS, Sangeetha SKB, Auxilia Osvin NV (2022) Diabetes Mellitus Disease Prediction and Type Classification Involving Predictive Modeling Using Machine Learning Techniques and Classifiers, Applied Computational Intelligence and Soft Computing , vol. pp. 1–11, Dec. 2022. 10.1155/2022/7899364 Kuriakose SM, Basa Pati P, Singh T, Prediction of Diabetes Using Machine Learning: Analysis of 70,000 Clinical Database Patient Record, in (2022) 13th International Conference on Computing Communication and Networking Technologies (ICCCNT) , IEEE, Oct. 2022, pp. 1–5. Accessed: Apr. 23, 2025. [Online]. Available: https://doi.org/10.1109/icccnt54827.2022.9984264 Krishnamoorthi R et al (2022) A Novel Diabetes Healthcare Disease Prediction Framework Using Machine Learning Techniques, Journal of Healthcare Engineering , vol. pp. 1–10, Jan. 2022. 
10.1155/2022/1684017 Yamuna VYV, Chaitanya DU, sri B (2022) Y., & T.Jagadish Diabetes Disease Prediction By Using Machine Learning Algorithms., Semanticscholar , 2022 Lyngdoh AC, Choudhury NA, Moulik S, Diabetes Disease Prediction Using Machine Learning Algorithms, in (2020) IEEE-EMBS Conference on Biomedical Engineering and Sciences (IECBES) , IEEE, Mar. 2021, pp. 517–521. Accessed: Apr. 23, 2025. [Online]. Available: https://doi.org/10.1109/iecbes48179.2021.9398759 Kaur H, Kumari V (2020) Predictive modelling and analytics for diabetes using a machine learning approach, Applied Computing and Informatics , vol. 18, no. 1/2, pp. 90–100, Jul. 10.1016/j.aci.2018.12.004 Sharma A (2021) Prediction of Diabetes Disease Using Machine Learning Model. Semanticscholar Kaliappan J et al (Aug. 2024) Analyzing classification and feature selection strategies for diabetes prediction across diverse diabetes datasets. Front Artif Intell 7. 10.3389/frai.2024.1421751 Abousaber I, Abdallah HF, El-Ghaish H (Jan. 2025) Robust predictive framework for diabetes classification using optimized machine learning on imbalanced datasets. Front Artif Intell 7. 10.3389/frai.2024.1499530 Halder N (2024) Exploring the Pima Indians Diabetes Dataset: Advanced Data Analysis Techniques in Python, Medium , Jan. 03, Accessed: Apr. 23, 2025. [Online]. Available: https://medium.com/@HalderNilimesh/exploring-the-pima-indians-diabetes-dataset-advanced-data-analysis-techniques-in-python-f02cba6f9f35 Erzurumlu AS Optimizing Healthcare Predictions with CatBoost: A Study on the Pima Indians Diabetes Dataset, LinkedIn , Sep. 16, 2024. Accessed: Apr. 23, 2025. [Online]. Available: https://www.linkedin.com/pulse/optimizing-healthcare-predictions-catboost-study-pima-erzurumlu-ipuvf/ Additional Declarations The authors declare no competing interests. 
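Illustrative sketch (not part of the original study): the concluding recommendation to validate models with k-fold cross-validation can be made concrete with a small, dependency-free Python example. Everything in it is an assumption introduced for illustration: the data are synthetic stand-ins rather than patient records, the single glucose-like feature and the threshold classifier are toy choices, and the helper names `stratified_k_fold`, `fit_threshold`, and `cross_validate` are hypothetical. Stratification is used so that each fold keeps roughly the same share of positive cases, which matters for imbalanced diabetes data.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs whose class ratios mirror the full data."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        # Deal each class round-robin so every fold gets a near-equal share.
        for pos, i in enumerate(idxs):
            folds[pos % k].append(i)
    for f in range(k):
        test = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, test


def fit_threshold(xs, ys):
    """Toy one-feature classifier: pick the cut-off with the best training accuracy."""
    best_t, best_acc = None, -1.0
    for t in sorted(set(xs)):
        acc = sum((x >= t) == bool(y) for x, y in zip(xs, ys)) / len(xs)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t


def cross_validate(xs, ys, k=10, seed=0):
    """Return one held-out accuracy per fold; the model never sees its test fold."""
    scores = []
    for train, test in stratified_k_fold(ys, k, seed):
        t = fit_threshold([xs[i] for i in train], [ys[i] for i in train])
        correct = sum((xs[i] >= t) == bool(ys[i]) for i in test)
        scores.append(correct / len(test))
    return scores


# Synthetic stand-in data (an assumption, NOT real records): about 35% positives,
# with a glucose-like feature that is higher on average for positive cases.
rng = random.Random(42)
ys = [1 if rng.random() < 0.35 else 0 for _ in range(300)]
xs = [rng.gauss(140 if y else 110, 25) for y in ys]

scores = cross_validate(xs, ys, k=10)
print(f"mean accuracy over {len(scores)} folds: {sum(scores) / len(scores):.3f}")
```

Reporting the per-fold scores alongside their mean gives a rough sense of variance; external validation on an independent dataset, as the text also recommends, would still be needed to assess generalizability.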
{"props":{"pageProps":{"initialData":{"identity":"rs-7145782","acceptedTermsAndConditions":true,"allowDirectSubmit":true,"archivedVersions":[],"articleType":"Research Article","associatedPublications":[],"authors":[{"id":487868317,"identity":"19553045-ca6c-4312-ad9d-29de6bde731c","order_by":0,"name":"Abdul Aamir Khan","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAAA3UlEQVRIiWNgGAWjYBACCR4GxgcfKmx4+EG8hALitDAbzjiTJiPZANJiQJwWNmHetsM2BgdAXGK0SPYcfsY4g+0wj/H51YkfHhgwyPOLHcCvRZq3zezBB550HrMbbzdLAB1mOHN2An4tcvwM5oYzJKyBWs5uAGlJMLhNUAv7N2keA2Ye4xlnN/8gSos0b4+ZNE+CM48Bf+824myR7DlTbDjjQBqPxA3ebRYJBhKE/SJxJn3jg4//bOz5+89uvvmjwkaeX5qAFiTNYJUSxCoHAf4DpKgeBaNgFIyCkQQASWZCnCzdbIcAAAAASUVORK5CYII=","orcid":"https://orcid.org/0009-0005-8664-0379","institution":"Mandsaur university, Mandsaur M.P. 
India","correspondingAuthor":true,"prefix":"","firstName":"Abdul","middleName":"Aamir","lastName":"Khan","suffix":""},{"id":487870588,"identity":"0be2d10c-1631-4809-82c6-f83d89c89551","order_by":1,"name":"Dr. Bk Sharma","email":"","orcid":"","institution":"Mandsaur university, mandsaur M.P. India","correspondingAuthor":false,"prefix":"Dr.","firstName":"Bk","middleName":"","lastName":"Sharma","suffix":""}],"badges":[],"createdAt":"2025-07-17 07:07:51","currentVersionCode":1,"declarations":{"humanSubjects":false,"vertebrateSubjects":false,"conflictsOfInterestStatement":false,"humanSubjectEthicalGuidelines":false,"humanSubjectConsent":false,"humanSubjectClinicalTrial":false,"humanSubjectCaseReport":false,"vertebrateSubjectEthicalGuidelines":false},"doi":"10.21203/rs.3.rs-7145782/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-7145782/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":87275851,"identity":"4947f733-e296-4b3d-b929-37ce35a7a900","added_by":"auto","created_at":"2025-07-22 09:03:39","extension":"png","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":63746,"visible":true,"origin":"","legend":"\u003cp\u003echallenges in diabetes prediction.\u003c/p\u003e","description":"","filename":"image1.png","url":"https://assets-eu.researchsquare.com/files/rs-7145782/v1/900c0085149b94630da092a2.png"},{"id":87275852,"identity":"b37c4fa2-0ecd-405c-9d55-3392adc1e975","added_by":"auto","created_at":"2025-07-22 09:03:39","extension":"png","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":69684,"visible":true,"origin":"","legend":"\u003cp\u003eStages of ML model\u003c/p\u003e","description":"","filename":"image2.png","url":"https://assets-eu.researchsquare.com/files/rs-7145782/v1/0cfbccaad42c044e53aaf8a4.png"},{"id":87275859,"identity":"dadb3301-4d82-4ab2-9580-a426269a994f","added_by":"auto","created_at":"2025-07-22 
09:03:39","extension":"png","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":573631,"visible":true,"origin":"","legend":"\u003cp\u003eAlgorithms for diabetes prediction\u003c/p\u003e","description":"","filename":"image3.png","url":"https://assets-eu.researchsquare.com/files/rs-7145782/v1/016d3de854640e001e1d6d92.png"},{"id":87277093,"identity":"ec44a421-319f-42ee-a81a-8f7fdd3d9b15","added_by":"auto","created_at":"2025-07-22 09:11:39","extension":"png","order_by":4,"title":"Figure 4","display":"","copyAsset":false,"role":"figure","size":108867,"visible":true,"origin":"","legend":"\u003cp\u003ePerformance Matrix of the Machine Learning Algorithm for Diabetes Prediction\u003c/p\u003e","description":"","filename":"image4.png","url":"https://assets-eu.researchsquare.com/files/rs-7145782/v1/9f36d4a9e98dfe22f8b72e5f.png"},{"id":87277935,"identity":"9c95309c-927b-4aca-85c6-f26bf669277b","added_by":"auto","created_at":"2025-07-22 09:19:42","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":1074978,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-7145782/v1/d563ccfe-f671-4f50-9b2d-bf809cdcfc22.pdf"}],"financialInterests":"The authors declare no competing interests.","formattedTitle":"\u003cp\u003e\u003cstrong\u003eComparative Study of Machine Learning Techniques for Diabetes Forecasting\u003c/strong\u003e\u003c/p\u003e","fulltext":[{"header":"Introduction","content":"\u003cp\u003eDiabetes mellitus is a chronic metabolic disorder characterized by abnormally high blood glucose levels, which, if left unmanaged, can lead to severe complications. According to the World Health Organization (WHO), approximately 422\u0026nbsp;million people worldwide were living with diabetes in 2014, highlighting a growing public health concern. 
This rising prevalence underscores the urgent need for effective prediction and management strategies. Traditional diagnosis and management rely heavily on clinical assessments and laboratory tests, which can be time-consuming and sometimes cumbersome. Recent advances in machine learning (ML) have introduced innovative methods for healthcare data analysis and predictive modeling. By analyzing large datasets, machine learning algorithms can identify complex patterns and generate highly accurate predictions. This capability is particularly valuable for early detection and intervention in diabetes. Common datasets used in diabetes prediction research include the Pima Indians Diabetes Database and the UCI Diabetes Dataset. In recent years, many researchers have applied a variety of machine learning techniques to diabetes prediction, including logistic regression, decision trees, random forests, support vector machines, and neural networks. Studies utilizing the Pima Indians Diabetes Database have reported accuracy rates above 85% for models such as Random Forests and Support Vector Machines. Moreover, feature selection methods like recursive feature elimination and correlation analysis have been employed to improve model performance by identifying the most predictive features.\u003c/p\u003e\u003cp\u003eDespite promising results, variations in methodologies, evaluation metrics, and datasets across studies make direct comparisons challenging. The application of machine learning for diabetes prediction remains an evolving field, with emerging computational approaches aimed at improving the accuracy and effectiveness of risk assessment. With over 400\u0026nbsp;million people affected worldwide and a continuing rise in incidence, there is a critical need for novel approaches that can identify diabetes before clinical symptoms manifest. 
Researchers are focusing on combining machine learning algorithms with large-scale datasets to uncover key risk factors and enhance early detection techniques, ultimately improving patient outcomes and healthcare management. Advanced machine learning methods such as ensemble learning, deep learning, and semi-supervised learning are increasingly employed in this domain. These techniques leverage both labeled and unlabeled data to build robust, adaptable models tailored to individual patient characteristics. Key aspects of this approach include feature selection, handling data imbalances, and selecting appropriate evaluation metrics to ensure model reliability and effectiveness. Additionally, integrating lifestyle and demographic data provides a comprehensive understanding of diabetes risk. Despite its potential, the field faces significant challenges, particularly regarding data privacy and model interpretability. The reliance on sensitive patient data raises security and ethical concerns, prompting the adoption of privacy-preserving technologies such as federated learning and blockchain. Furthermore, making machine learning models interpretable is essential to ensure their acceptance and effective use by healthcare professionals in clinical settings. Going forward, advancing diabetes prediction requires interdisciplinary collaboration and the integration of cutting-edge technologies. Ongoing research aims to improve model applicability, interpretability, and accuracy across diverse populations, thereby enhancing diabetes prevention and care. 
Collaboration among scientists, clinicians, and technology experts is critical to developing solutions that address both the complexities of diabetes and the evolving landscape of machine learning in healthcare.\u003c/p\u003e\u003cp\u003eAs illustrated in Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e, diabetes prediction faces several challenges, including data imbalance, interpretability, and model generalizability.\u003c/p\u003e"},{"header":"Literature Review","content":"\u003cp\u003eThe study \u003cem\u003e\u0026ldquo;Predictive Modeling for Diabetes Using Machine Learning\u0026rdquo;\u003c/em\u003e (2024) by Shriya Aishani Rachakonda et al. explores the use of machine learning algorithms to forecast diabetes development based on diagnostic criteria. The work employs supervised classification methods including Random Forest, k-Nearest Neighbors (k-NN), Decision Trees, Support Vector Machines (SVM), and Logistic Regression. Among these, the Random Forest algorithm performed best, achieving 84% accuracy, 83% precision, 78% recall, an 80% F1-score, and an AUC of 0.86. The model was trained and tested using clinical data from the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK) dataset [\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e]. In \u003cem\u003eMachine Learning-Based Diabetes Prediction System\u003c/em\u003e (2023), Kundan Kumar et al. reported that machine learning methods such as Support Vector Machines, NumPy-based implementations, and Decision Trees can predict diabetes with 80% accuracy. The Support Vector Machine classifier specifically achieved an accuracy of 80%, demonstrating the effectiveness of machine learning in classifying diabetic and non-diabetic individuals [\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e]. B. Ahamed et al. 
(2022) in their study \u003cem\u003e\u0026ldquo;Predictive Modeling Using Machine Learning Techniques and Classifiers for Diabetes Mellitus Disease Prediction and Type Classification\u0026rdquo;\u003c/em\u003e proposed several supervised learning techniques to create a reliable diabetes prediction model. These included Random Forest, k-Nearest Neighbors (k-NN), Decision Trees, Support Vector Machines (SVM), and Logistic Regression. The Random Forest algorithm again outperformed other classifiers, reaching 84% accuracy, 83% precision, 78% recall, 80% F1-score, and 0.86 AUC, evaluated on the Pima Indians Diabetes Dataset and survey data [\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e]. Sony M. et al. (2022), in their work \u003cem\u003e\u0026ldquo;Machine Learning-Based Diabetes Prediction: Examination of 70,000 Clinical Database Patient Records,\u0026rdquo;\u003c/em\u003e identified Random Forest as an effective method to precisely predict diabetes. The study highlighted the growing prevalence of diabetes worldwide, projecting that by 2040, approximately 642\u0026nbsp;million people will be affected. Their analysis was based on the Pima Indian Diabetes Dataset and the Diabetes 130-US Hospitals dataset from 1999\u0026ndash;2008 [\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e]. R. Krishnamoorthi and colleagues (2022) proposed \u003cem\u003e\u0026ldquo;A New Framework for Predicting Diabetes in Healthcare Using Machine Learning Methods,\u0026rdquo;\u003c/em\u003e which achieved an accuracy of 83% using innovative machine learning techniques. The proposed model, which attained an accuracy score of 86%, offers valuable insights for researchers, stakeholders, students, and health professionals engaged in diabetes prediction research [\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e]. V. 
Yamuna (2022), in \u003cem\u003e\u0026ldquo;Predicting Diabetes Through Machine Learning Algorithms,\u0026rdquo;\u003c/em\u003e reported that machine learning algorithms can predict up to 90% of diabetes cases. The study examined issues such as model overfitting and underfitting to understand why some classifiers performed poorly, and achieved a consistent and optimal diabetes prediction accuracy of 90% [\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e]. In 2021, Arwatki Chen Lyngdoh et al. evaluated five machine learning algorithms for diabetes prediction. The k-NN classifier achieved an accuracy of up to 76%. The study investigated the causes of inconsistent classifier performance by analyzing training and testing accuracy to detect overfitting and underfitting. The dataset included risk factors and outcomes related to diabetes [\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e]. Kaur Harleen et al. (2021), in their study \u003cem\u003e\u0026ldquo;Machine Learning-Based Predictive Modeling and Analytics for Diabetes,\u0026rdquo;\u003c/em\u003e developed and evaluated five machine learning models\u0026mdash;including SVM, k-NN, ANN, and MDR\u0026mdash;using the Pima Indian diabetes dataset. Their findings showed that the SVM-linear model achieved the highest accuracy (0.89) and precision (0.88), while the k-NN model had the best recall and F1-score (0.90 and 0.88, respectively). The AUC values for these models were 0.90 and 0.92 [\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e]. Sharma Amandeep et al. (2021) proposed a machine learning model combining logistic regression, ANN, Na\u0026iuml;ve Bayes, and decision trees for diabetes prediction. 
The decision tree model achieved an accuracy of 76.52%, Na\u0026iuml;ve Bayes 76.95%, logistic regression 80.43%, and the artificial neural network classifier achieved 75.21% accuracy on the Pima Indian diabetes dataset [\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e]. Table\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e summarizes existing studies on diabetes prediction models, including datasets used and accuracy metrics.\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab1\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003eOverview of Existing Studies on Diabetes Prediction Models:\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"5\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u003cp\u003ePaper\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u003cp\u003eAbstract summary\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c3\"\u003e\u003cp\u003eMain Findings\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c4\"\u003e\u003cp\u003eAccuracy\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c5\"\u003e\u003cp\u003eDataset\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" 
colname=\"c1\"\u003e\u003cp\u003ePredictive Modelling For Diabetes Using Machine Learning,\u003c/p\u003e\u003cp\u003e2024, Shriya Aishani Rachakonda et. al. [\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e]\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eThe paper applies machine learning based algorithms to predict the occurrence of diabetes by using diagnostic characteristics.\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003ethe study employed\u003c/p\u003e\u003cp\u003esupervised learning classification techniques, such as Random Forest, k-Nearest Neighbours (k-NN), Decision Trees, Support Vector Machines (SVM), and Logistic Regression.\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003eWith an accuracy of 84%, precision of 83%, recall of 78%, F1-score of 80%, and AUC of 0.86, the Random Forest algorithm performed the best.\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003ethe National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK) dataset.\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eDiabetes Prediction System Using Machine Learning, 2023, Kundan Kumar et.al [\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e]\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eDiabetes may be predicted with 80% accuracy using machine learning techniques like Decision Trees, Numpy, and Support Vector Machine.\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eDiabetes was predicted by the Support Vector Machine Classifier with 80% accuracy.\u003c/p\u003e\u003cp\u003ePeople with and without diabetes can be efficiently categorised using machine learning modelling.\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" 
colname=\"c4\"\u003e\u003cp\u003e80%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003eNot mentioned\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eDiabetes Mellitus Disease Prediction and Type Classification Involving Predictive Modeling Using Machine Learning Techniques and Classifiers, 2022, B. Ahamed et. Al. [\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e]\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eThe study offers machine learning methods for accurately predicting and categorising diabetes mellitus.\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eA reliable predictive model of diabetes for this particular study was made using several supervised learning classification methods: Logistic Regression, Support Vector Machines (SVM), Decision Trees, k-Nearest Neighbours (k-NN), Random Forest\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003eThe highest performance was achieved by the Random Forest algorithm with accuracy of 84%, precision\u0026thinsp;=\u0026thinsp;83%, recall\u0026thinsp;=\u0026thinsp;78%, F1-score\u0026thinsp;=\u0026thinsp;80%, and AUC\u0026thinsp;=\u0026thinsp;0.86.\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003ethe Pima Indians Diabetes Dataset, and survey.\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003ePrediction of Diabetes Using Machine Learning: Analysis of 70,000 Clinical Database Patient Record, 2022, Sony M et.al. 
[\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e]\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eRandom forest is one of the machine learning algorithms that can be used to accurately predict diabetes.\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eThe number of patients suffering from diabetes is expected to reach 642\u0026nbsp;million globally by 2040, indicating the growing prevalence of the disease.\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u0026nbsp;\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003eDiabetes 130-US hospitals for the years 1999\u0026ndash;2008 Data Set and\u003c/p\u003e\u003cp\u003ethe Pima Indian Diabetes Dataset.\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eA Novel Diabetes Healthcare Disease Prediction Framework Using Machine Learning Techniques, 2022 R. 
Krishnamoorthi et.al [\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e]\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eThe study suggests an 83% accurate paradigm for diabetes prediction based on clever machine learning.\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eHealth professionals, stakeholders, students, and academics working on diabetes prediction research and development can all benefit from the study's conclusions.\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003eThe accuracy score of the suggested machine learning-based architecture was 86%.\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003eSurvey\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eDiabetes Disease Prediction By Using Machine Learning Algorithms,2022, V.Yamuna. [\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e]\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eUp to 90% of diabetes cases can be predicted using machine learning algorithms.\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eThe study looked at model overfitting and underfitting to determine why some machine learning classifiers performed poorly.\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003eThe study used machine learning algorithms to predict diabetes with a consistent and best accuracy of 90%.\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003eNot mentioned\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eDiabetes Disease Prediction Using Machine Learning Algorithms,2021, Arwatki Chen Lyngdoh et.al. 
[\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e]\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eThe study assesses five machine learning algorithms for diabetic illness prediction; the KNN classifier achieves an accuracy of up to 76%.\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eBy analysing training and testing accuracy and searching for indications of overfitting or underfitting, the study determined why certain classifiers failed to reach consistent and high accuracy.\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003eAchieved the accuracy of 76%\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003ediabetes dataset containing information on risk factors and outcomes related to diabetes.\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003ePredictive modelling and analytics for diabetes using a machine learning approach,2021, Harleen Kaur et.al. 
[\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e]\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eThe research develops and assesses five alternative machine learning models to predict and categorise diabetes in the Pima Indian diabetes dataset.\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eIn order to categorise patients as either diabetic or non-diabetic, the study created and examined five distinct machine learning models: SVM, k-NN, ANN, and MDR.\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003eSVM-linear model provides best accuracy of 0.89 and precision of 0.88, k-NN model provided best recall and F1-score of 0.90 and 0.88,\u003c/p\u003e\u003cp\u003eAUC value of SVM-linear and k-NN model are 0.90 and 0.92\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003ethe Pima Indian diabetes dataset,\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003ePrediction of Diabetes Disease Using Machine Learning Model,2021, Amandeep Sharma et.al. 
[\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e]\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eThe research proposes a machine learning model to predict diabetes utilising methods including decision tree, Na\u0026iuml;ve Bayes, ANN, and logistic regression.\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eUsing supervised machine learning algorithms such as logistic regression, Na\u0026iuml;ve Bayes, decision trees, and artificial neural networks, the study created a model for diabetes prediction.\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003elogistic regression displays 80.43% accuracy,\u003c/p\u003e\u003cp\u003eNa\u0026iuml;ve Bayes algorithm is 76.95%,\u003c/p\u003e\u003cp\u003edecision tree algorithm has an accuracy of 76.52%,\u003c/p\u003e\u003cp\u003eArtificial neural network classifier has 75.21% accuracy\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003ethe Pima Indian diabetes dataset,\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/colgroup\u003e\u003c/table\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003e\u003cb\u003eResearch Gap\u003c/b\u003e\u003c/p\u003e\u003cp\u003eDespite significant advances in applying machine learning to diabetes prediction, several gaps remain in the literature. First, the datasets used lack diversity; most studies rely heavily on the Pima Indians Diabetes Database and similar datasets. These datasets represent only a subset of the diabetic population and may not accurately reflect the broader, more diverse global population. Consequently, the generalizability of the developed models is questionable. Additionally, although many algorithms achieve high accuracy, there is limited emphasis on the interpretability of the models and their clinical applicability. 
Moreover, the validation methods employed in many studies are not sufficiently robust, often relying on small sample sizes or lacking external validation. This can lead to overfitting and reduce the model\u0026rsquo;s ability to generalize effectively in real-world scenarios.\u003c/p\u003e"},{"header":"Methodology","content":"\u003cp\u003eThe approaches used for diabetes prediction\u0026mdash;and consequently the methodologies employed in the reviewed studies\u0026mdash;vary significantly. Common practices typically include data preprocessing steps such as normalization, handling missing values, and encoding categorical variables. Dimensionality reduction techniques, including feature selection and Principal Component Analysis (PCA), are frequently applied to enhance model performance. Additionally, various statistical tests may be used to assist in feature selection. Studies also differ in their choice of machine learning algorithms, with some comparing multiple algorithms to identify the best model for their dataset. Model performance is generally evaluated using metrics such as accuracy, precision, recall, and the area under the receiver operating characteristic curve (AUC-ROC), providing insight into the strengths and weaknesses of each algorithm.\u003c/p\u003e\u003cp\u003eThe main steps of the suggested methodology include:\u003c/p\u003e\u003cp\u003e\u003cstrong\u003eData Cleaning\u003c/strong\u003e\u003cp\u003eAddressing missing values and ensuring correct data types to enable efficient analysis.\u003c/p\u003e\u003c/p\u003e\u003cp\u003e\u003cstrong\u003eData Processing\u003c/strong\u003e\u003cp\u003eThis is a critical step when using datasets like the Pima Indians Diabetes Database, where effective management of missing values is essential. 
Key features considered in this study include blood pressure, skin thickness, insulin, glucose, and body mass index.\u003c/p\u003e\u003c/p\u003e\u003cp\u003e\u003cstrong\u003eNormalization and Standardization\u003c/strong\u003e\u003cp\u003eScaling the data to ensure that every feature contributes equally to the model\u0026rsquo;s performance. Model accuracy varies significantly across studies. For example, one model achieved an accuracy of 93.22% on the Pima Indians dataset and 98.95% on the Mendeley dataset. Generally, an accuracy of around 78% is considered good, while 81% or higher is regarded as exceptional\u0026mdash;especially when validated using 10-fold cross-validation.\u003c/p\u003e\u003c/p\u003e\u003cp\u003eFigure \u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e illustrates the standard methodology, highlighting key steps such as data preprocessing, feature selection, model training, and evaluation.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003eIn Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e, the standard methodology used in studies predicting diabetes with machine learning is depicted as a flowchart. It outlines critical steps leading to the final prediction, including data preprocessing, dimensionality reduction, algorithm selection, model training, and evaluation. This systematic approach helps improve both the accuracy and reliability of the predictive models. The machine learning-based methodology for diabetes prediction involves several algorithms that work with both labeled and unlabeled data. Special emphasis is placed on semi-supervised learning techniques, which are particularly beneficial when obtaining labeled data is costly or limited. By incorporating unlabeled data alongside labeled instances, these methods enhance learning efficiency and enable training on larger datasets. 
For example, integrating significant amounts of unlabeled data using Laplacian SVM (LapSVM) has been shown to improve both accuracy and generalization capabilities [\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e]. The literature review, which analyzed a carefully selected dataset of 2,351 publications, was structured following the standards established by Marcus et al. [45]. A systematic approach using Term Frequency-Inverse Document Frequency (TF-IDF) was employed for text analysis [\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e][\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e]. To maintain objectivity and reduce bias, a four-point scale was developed for both qualitative and quantitative evaluations. Lifestyle factors and demographic information emerged as important predictors in diabetes risk. Studies highlighted age and gender as significant variables affecting metabolic health and insulin sensitivity [\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e]. Additionally, models incorporated modifiable lifestyle characteristics such as physical activity and smoking, emphasizing the need for dynamic prediction models that adapt to changing health behaviors [\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e]. Increasingly, researchers are combining data from multiple sources\u0026mdash;such as lifestyle and genetic information\u0026mdash;to create more comprehensive risk assessments [\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e]. Data imbalance, a common challenge in classification problems, was addressed using various strategies aimed at rebalancing the dataset distribution. 
Techniques like cost-sensitive learning, ensemble methods, and re-sampling were applied to improve model generalization and performance [\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e][\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e]. These approaches are vital to ensuring fair representation of minority groups in predictions and mitigating biases caused by uneven sample sizes across categories [\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e]. Model performance was assessed using key metrics such as accuracy, precision, recall, F1 score, and the area under the receiver operating characteristic curve (AUC). Together, these metrics provide a comprehensive evaluation of the model\u0026rsquo;s effectiveness. Precision and recall offer insight into the model\u0026rsquo;s ability to correctly identify positive cases, while accuracy measures the overall proportion of correct predictions [\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e][\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e]. One of the most widely used datasets in diabetes prediction research is the Pima Indians Diabetes Database. It contains medical diagnostic data for 768 female patients of Pima Indian heritage aged 21 or older. The dataset includes nine attributes relevant to diabetes classification\u0026mdash;eight predictors and one target variable. Its extensive use in developing and evaluating machine learning algorithms has made it a benchmark dataset in this field [\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e]. The dataset comprises numerous clinical and biometric parameters necessary for diabetes diagnosis, including blood pressure, age, body mass index (BMI), and glucose levels. 
The objective is to use these features to diagnose and predict diabetes status [\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e]. Notably, the dataset is imbalanced, with 500 negative (non-diabetic) cases and 268 positive (diabetic) cases, posing challenges for predictive modeling due to potential biases in the results.\u003c/p\u003e\u003cp\u003eFinally, an overview of the main machine learning techniques frequently employed for diabetes prediction is provided.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003eAs shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003e, machine learning algorithms are widely used for diabetes prediction because they can identify complex patterns within clinical data. Logistic Regression (LR) is a popular statistical model for binary classification, estimating the probability that a given input belongs to a particular class, such as diabetic or non-diabetic. k-Nearest Neighbors (k-NN) is a simple yet effective instance-based algorithm that classifies new data points based on the majority class among their closest neighbors. Decision Trees (DT) build a tree structure where internal nodes represent decisions, branches represent possible outcomes, and leaves correspond to class labels. Random Forest (RF) improves accuracy and reduces overfitting by combining multiple decision trees into an ensemble method. In high-dimensional spaces, Support Vector Machines (SVM) find the hyperplane that maximizes the margin separating classes. Naive Bayes (NB), based on Bayes\u0026rsquo; theorem and assuming feature independence, is simple but often achieves strong performance. Ensemble techniques such as Gradient Boosting Machines (GBM) and XGBoost sequentially build models, with each new model correcting errors made by the previous one, making them particularly effective for structured data. 
Deep Learning (DL), utilizing artificial neural networks with multiple layers, excels at capturing complex relationships in large datasets and has demonstrated high accuracy in diabetes prediction.\u003c/p\u003e\u003cp\u003eTo thoroughly evaluate model performance, metrics such as Accuracy, Precision, Recall, and F-measure, derived from the confusion matrix, are commonly used. Additionally, more advanced metrics like Sensitivity, Specificity, Area Under the ROC Curve (AUC), and Matthews Correlation Coefficient (MCC) provide deeper insight into model behavior. These diverse metrics help balance trade-offs between different types of errors (e.g., false positives vs. false negatives), which is crucial for selecting the best model for practical and clinical applications in diabetes prediction. Given the data-driven nature of machine learning, access to diverse and comprehensive datasets is vital to improving model performance. Future research should focus on collecting extensive data from varied demographics, considering multiple factors such as medical conditions and lifestyle choices that influence diabetes risk. Collaborative sharing of anonymized patient data across institutions will facilitate the development of more generalizable models suitable for a broad range of populations.\u003c/p\u003e"},{"header":"Conclusion","content":"\u003cp\u003eDrawing insights from a range of recent studies, this review critically examines the application of various machine learning (ML) algorithms in diagnosing diabetes. It is clear that ML techniques such as Random Forest, Support Vector Machines (SVM), k-Nearest Neighbors (k-NN), Logistic Regression, and Deep Learning are highly effective for prediction. These models, when trained on structured clinical datasets like the Pima Indians Diabetes Database, commonly achieve high accuracy, precision, and AUC scores. Notably, ensemble and deep learning models have consistently outperformed traditional classifiers in predictive ability. 
Despite these promising results, several research limitations persist. A key limitation is the lack of generalizability, primarily due to the reliance on benchmark datasets that are rarely diverse. Additionally, interpretability\u0026mdash;which is crucial for clinical acceptance\u0026mdash;is generally under-addressed. Moreover, relatively few studies use external validation strategies to ensure the models\u0026rsquo; robustness across different populations and clinical settings; this vulnerability remains largely unaddressed in most current publications. As shown in Table\u0026nbsp;\u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e2\u003c/span\u003e, different ML models report varying accuracy levels across datasets, with Random Forest and k-NN demonstrating particularly promising results.\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab2\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 2\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003eTable summarizing the algorithms, datasets used, and reported accuracy rates\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"3\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u003cp\u003eAlgorithm(s)\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u003cp\u003eDataset\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c3\"\u003e\u003cp\u003eAccuracy (%)\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eRandom Forest, k-NN, Decision Trees, SVM, Logistic 
Regression\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eNIDDK Dataset\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e84%\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eSVM, Numpy, Decision Trees\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eNot Specified\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e80%\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eRandom Forest, k-NN, Decision Trees, SVM, Logistic Regression\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003ePima Indians Diabetes Dataset\u0026thinsp;+\u0026thinsp;Survey\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e84%\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eRandom Forest\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003ePima Indians Dataset, Diabetes 130-US hospitals (1999\u0026ndash;2008)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eNot Specified\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eNot specified (general ML-based architecture)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eNot Specified\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e83%, proposed model: 86%\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eNot specified\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eNot Specified\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" 
colname=\"c3\"\u003e\u003cp\u003e90%\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003ek-NN and others\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003eDiabetes dataset with risk factors and consequences\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e76% (k-NN)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eSVM (Linear), k-NN, ANN, MDR\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003ePima Indian Diabetes Dataset\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eSVM: 89%, k-NN: 90%\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eLogistic Regression, ANN, Na\u0026iuml;ve Bayes, Decision Trees\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003ePima Indian Diabetes Dataset\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eLR: 80.43%, NB: 76.95%, DT: 76.52%, ANN: 75.21%\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/colgroup\u003e\u003c/table\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003eThe figure illustrates key performance metrics\u0026mdash;Accuracy, Precision, Recall, and F1-Score\u0026mdash;used to visually compare different machine learning algorithms for diabetes prediction.\u003cdiv class=\"BlockQuote\"\u003e\u003cp\u003e\u003cb\u003eFuture Scope\u003c/b\u003e\u003c/p\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003eTo enhance the clinical applicability and relevance of machine learning (ML) models for diabetes prediction, future research must address several key areas. First and foremost, dataset diversification is essential. 
Future studies should leverage large, multi-center, real-world datasets\u0026mdash;such as electronic health records and longitudinal data\u0026mdash;to improve model generalizability and reduce biases associated with limited or homogeneous datasets. Interpretability is another critical aspect. The adoption of explainable AI (XAI) techniques will be vital in making ML predictions transparent and understandable, which is crucial for gaining clinicians\u0026rsquo; trust and promoting the clinical adoption of these models. Moreover, more rigorous validation methods are necessary to ensure model robustness and generalizability. This includes employing stringent validation protocols, such as external dataset validation and k-fold cross-validation, which help minimize overfitting and provide a more accurate evaluation of model performance. Finally, integrating these ML platforms seamlessly into clinical systems should be prioritized. Future tools must be designed for smooth incorporation into existing electronic health record systems and clinical decision support systems, enabling real-time risk stratification and diagnosis aligned with current clinical workflows.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eChou C-Y, Hsu D-Y, Chou C-H (2023) Predicting the onset of diabetes with machine learning methods, \u003cem\u003eJournal of Personalized Medicine\u003c/em\u003e, vol. 13, no. 3, p. 406, Feb. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.3390/jpm13030406\u003c/span\u003e\u003cspan address=\"10.3390/jpm13030406\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eKalla D, Smith N, Samaah F, Polimetla K (Mar. 2022) Enhancing early diagnosis: Machine learning applications in diabetes prediction. J Artif Intell Cloud Comput 1\u0026ndash;7. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.47363/jaicc/2022(1)191\u003c/span\u003e\u003cspan address=\"10.47363/jaicc/2022(1)191\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eQin Y et al (2022) Machine Learning Models for Data-Driven Prediction of Diabetes by Lifestyle Type, \u003cem\u003eInternational Journal of Environmental Research and Public Health\u003c/em\u003e, vol. 19, no. 22, p. 15027, Nov. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.3390/ijerph192215027\u003c/span\u003e\u003cspan address=\"10.3390/ijerph192215027\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eTalebi Moghaddam M et al (Sep. 2024) Predicting diabetes in adults: identifying important features in unbalanced data over a 5-year cohort study using machine learning algorithm. BMC Med Res Methodol 24(1). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1186/s12874-024-02341-z\u003c/span\u003e\u003cspan address=\"10.1186/s12874-024-02341-z\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eAbnoosian K, Farnoosh R (2022) Prediction of Diabetes Disease Using an Ensemble of Machine Learning Multi-Classifier Models. SSRN Electron J. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.2139/ssrn.4179050\u003c/span\u003e\u003cspan address=\"10.2139/ssrn.4179050\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eAlagumariappan P et al (Mar. 2025) Optimized hybrid machine learning framework for early diabetes prediction using electrogastrograms. Sci Rep 15(1). 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1038/s41598-025-93495-3\u003c/span\u003e\u003cspan address=\"10.1038/s41598-025-93495-3\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eCostea NE, Moisi EV, Popescu DE, Comparison of Machine Learning Algorithms for Prediction of Diabetes, in (2021) \u003cem\u003e16th International Conference on Engineering of Modern Electric\u003c/em\u003e Systems \u003cem\u003e(EMES)\u003c/em\u003e, IEEE, Jun. 2021, pp. 1\u0026ndash;4. Accessed: Apr. 23, 2025. [Online]. Available: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1109/emes52337.2021.9484116\u003c/span\u003e\u003cspan address=\"10.1109/emes52337.2021.9484116\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eTasin I, Nabil TU, Islam S, Khan R (2022) Diabetes prediction using machine learning and explainable AI techniques, \u003cem\u003eHealthcare Technology Letters\u003c/em\u003e, vol. 10, no. 1\u0026ndash;2, pp. 1\u0026ndash;10, Dec. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1049/htl2.12039\u003c/span\u003e\u003cspan address=\"10.1049/htl2.12039\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eRachakonda SA, Pudipedi S, Angel TSS, PREDICTIVE MODELLING FOR DIABETES, USING MACHINE LEARNING (2024), \u003cem\u003eINTERANTIONAL JOURNAL OF SCIENTIFIC RESEARCH IN ENGINEERING AND MANAGEMENT\u003c/em\u003e, vol. 08, no. 008, pp. 1\u0026ndash;16, Aug. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.55041/ijsrem37149\u003c/span\u003e\u003cspan address=\"10.55041/ijsrem37149\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eKumar K, Tomar A (2023) Diabetes Prediction System Using Machine Learning, in \u003cem\u003e2023 International Conference on Advances in Computation, Communication and Information Technology (ICAICCIT)\u003c/em\u003e, IEEE, Nov. pp. 286\u0026ndash;291. Accessed: Apr. 23, 2025. [Online]. Available: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1109/icaiccit60255.2023.10466034\u003c/span\u003e\u003cspan address=\"10.1109/icaiccit60255.2023.10466034\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eAhamed BS, Arya MS, Sangeetha SKB, Auxilia Osvin NV (2022) Diabetes Mellitus Disease Prediction and Type Classification Involving Predictive Modeling Using Machine Learning Techniques and Classifiers, \u003cem\u003eApplied Computational Intelligence and Soft Computing\u003c/em\u003e, vol. pp. 1\u0026ndash;11, Dec. 2022. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1155/2022/7899364\u003c/span\u003e\u003cspan address=\"10.1155/2022/7899364\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eKuriakose SM, Basa Pati P, Singh T, Prediction of Diabetes Using Machine Learning: Analysis of 70,000 Clinical Database Patient Record, in (2022) \u003cem\u003e13th International Conference on Computing Communication and Networking\u003c/em\u003e Technologies \u003cem\u003e(ICCCNT)\u003c/em\u003e, IEEE, Oct. 2022, pp. 1\u0026ndash;5. Accessed: Apr. 23, 2025. [Online]. 
Available: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1109/icccnt54827.2022.9984264\u003c/span\u003e\u003cspan address=\"10.1109/icccnt54827.2022.9984264\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eKrishnamoorthi R et al (2022) A Novel Diabetes Healthcare Disease Prediction Framework Using Machine Learning Techniques, \u003cem\u003eJournal of Healthcare Engineering\u003c/em\u003e, vol. pp. 1\u0026ndash;10, Jan. 2022. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1155/2022/1684017\u003c/span\u003e\u003cspan address=\"10.1155/2022/1684017\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eYamuna VYV, Chaitanya DU, sri B (2022) Y., \u0026amp; T.Jagadish Diabetes Disease Prediction By Using Machine Learning Algorithms., \u003cem\u003eSemanticscholar\u003c/em\u003e, 2022\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eLyngdoh AC, Choudhury NA, Moulik S, Diabetes Disease Prediction Using Machine Learning Algorithms, in (2020) \u003cem\u003eIEEE-EMBS Conference on Biomedical Engineering and\u003c/em\u003e Sciences \u003cem\u003e(IECBES)\u003c/em\u003e, IEEE, Mar. 2021, pp. 517\u0026ndash;521. Accessed: Apr. 23, 2025. [Online]. Available: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1109/iecbes48179.2021.9398759\u003c/span\u003e\u003cspan address=\"10.1109/iecbes48179.2021.9398759\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eKaur H, Kumari V (2020) Predictive modelling and analytics for diabetes using a machine learning approach, \u003cem\u003eApplied Computing and Informatics\u003c/em\u003e, vol. 18, no. 1/2, pp. 90\u0026ndash;100, Jul. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1016/j.aci.2018.12.004\u003c/span\u003e\u003cspan address=\"10.1016/j.aci.2018.12.004\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eSharma A (2021) Prediction of Diabetes Disease Using Machine Learning Model. Semanticscholar\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eKaliappan J et al (Aug. 2024) Analyzing classification and feature selection strategies for diabetes prediction across diverse diabetes datasets. Front Artif Intell 7. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.3389/frai.2024.1421751\u003c/span\u003e\u003cspan address=\"10.3389/frai.2024.1421751\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eAbousaber I, Abdallah HF, El-Ghaish H (Jan. 2025) Robust predictive framework for diabetes classification using optimized machine learning on imbalanced datasets. Front Artif Intell 7. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.3389/frai.2024.1499530\u003c/span\u003e\u003cspan address=\"10.3389/frai.2024.1499530\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eHalder N (2024) Exploring the Pima Indians Diabetes Dataset: Advanced Data Analysis Techniques in Python, \u003cem\u003eMedium\u003c/em\u003e, Jan. 03, Accessed: Apr. 23, 2025. [Online]. 
Available: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://medium.com/@HalderNilimesh/exploring-the-pima-indians-diabetes-dataset-advanced-data-analysis-techniques-in-python-f02cba6f9f35\u003c/span\u003e\u003cspan address=\"https://medium.com/@HalderNilimesh/exploring-the-pima-indians-diabetes-dataset-advanced-data-analysis-techniques-in-python-f02cba6f9f35\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eErzurumlu AS Optimizing Healthcare Predictions with CatBoost: A Study on the Pima Indians Diabetes Dataset, \u003cem\u003eLinkedIn\u003c/em\u003e, Sep. 16, 2024. Accessed: Apr. 23, 2025. [Online]. Available: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://www.linkedin.com/pulse/optimizing-healthcare-predictions-catboost-study-pima-erzurumlu-ipuvf/\u003c/span\u003e\u003cspan address=\"https://www.linkedin.com/pulse/optimizing-healthcare-predictions-catboost-study-pima-erzurumlu-ipuvf/\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":true,"hideJournal":true,"highlight":"","institution":"Mandsaur University, Mandsaur M.P. 
India","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"Clinical Data, Diabetes Prediction, Healthcare, Logistic Regression, Machine Learning","lastPublishedDoi":"10.21203/rs.3.rs-7145782/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-7145782/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eThe rising global prevalence of diabetes has intensified the need for accurate and early diagnostic systems. As a significant global health concern, diabetes requires effective and precise prediction techniques. This study reviews research that utilizes clinical data and machine learning (ML) approaches for diabetes prediction. Common pre-processing steps include categorical data encoding, handling missing values, and normalization. To enhance model performance, dimensionality reduction techniques such as Principal Component Analysis (PCA) and feature selection are employed. Performance metrics\u0026mdash;such as accuracy, precision, recall, F1-score, and AUC-ROC\u0026mdash;are used to evaluate and compare various supervised learning algorithms, including Random Forest, Support Vector Machines (SVM), k-Nearest Neighbors (k-NN), Logistic Regression, and Decision Trees. Many studies use small datasets, which limits generalizability despite reporting high accuracy. 
This study underscores the need for diverse datasets and clinically interpretable models, while also highlighting gaps in model interpretability and validation practices.\u003c/p\u003e","manuscriptTitle":"Comparative Study of Machine Learning Techniques for Diabetes Forecasting","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2025-07-22 09:03:35","doi":"10.21203/rs.3.rs-7145782/v1","editorialEvents":[{"type":"communityComments","content":0}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"a8a4a106-0cc7-442e-9edb-8c3269b3228a","owner":[],"postedDate":"July 22nd, 2025","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"posted","subjectAreas":[{"id":51795566,"name":"Medical Genetics"}],"tags":[],"updatedAt":"2025-07-22T09:03:35+00:00","versionOfRecord":[],"versionCreatedAt":"2025-07-22 09:03:35","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-7145782","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-7145782","identity":"rs-7145782","version":["v1"]},"buildId":"8U1c8b4HqxoKbykW_rLl7","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}
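
The Methodology section names several metrics derived from the confusion matrix (accuracy, precision, recall, specificity, F-measure, and MCC) and notes that the Pima Indians dataset has 500 negative and 268 positive cases. A minimal Python sketch of how those metrics follow from the four confusion-matrix counts is given here. The function name `confusion_metrics` and the example counts are assumptions made for illustration; they are not results reported by the paper or by any study it reviews.

```python
import math


def confusion_metrics(tp, fp, fn, tn):
    """Derive common binary-classification metrics from confusion-matrix counts.

    tp/fp/fn/tn are true positives, false positives, false negatives and
    true negatives. Undefined ratios (zero denominator) are reported as 0.0.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0        # also called sensitivity
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
        "mcc": mcc,
    }


# Hypothetical counts, chosen only so that the class totals match the
# Pima class balance (268 positive, 500 negative). Not real results.
example = confusion_metrics(tp=180, fp=60, fn=88, tn=440)

# Baseline that labels every patient as non-diabetic on the same class balance.
majority_baseline = confusion_metrics(tp=0, fp=0, fn=268, tn=500)

for name, metrics in [("example", example), ("majority baseline", majority_baseline)]:
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

The majority baseline illustrates why accuracy alone can mislead on imbalanced data: a model that never predicts diabetes is still correct for 500 of 768 patients, yet its recall and MCC are zero.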
