Predicting 30-Day Hospital Readmission in Diabetic Patients: A Random Forest Approach with Cluster-Aware Bootstrap Evaluation

Research Square | Method Article

George Xu, Syed Hamzah Rizvi

This is a preprint; it has not been peer reviewed by a journal.

https://doi.org/10.21203/rs.3.rs-9657911/v1

This work is licensed under a CC BY 4.0 License.

Status: Posted (Version 1)

Abstract

Background: Thirty-day hospital readmission in diabetic patients is a clinically significant outcome associated with poor glycemic control and preventable care gaps. Machine learning models applied to electronic health record (EHR) data have shown promise for identifying high-risk patients.
A methodological issue, however, pervades published evaluations: when datasets contain multiple encounters per patient, standard bootstrap confidence intervals (CIs) applied at the row level violate the independence assumption and systematically underestimate uncertainty.

Methods: We applied a Random Forest (RF) classifier to the UCI Diabetes 130-US Hospitals dataset (101,766 encounters; 71,518 unique patients) to predict 30-day readmission. Model performance was assessed using 5-fold stratified cross-validation. We computed 95% CIs for AUROC using both an iid row-level bootstrap and a cluster-aware bootstrap that resamples at the patient level, implemented in the open-source mcguard library. A logistic regression (LR) baseline was included for comparison. Feature importance was assessed via mean decrease in impurity (MDI).

Results: The RF model achieved AUROC = 0.683 (95% cluster-aware CI: [0.671, 0.691]), AUPRC = 0.312, and F1 = 0.291 on the first-encounter analysis set (n = 68,841 after preprocessing). The iid bootstrap produced a CI of [0.678, 0.684], about 3.3x narrower (width 0.006 versus 0.020), showing how far row-level evaluation overstates the precision of performance estimates. This gap is sufficient to shift conclusions about whether the model meets discrimination thresholds for clinical deployment. Prior inpatient visits, number of medications, and number of diagnoses were the strongest predictors.

Conclusions: RF models provide modest but consistent discriminative ability for 30-day diabetes readmission. More critically, evaluation on datasets with repeated patient records requires cluster-aware uncertainty quantification; a row-level bootstrap produces CIs that are statistically invalid and artificially narrow, with practical consequences for clinical and regulatory conclusions.
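The contrast between the two bootstrap schemes described in the Methods can be illustrated with a minimal sketch. This is not the mcguard API, whose interface is not shown here; the function names `auroc`, `bootstrap_auroc_ci`, and `simulate_clustered` are hypothetical, the data are synthetic, and only NumPy is assumed. The iid scheme resamples rows, while the cluster-aware scheme resamples whole patients and carries all of each patient's rows along.

```python
import numpy as np


def auroc(y_true, y_score):
    """Tie-aware AUROC via the Mann-Whitney U statistic (average ranks)."""
    y_true = np.asarray(y_true, dtype=bool)
    y_score = np.asarray(y_score, dtype=float)
    n_pos = int(y_true.sum())
    n_neg = y_true.size - n_pos
    if n_pos == 0 or n_neg == 0:
        return float("nan")  # undefined when only one class is present
    _, inverse, counts = np.unique(y_score, return_inverse=True, return_counts=True)
    cum = np.cumsum(counts)
    avg_rank = cum - (counts - 1) / 2.0  # average 1-based rank per distinct score
    ranks = avg_rank[inverse]
    u = ranks[y_true].sum() - n_pos * (n_pos + 1) / 2.0
    return float(u / (n_pos * n_neg))


def bootstrap_auroc_ci(y_true, y_score, groups=None, n_boot=500, alpha=0.05, seed=0):
    """Percentile bootstrap CI for AUROC.

    groups=None  -> iid bootstrap: every row is its own resampling unit.
    groups=array -> cluster-aware bootstrap: resample group ids (e.g. patients)
                    with replacement and keep all rows of each drawn group.
    """
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    if groups is None:
        groups = np.arange(y_true.size)
    uniq, inv = np.unique(np.asarray(groups), return_inverse=True)
    # Pre-compute the row indices belonging to each group.
    order = np.argsort(inv, kind="stable")
    counts = np.bincount(inv, minlength=uniq.size)
    starts = np.concatenate(([0], np.cumsum(counts)[:-1]))
    members = [order[s:s + c] for s, c in zip(starts, counts)]
    stats = []
    for _ in range(n_boot):
        drawn = rng.integers(0, uniq.size, size=uniq.size)
        idx = np.concatenate([members[g] for g in drawn])
        stats.append(auroc(y_true[idx], y_score[idx]))
    lo, hi = np.nanpercentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(lo), float(hi)


def simulate_clustered(n_patients=200, n_enc=6, seed=0):
    """Synthetic data (an assumption for illustration): rows within a patient
    share a label and nearly share a score, so they are strongly dependent."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=n_patients)                   # patient-level latent risk
    y_patient = z + rng.normal(size=n_patients) > 0   # one label per patient
    groups = np.repeat(np.arange(n_patients), n_enc)
    y = np.repeat(y_patient, n_enc).astype(int)
    score = np.repeat(z, n_enc) + 0.1 * rng.normal(size=groups.size)
    return y, score, groups


if __name__ == "__main__":
    y, s, g = simulate_clustered(seed=1)
    iid_ci = bootstrap_auroc_ci(y, s, groups=None, seed=2)
    cluster_ci = bootstrap_auroc_ci(y, s, groups=g, seed=2)
    print("iid CI:          ", iid_ci)
    print("cluster-aware CI:", cluster_ci)
```

On data with this kind of within-patient dependence, the cluster-aware interval should come out wider than the iid one, because the effective sample size is closer to the number of patients than to the number of rows. The size of the gap depends on how strongly rows within a patient are correlated, so the synthetic ratio here says nothing about the 3.3x figure reported above.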
Subject area: Computational Biology

Keywords: diabetes, hospital readmission, random forest, machine learning, bootstrap, cluster-aware evaluation, electronic health records

Additional Declarations: The authors declare no competing interests.

Author affiliations: George Xu (corresponding author), Inglemoor High School; Syed Hamzah Rizvi, West Lafayette Jr/Sr. High School.

Posted: May 12th, 2026