Comparative evaluation of ChatGPT-4o and ChatGPT-5.1 for automated LUTS symptom scoring from clinician-documented narratives

doi:10.21203/rs.3.rs-8939728/v1

2026 · doi:10.21203/rs.3.rs-8939728/v1

preprint OA: closed

Research Article

Jie Hyeon Lee, Jung Hoon Lee, Sang Jun Yoo, Min Chul Cho, Hyeon Jeong, Hwan Cheol Son, Ho Young Bae

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-8939728/v1
This work is licensed under a CC BY 4.0 License.
Status: Under Review (Version 1)

Abstract

Background
Lower urinary tract symptoms (LUTS) are commonly quantified using validated instruments such as the International Prostate Symptom Score (IPSS) and the Overactive Bladder Symptom Score (OABSS). However, structured questionnaires are not consistently administered in routine outpatient practice, where clinicians frequently rely on narrative documentation.
While generative large language models (LLMs) have demonstrated potential in clinical natural language processing, their performance stability across successive model generations for automated symptom scoring remains unclear. We conducted a head-to-head comparison of ChatGPT-4o and ChatGPT-5.1 to evaluate generational performance differences in automated LUTS scoring.

Materials and Methods
This retrospective single-center study included 91 patients presenting with LUTS or overactive bladder (OAB)-related complaints between April and June 2025. Free-text symptom narratives documented in electronic medical records were directly input into each model using identical prompts. Ground-truth IPSS and OABSS total scores were obtained from completed questionnaires. Predictive performance was assessed using mean absolute error (MAE), root mean squared error (RMSE), intraclass correlation coefficient (ICC), and Spearman’s correlation. Classification of clinically significant LUTS (IPSS ≥ 8) and OAB (OABSS ≥ 3 with urgency score ≥ 2) was evaluated using accuracy, Cohen’s kappa (κ), and area under the receiver operating characteristic curve (AUC).

Results
Among 91 patients (mean age 71.9 ± 10.3 years; 58.2% male), 49 completed IPSS and 78 completed OABSS. For IPSS prediction, MAE/RMSE were 4.88/6.34 for ChatGPT-4o and 4.92/7.01 for ChatGPT-5.1, with no significant differences between models. For OABSS prediction, MAE/RMSE were 1.88/2.76 and 2.03/2.93, respectively, again without significant differences. Agreement was moderate for IPSS (ICC 0.561–0.509) and higher for OABSS (ICC 0.704–0.658). Classification accuracy exceeded 0.85 for both outcomes, and ROC analysis demonstrated comparable discriminative performance across model versions.

Conclusions
ChatGPT-4o and ChatGPT-5.1 demonstrated comparable and clinically acceptable performance in approximating LUTS-related symptom scores from unstructured clinical narratives.
These findings suggest that generative AI models may serve as supportive tools for automated symptom quantification and indicate relative performance stability across model generations within real-world urological practice.

Keywords: Artificial intelligence; Large language model; ChatGPT; Lower urinary tract symptom; International Prostate Symptom Score; Overactive Bladder Symptom Score; Natural language processing

Figures: Figure 1

Background

LUTS encompass a heterogeneous group of conditions involving bladder storage, voiding, and post-micturition phases and are highly prevalent in both men and women, with increasing frequency in older populations [1]. Large epidemiological studies have shown that a substantial proportion of older people experience one or more LUTS, often accompanied by significant symptom bother and impaired health-related quality of life (QoL) [2, 3]. Validated symptom scores, including the IPSS and the OABSS, are widely used to quantify LUTS severity and guide clinical decision-making [4, 5]. These instruments enable standardized symptom assessment and facilitate longitudinal monitoring and comparative research. However, in routine outpatient practice, structured questionnaires are not always administered, and clinicians frequently rely on free-text symptom narratives to infer symptom severity. This process is time-consuming and subject to inter-rater variability.

Recent advances in natural language processing (NLP) and LLMs have enabled the extraction of structured clinical information from unstructured narrative text. Transformer-based language models have demonstrated strong performance across a range of clinical NLP tasks, including concept extraction, document classification, and clinical summarization [6, 7]. Nevertheless, most prior clinical NLP applications have focused on sub-task-level information extraction rather than direct translation of unstructured patient narratives into validated clinical symptom scores [8].

In prior work, we demonstrated that a generative AI model could approximate LUTS-related symptom scores from clinician-documented narratives with moderate accuracy, supporting the feasibility of narrative-based automated symptom scoring in real-world outpatient settings [9]. However, the rapid evolution of generative AI architectures raises the question of whether newer model generations offer meaningful performance improvements for this task. Systematic comparisons across model generations using identical clinical data are needed to clarify the impact of architectural and training advances on clinical performance. To our knowledge, no prior study has directly evaluated the stability of automated symptom score prediction across successive generations of large language models using an identical real-world clinical cohort.

In this follow-up study, we extended our previous analysis by directly comparing two generative AI models (ChatGPT-4o and ChatGPT-5.1) using the same outpatient clinical dataset. By holding the data source constant, this design allows an isolated evaluation of model-related performance differences. We assessed each model’s ability to predict IPSS and OABSS from patient-reported symptom narratives and evaluated performance against ground-truth questionnaire scores using regression, agreement, and classification metrics. Through this head-to-head comparison, we aimed to determine whether newer-generation LLMs provide incremental benefit for automated LUTS symptom scoring and to further define their potential role in clinical workflows.

Materials and Methods

Study Design and Setting

This retrospective comparative study was conducted as a secondary analysis of a previously reported cohort to evaluate whether generative AI model version updates translate into differences in automated LUTS scoring performance.
The primary objective was to directly compare two LLM versions in their ability to predict standardized symptom scores from physician-documented patient narratives.

Clinical data were obtained from Seoul National University Boramae Medical Center, a single secondary care institution. Patients who presented to the urology outpatient clinic with LUTS or OAB-related complaints between April 2025 and June 2025 were eligible. A total of 91 patients contributed narrative symptom descriptions documented during routine clinical encounters; among them, 49 completed the IPSS questionnaire and 78 completed the OABSS questionnaire. Patients with incomplete questionnaires were excluded from the corresponding analyses. The study population comprised 53 men and 38 women.

AI-based inference was performed separately for each model version using identical narrative inputs and analytic procedures. Predictions using ChatGPT-4o were generated between June and July 2025, whereas predictions using ChatGPT-5.1 were obtained between November and December 2025. All model outputs were generated retrospectively and were not available to clinicians at the time of care. The application of AI models did not influence clinical decision-making or patient management.

Data Collection

Free-text symptom narratives were retrospectively extracted from electronic medical records (EMRs). During routine outpatient encounters, attending physicians documented patients’ symptom descriptions in Korean as part of standard clinical practice, aiming to preserve patients’ spoken language as accurately as possible. For this study, narratives from eligible patients were retrieved without modification and processed in their original Korean form.

Standardized symptom scores were also collected retrospectively from the EMR. Patients completed the Korean-language versions of the IPSS and the OABSS questionnaires immediately after their outpatient visit, and the results were recorded in the medical record.
For each patient, the narrative and corresponding questionnaire scores were matched by visit date to ensure temporal alignment.

The same prompt structure and zero-shot inference approach were applied to both model versions. Raw narrative text was entered without editing, translation, or filtering. No task-specific fine-tuning was performed. All narratives were manually reviewed to ensure removal of personally identifiable information prior to model input, in accordance with institutional de-identification policies.

Statistical Analysis

Descriptive statistics were used to summarize patient characteristics, including age, sex distribution, and the proportions completing IPSS and OABSS. Continuous variables were reported as means with standard deviations, and categorical variables were summarized as frequencies and percentages.

Prediction accuracy for each model was evaluated using MAE and RMSE. Paired t-tests were applied to compare performance between models. Agreement with ground-truth scores was assessed using ICC and Bland–Altman plots. Spearman’s correlation coefficients were calculated to evaluate monotonic associations between predicted and true scores.

For classification analyses, clinically significant LUTS (IPSS ≥ 8) and OAB (OABSS ≥ 3 with urgency ≥ 2) were used as binary outcome thresholds. Model-derived classifications were compared with ground truth using confusion matrices, accuracy, Cohen’s kappa, and McNemar’s test. Discriminative performance was evaluated using receiver operating characteristic (ROC) curves and the AUC, with comparisons performed using DeLong’s test. Item-level agreement for individual questionnaire components was assessed using weighted kappa coefficients.

Statistical significance was defined as two-sided p < 0.05. All analyses were performed using R version 4.4.3.
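The error and classification metrics described above can be illustrated with a minimal Python sketch. This is not the authors' code (the study reports using R version 4.4.3); the function names and the toy scores are hypothetical and serve only as an illustration. The thresholds (IPSS ≥ 8; OABSS ≥ 3 with urgency ≥ 2) are taken from the text.

```python
import math


def mae(pred, truth):
    """Mean absolute error between predicted and ground-truth total scores."""
    return sum(abs(p - t) for p, t in zip(pred, truth)) / len(truth)


def rmse(pred, truth):
    """Root mean squared error between predicted and ground-truth total scores."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth))


def luts_positive(ipss_total):
    """Clinically significant LUTS as defined in the text: IPSS >= 8."""
    return ipss_total >= 8


def oab_positive(oabss_total, urgency_score):
    """OAB as defined in the text: OABSS >= 3 with urgency item score >= 2."""
    return oabss_total >= 3 and urgency_score >= 2


def accuracy(pred_labels, true_labels):
    """Proportion of cases where the predicted and true binary labels match."""
    hits = sum(p == t for p, t in zip(pred_labels, true_labels))
    return hits / len(true_labels)


def cohen_kappa(pred_labels, true_labels):
    """Cohen's kappa for two binary label vectors (chance-corrected agreement)."""
    n = len(true_labels)
    p_observed = accuracy(pred_labels, true_labels)
    p_pred = sum(pred_labels) / n   # share of positives among predictions
    p_true = sum(true_labels) / n   # share of positives in ground truth
    p_expected = p_pred * p_true + (1 - p_pred) * (1 - p_true)
    if p_expected == 1:
        raise ValueError("kappa is undefined when expected agreement is 1")
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical toy data (NOT study data): questionnaire vs. model IPSS totals.
toy_truth_ipss = [5, 12, 20, 7, 15]
toy_pred_ipss = [7, 10, 24, 9, 15]

toy_true_labels = [luts_positive(s) for s in toy_truth_ipss]
toy_pred_labels = [luts_positive(s) for s in toy_pred_ipss]

if __name__ == "__main__":
    print("MAE:", mae(toy_pred_ipss, toy_truth_ipss))
    print("RMSE:", rmse(toy_pred_ipss, toy_truth_ipss))
    print("Accuracy:", accuracy(toy_pred_labels, toy_true_labels))
    print("Kappa:", cohen_kappa(toy_pred_labels, toy_true_labels))
```

Because RMSE squares each error before averaging, a few large misses raise it more than they raise MAE, which is one reason the two metrics are reported together.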
Ethical Considerations

This study was conducted in accordance with the principles of the Declaration of Helsinki and was approved by the Institutional Review Board of Seoul National University Boramae Medical Center (IRB No. 30-2025-38). Given the retrospective nature of the study and the use of de-identified clinical data, the requirement for informed consent was waived by the IRB. All patient symptom narratives and questionnaire responses were extracted only after removal of personally identifiable information, and no data used in this study were linked to patient identities. The AI models were applied solely for the retrospective analysis and did not influence clinical decision-making or patient care.

Results

Study Population

A total of 91 patients were included in the analysis, comprising 53 men (58.2%) and 38 women (41.8%), with a mean age of 71.9 ± 10.3 years. Among them, 49 patients completed the IPSS questionnaire and 78 completed the OABSS. The mean ground-truth total score was 15.6 ± 6.7 for IPSS (range, 3–32) and 6.9 ± 3.6 for OABSS (range, 0–14). Based on standard thresholds, 57.1% of IPSS respondents were classified as having moderate symptoms and 28.6% as severe symptoms, while 64.1% of OABSS respondents met criteria for OAB. Detailed baseline characteristics are presented in Table 1.

Table 1. Baseline characteristics of study population
Demographic features and questionnaire completion rates are presented for all included patients (N = 91). Distributions of IPSS and OABSS total scores, including severity categories and OAB classification status, are shown for patients who completed each questionnaire.

| Variables | Value (N = 91) |
|---|---|
| Age (years) | 71.9 ± 10.3 |
| Sex: Male | 53 (58.2%) |
| Sex: Female | 38 (41.8%) |
| Questionnaire completion: Completed IPSS | 49 (53.8%) |
| Questionnaire completion: Completed OABSS | 78 (85.7%) |
| IPSS (n = 49): Mean ± SD | 15.6 ± 6.7 |
| IPSS (n = 49): Range | 3–32 |
| IPSS (n = 49): Mild (0–7) | 7 (14.3%) |
| IPSS (n = 49): Moderate (8–19) | 28 (57.1%) |
| IPSS (n = 49): Severe (20–35) | 14 (28.6%) |
| OABSS (n = 78): Mean ± SD | 6.9 ± 3.6 |
| OABSS (n = 78): Range | 0–14 |
| OABSS (n = 78): OAB positive | 50 (64.1%) |
| OABSS (n = 78): OAB negative | 28 (35.9%) |

Continuous Score Prediction Performance

For IPSS total score prediction, ChatGPT-4o achieved an MAE of 4.88 and RMSE of 6.34, whereas ChatGPT-5.1 yielded an MAE of 4.92 and RMSE of 7.01. There were no statistically significant differences between models for either MAE (p = 0.952) or RMSE (p = 0.485). For OABSS prediction, ChatGPT-4o demonstrated an MAE of 1.88 and RMSE of 2.76, compared with 2.03 and 2.93 for ChatGPT-5.1, respectively. Between-model differences again did not reach statistical significance (MAE: p = 0.430; RMSE: p = 0.303). Detailed performance metrics are summarized in Table 2.

Table 2. Comparative performance of ChatGPT-4o and ChatGPT-5.1 for prediction of IPSS and OABSS total scores
Performance metrics for continuous score prediction are shown for each model. Accuracy was assessed using MAE and RMSE, agreement using ICC (two-way mixed-effects, absolute agreement), and monotonic association using Spearman’s correlation coefficient. Between-model comparisons for MAE and RMSE were performed using paired t-tests.

| Score | Model | MAE | RMSE | ICC | Spearman’s ρ |
|---|---|---|---|---|---|
| IPSS | ChatGPT-4o | 4.88 | 6.34 | 0.561 | 0.586 |
| IPSS | ChatGPT-5.1 | 4.92 | 7.01 | 0.509 | 0.593 |
| IPSS | p-value | 0.952 | 0.485 | - | - |
| OABSS | ChatGPT-4o | 1.88 | 2.76 | 0.704 | 0.711 |
| OABSS | ChatGPT-5.1 | 2.03 | 2.93 | 0.658 | 0.660 |
| OABSS | p-value | 0.430 | 0.303 | - | - |

Agreement and Correlation Analysis

Intraclass correlation coefficients indicated moderate agreement between AI-generated and ground-truth scores for IPSS (ICC: 0.561 for ChatGPT-4o and 0.509 for ChatGPT-5.1; both p < 0.001). Agreement was higher for OABSS, with ICC values of 0.704 for ChatGPT-4o and 0.658 for ChatGPT-5.1 (both p < 0.001).
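As a hedged illustration of the agreement metric, the sketch below computes a single-measure, absolute-agreement ICC from two-way ANOVA mean squares, often written ICC(A,1). The paper specifies a two-way mixed-effects, absolute-agreement ICC but does not state whether single or average measures were used, so the single-measure form is an assumption; the function name and toy data are hypothetical and do not come from the study.

```python
def icc_absolute_agreement(ratings):
    """Single-measure, absolute-agreement ICC, i.e. ICC(A,1).

    `ratings` is a list of rows, one per patient; each row holds the k scores
    for that patient (here k = 2: questionnaire total and model-predicted total).
    """
    n = len(ratings)        # number of subjects
    k = len(ratings[0])     # number of "raters" (score sources)
    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares (one observation per cell).
    ss_rows = k * sum((m - grand_mean) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand_mean) ** 2 for m in col_means)
    ss_total = sum((x - grand_mean) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    # Absolute agreement penalizes systematic offsets through the ms_cols term.
    denominator = ms_rows + (k - 1) * ms_error + (k / n) * (ms_cols - ms_error)
    return (ms_rows - ms_error) / denominator


# Hypothetical toy data (NOT study data): [questionnaire score, model score].
identical_scores = [[1, 1], [2, 2], [3, 3]]
offset_scores = [[1, 3], [2, 4], [3, 5]]  # model always 2 points higher

if __name__ == "__main__":
    print(icc_absolute_agreement(identical_scores))
    print(icc_absolute_agreement(offset_scores))
```

In the second toy case the two score sources rank patients identically, yet the constant two-point offset lowers the absolute-agreement ICC well below 1. A rank-based measure such as Spearman’s correlation would not register that offset, which is why the two statistics can diverge.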
Spearman’s correlation analyses demonstrated significant positive associations between predicted and true scores. For IPSS, correlation coefficients were 0.586 for ChatGPT-4o and 0.593 for ChatGPT-5.1 (both p < 0.001). For OABSS, correlations were stronger, at 0.711 and 0.660, respectively (both p < 0.001). Bland–Altman analyses showed no evidence of systematic bias across either scoring system or model (Supplementary Figure S1).

Classification Performance

When classifying moderate-to-severe LUTS (IPSS ≥ 8), ChatGPT-4o achieved an accuracy of 0.912 and ChatGPT-5.1 an accuracy of 0.923. Cohen’s kappa values were 0.823 and 0.846, respectively (both p < 0.001), with no significant between-model difference on McNemar’s test (p = 1.00). For OAB classification (OABSS ≥ 3 with urgency ≥ 2), accuracy was 0.879 for ChatGPT-4o and 0.857 for ChatGPT-5.1. Corresponding kappa values were 0.755 and 0.708 (both p < 0.001), and McNemar’s test indicated no significant difference between models (p = 0.752). Classification metrics are presented in Table 3.

Table 3. Classification performance of ChatGPT-4o and ChatGPT-5.1 for clinically significant LUTS and OAB
Classification performance metrics for moderate-to-severe LUTS (IPSS ≥ 8) and OAB (OABSS ≥ 3 with urgency ≥ 2). Accuracy, Cohen’s kappa (κ), and area under the receiver operating characteristic curve (AUC) are shown for each model. Between-model comparisons were performed using McNemar’s test for accuracy and DeLong’s test for AUC.

| Outcome | Model | Accuracy | Cohen’s κ | AUC |
|---|---|---|---|---|
| LUTS | ChatGPT-4o | 0.912 | 0.823 | 0.744 |
| LUTS | ChatGPT-5.1 | 0.923 | 0.846 | 0.747 |
| LUTS | p-value | 1.00 † | - | 0.957 # |
| OAB | ChatGPT-4o | 0.879 | 0.755 | 0.880 |
| OAB | ChatGPT-5.1 | 0.857 | 0.708 | 0.823 |
| OAB | p-value | 0.752 † | - | 0.068 # |

† McNemar test; # DeLong test

Receiver operating characteristic analysis demonstrated comparable discriminative performance. For LUTS classification, AUC values were 0.744 for ChatGPT-4o and 0.747 for ChatGPT-5.1 (DeLong p = 0.957).
For OAB classification, AUC values were 0.880 and 0.823, respectively (DeLong p = 0.068). ROC curves are shown in Fig. 1.

Item-Level Agreement

Weighted kappa analyses revealed variable item-level agreement across IPSS components. For ChatGPT-4o, kappa values ranged from 0.21 to 0.51, whereas for ChatGPT-5.1, values ranged from 0.14 to 0.69. Several items, particularly Q3, Q5, and Q7, demonstrated relatively higher agreement with ChatGPT-5.1. For OABSS items, both models showed moderate to substantial agreement. Weighted kappa values ranged from 0.41 to 0.73 for ChatGPT-4o and from 0.40 to 0.61 for ChatGPT-5.1, with the highest agreement consistently observed for urgency and daytime frequency items (Q3 and Q4). Detailed item-level results are provided in Supplementary Table S1.

Discussion

This study evaluated the feasibility of automated LUTS scoring derived from clinician-documented patient narratives using two generative AI models, ChatGPT-4o and ChatGPT-5.1. Both models demonstrated moderate predictive accuracy for IPSS and OABSS total scores, with comparable MAE and RMSE values and no statistically significant between-model differences. Agreement metrics indicated moderate concordance with ground-truth scores, while binary classification of clinically significant LUTS and OAB achieved consistently high accuracy across models. Taken together, these findings suggest that contemporary LLMs demonstrate moderate yet consistent performance in translating unstructured clinical narratives into standardized symptom scores, supporting their potential role as adjunctive tools in urological practice.

Validated questionnaires such as the IPSS and OABSS remain the reference standard for LUTS assessment. However, real-world outpatient practice often relies on narrative documentation rather than structured instruments [5, 10]. The capacity of LLMs to operationalize narrative text into structured scores may therefore offer pragmatic value, especially when questionnaires are unavailable. Most prior clinical NLP research has focused on entity recognition, concept extraction, or classification tasks rather than direct quantitative score generation [11, 12]. Even large clinical language models such as GatorTron primarily demonstrated improvements on benchmark NLP tasks rather than structured symptom score generation [13]. In this context, the present study contributes evidence that generative LLMs can approximate validated symptom scores directly from routine documentation.

Importantly, the narratives analyzed in this study were not verbatim patient-generated text but clinician-documented summaries of patient-reported symptoms. Although physicians aim to accurately capture patients’ words, documentation inevitably reflects selective emphasis, paraphrasing, and clinical interpretation. Prior work has demonstrated variability in clinical note composition across providers and institutions, which can introduce systematic documentation bias [14]. Thus, AI-based score prediction from clinician-authored narratives is inherently dependent on the completeness and framing of documentation. The moderate ICC values observed here may therefore reflect not only model limitations but also upstream variability in symptom capture.

The stronger performance observed for OABSS compared with IPSS is noteworthy. OABSS items, particularly urgency and frequency, tend to correspond more directly to explicit linguistic expressions in clinical notes. In contrast, IPSS encompasses a broader range of storage and voiding symptoms, including intermittency and straining, which may be described less explicitly or inferred contextually. Linguistic variability in symptom reporting has been previously recognized as a challenge for automated extraction systems [15]. These differences may partially explain the higher ICC and Spearman correlations observed for OABSS.

At the item level, notable differences in weighted kappa values between ChatGPT-4o and ChatGPT-5.1 were observed for specific IPSS components, particularly Q3 (intermittency), Q6 (straining), and Q7 (nocturia). These items require nuanced interpretation of temporal patterns, effort, or symptom frequency that may not always be explicitly quantified in narrative documentation. The higher variability between models in these domains may reflect differences in contextual inference strategies or threshold interpretation when converting qualitative descriptions into ordinal response categories. Such discrepancies highlight that improvements in overall model architecture do not necessarily translate uniformly across all symptom dimensions.

From a clinical standpoint, automated symptom scoring from narrative text may offer several practical applications. It could reduce clinician workload associated with manual score calculation, support retrospective research using existing EMR data, and enhance telemedicine workflows where structured questionnaires are not systematically administered. However, the moderate agreement levels observed indicate that LLM-derived scores should currently be considered supportive tools rather than substitutes for validated questionnaires, particularly in contexts requiring precise longitudinal monitoring, such as therapeutic response assessment or clinical trials.

This study has several strengths, including use of real-world clinician-authored narratives, comparison of two generative AI models, and comprehensive evaluation across accuracy, agreement, correlation, and classification metrics. Nonetheless, limitations should be acknowledged. The single-center design may limit generalizability. Neither model was fine-tuned on urology-specific corpora, and domain adaptation has been shown to substantially improve clinical NLP performance [16]. Additionally, documentation style and completeness vary across providers, potentially influencing model performance. Future research should include multi-center validation, evaluation using patient-generated narratives, and exploration of domain-specific fine-tuning strategies. Prospective integration within EHR systems will be essential to determine real-world clinical impact.

Conclusions

In conclusion, ChatGPT-4o and ChatGPT-5.1 were able to approximate LUTS-related symptom scores from unstructured clinical narratives with moderate agreement and acceptable predictive accuracy. These findings provide preliminary evidence for the feasibility of generative AI–based automated symptom scoring and suggest potential roles as supportive tools within real-world urology workflows.

Abbreviations

AUC, Area under curve; ICC, Intraclass correlation coefficient; IPSS, International Prostate Symptom Score; LLM, Large language model; LUTS, Lower urinary tract symptoms; MAE, Mean absolute error; NLP, Natural language processing; OAB, Overactive bladder; OABSS, Overactive Bladder Symptom Score; QoL, Quality of life; RMSE, Root mean squared error; ROC, Receiver operating characteristic

Declarations

Ethics approval and consent to participate
This study was conducted in accordance with the Declaration of Helsinki. Ethical approval was obtained from the Institutional Review Board of Seoul National University Boramae Medical Center (IRB No. 30-2025-38). The requirement for informed consent was waived by the IRB because this was a retrospective medical record review that involved no direct patient contact, posed minimal risk, and used fully de-identified data.

Consent for publication
Not applicable.

Clinical trial number
Not applicable.

Availability of data and materials
The datasets generated and/or analyzed during the current study are not publicly available due to institutional data protection regulations but are available from the corresponding author on reasonable request.
Competing interests
The authors declare that they have no competing interests.

Funding
This research received no external funding.

Authors’ contributions
Research conception and design: Jiehyeon Lee and Hoyoung Bae
Data acquisition: Jiehyeon Lee
Data analysis and interpretation: Jiehyeon Lee and Hoyoung Bae
Statistical analysis: Jiehyeon Lee and Sangjun Yoo
Drafting of the manuscript: Jiehyeon Lee and Min Chul Cho
Critical revision of the manuscript: Min Chul Cho and Hwancheol Son
Administrative, technical, or material support: Jung Hoon Lee
Supervision: Hyeon Jeong
Approval of the final manuscript: all authors

Acknowledgements
The authors thank the clinical staff of the Department of Urology at Seoul National University Boramae Medical Center for their support in data acquisition. No external writing or editorial assistance was used.

References

1. Irwin DE, Milsom I, Hunskaar S, Reilly K, Kopp Z, Herschorn S, et al. Population-based survey of urinary incontinence, overactive bladder, and other lower urinary tract symptoms in five countries: results of the EPIC study. Eur Urol. 2006;50(6):1306–15.
2. Przydacz M, Gasowski J, Grodzicki T, Chlosta P. Lower urinary tract symptoms and overactive bladder in a large cohort of older Poles: a representative tele-survey. J Clin Med. 2023;12(8):2859.
3. Lee YS, Lee KS, Jung JH, Han DH, Oh SJ, Seo JT, et al. Prevalence of overactive bladder, urinary incontinence, and lower urinary tract symptoms: results of Korean EPIC study. World J Urol. 2011;29(2):185–90.
4. Yao MW, Green JS. How international is the International Prostate Symptom Score? A literature review of validated translations of the IPSS, the most widely used self-administered patient questionnaire for male lower urinary tract symptoms. LUTS. 2022;14(2):92–101.
5. Homma Y, Yoshida M, Seki N, Yokoyama O, Kakizaki H, Gotoh M, et al. Symptom assessment tool for overactive bladder syndrome: Overactive Bladder Symptom Score (OABSS). Urology. 2006;68(2):318–23.
6. Huang K, Altosaar J, Ranganath R. ClinicalBERT: modeling clinical notes and predicting hospital readmission. arXiv [Preprint]. 2019; arXiv:1904.05342.
7. Nazi ZA, Peng W. Large language models in healthcare and medical domain: a review. Informatics. 2024;11(3):57.
8. Meystre SM, Savova GK, Kipper-Schuler KC, Hurdle JF. Extracting information from textual documents in the electronic health record: a review of recent research. Yearb Med Inf. 2008;17(1):128–44.
9. Bae H, Lee GM, Lee J, et al. Estimation of IPSS and OABSS scores using ChatGPT-4o: a comparative validation study in Korea. BMC Urol. 2026. doi:10.1186/s12894-026-02054-z.
10. Kung TH, Cheatham M, Medenilla A, Sillos C, De Leon L, Elepaño C, et al. Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models. PLOS Digit Health. 2023;2(2):e0000198.
11. Singhal K, Azizi S, Tu T, Mahdavi SS, Wei J, Chung HW, et al. Large language models encode clinical knowledge. Nature. 2023;620(7972):172–80.
12. Barry MJ, Fowler FJ Jr, O'Leary MP, Bruskewitz RC, Holtgrewe HL, Mebust WK, et al. The American Urological Association symptom index for benign prostatic hyperplasia. J Urol. 1992;148(5):1549–57.
13. Yang X, Chen A, PourNejatian N, Shin HC, Smith KE, Parisien C, et al. A large language model for electronic health records. NPJ Digit Med. 2022;5(1):194.
14. Cohen GR, Friedman CP, Ryan AM, Richardson CR, Adler-Milstein J. Variation in physicians’ electronic health record documentation and potential patient harm from that variation. J Gen Intern Med. 2019;34(11):2355–67.
15. Meystre S, Haug PJ. Natural language processing to extract medical problems from electronic clinical documents: performance evaluation. J Biomed Inf. 2006;39(6):589–99.
16. Alsentzer E, Murphy J, Boag W, Weng WH, Jindi D, Naumann T, et al. Publicly available clinical BERT embeddings. Proc 2nd Clin Nat Lang Process Workshop. 2019:72–8.

Additional Declarations
No competing interests reported.
Supplementary Files
21.Supplementaryver1.126.02.21.docx

Status: Under Review (Version 1)
Reviewers agreed at journal: 13 May, 2026
Reviews received at journal: 15 Apr, 2026
Reviewers agreed at journal: 08 Apr, 2026
Reviewers invited by journal: 08 Apr, 2026
Editor invited by journal: 26 Feb, 2026
Editor assigned by journal: 26 Feb, 2026
Submission checks completed at journal: 26 Feb, 2026
First submitted to journal: 22 Feb, 2026
Figure 1. ROC curves comparing ChatGPT-4o and ChatGPT-5.1 for classification of clinically significant LUTS and OAB
Receiver operating characteristic (ROC) curves illustrating the classification performance of ChatGPT-4o and ChatGPT-5.1 for (A) moderate-to-severe LUTS (IPSS ≥ 8) and (B) OAB (OABSS ≥ 3 with urgency score ≥ 2). Area under the curve (AUC) values are presented for each model. Between-model differences were evaluated using DeLong’s test.
health-related quality of life (QoL).\u003csup\u003e\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e,\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e\u003c/sup\u003e\u003c/p\u003e \u003cp\u003eValidated symptom scores, including the IPSS and the OABSS, are widely used to quantify LUTS severity and guide clinical decision-making.\u003csup\u003e\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e,\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e\u003c/sup\u003e These instruments enable standardized symptom assessment and facilitate longitudinal monitoring and comparative research. However, in routine outpatient practice, structured questionnaires are not always administered, and clinicians frequently rely on free-text symptom narratives to infer symptom severity. This process is time-consuming and subject to inter-rater variability.\u003c/p\u003e \u003cp\u003eRecent advances in natural language processing (NLP) and LLMs have enabled the extraction of structured clinical information from unstructured narrative text. 
Transformer-based language models have demonstrated strong performance across a range of clinical NLP tasks, including concept extraction, document classification, and clinical summarization.\u003csup\u003e\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e,\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e\u003c/sup\u003e Nevertheless, most prior clinical NLP applications have focused on sub-task-level information extraction rather than direct translation of unstructured patient narratives into validated clinical symptom scores.\u003csup\u003e\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e\u003c/sup\u003e\u003c/p\u003e \u003cp\u003eIn prior work, we demonstrated that a generative AI model could approximate LUTS-related symptom scores from clinician-documented narratives with moderate accuracy, supporting the feasibility of narrative-based automated symptom scoring in real-world outpatient settings.\u003csup\u003e\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e\u003c/sup\u003e However, the rapid evolution of generative AI architectures raises the question of whether newer model generations offer meaningful performance improvements for this task. Systematic comparisons across model generations using identical clinical data are needed to clarify the impact of architectural and training advances on clinical performance. To our knowledge, no prior study has directly evaluated the stability of automated symptom score prediction across successive generations of large language models using an identical real-world clinical cohort.\u003c/p\u003e \u003cp\u003eIn this follow-up study, we extended our previous analysis by directly comparing two generative AI models, ChatGPT-4o and ChatGPT-5.1, using the same outpatient clinical dataset. By holding the data source constant, this design allows an isolated evaluation of model-related performance differences. 
We assessed each model\u0026rsquo;s ability to predict IPSS and OABSS from patient-reported symptom narratives and evaluated performance against ground-truth questionnaire scores using regression, agreement, and classification metrics. Through this head-to-head comparison, we aimed to determine whether newer-generation LLMs provide incremental benefit for automated LUTS symptom scoring and to further define their potential role in clinical workflows.\u003c/p\u003e"},{"header":"Materials and Methods","content":"\u003cdiv id=\"Sec3\" class=\"Section2\"\u003e \u003ch2\u003eStudy Design and Setting\u003c/h2\u003e \u003cp\u003eThis retrospective comparative study was conducted as a secondary analysis of a previously reported cohort to evaluate whether generative AI model version updates translate into differences in automated LUTS scoring performance. The primary objective was to directly compare two LLM versions in their ability to predict standardized symptom scores from physician-documented patient narratives.\u003c/p\u003e \u003cp\u003eClinical data were obtained from Seoul National University Boramae Medical Center, a single secondary care institution. Patients who presented to the urology outpatient clinic with LUTS- or OAB-related complaints between April 2025 and June 2025 were eligible. A total of 91 patients contributed narrative symptom descriptions documented during routine clinical encounters; among them, 49 completed the IPSS questionnaire and 78 completed the OABSS questionnaire. Patients with incomplete questionnaires were excluded from the corresponding analyses. The study population comprised 53 men and 38 women.\u003c/p\u003e \u003cp\u003eAI-based inference was performed separately for each model version using identical narrative inputs and analytic procedures. Predictions using ChatGPT-4o were generated between June and July 2025, whereas predictions using ChatGPT-5.1 were obtained between November and December 2025. 
All model outputs were generated retrospectively and were not available to clinicians at the time of care. The application of AI models did not influence clinical decision-making or patient management.\u003c/p\u003e \u003c/div\u003e\n\u003ch3\u003eData Collection\u003c/h3\u003e\n\u003cp\u003eFree-text symptom narratives were retrospectively extracted from electronic medical records (EMRs). During routine outpatient encounters, attending physicians documented patients\u0026rsquo; symptom descriptions in Korean as part of standard clinical practice, aiming to preserve patients\u0026rsquo; spoken language as accurately as possible. For this study, narratives from eligible patients were retrieved without modification and processed in their original Korean form.\u003c/p\u003e \u003cp\u003eStandardized symptom scores were also collected retrospectively from the EMR. Patients completed the Korean-language versions of the IPSS and the OABSS questionnaires immediately after their outpatient visit, and the results were recorded in the medical record. For each patient, the narrative and corresponding questionnaire scores were matched by visit date to ensure temporal alignment.\u003c/p\u003e \u003cp\u003eThe same prompt structure and zero-shot inference approach were applied to both model versions. Raw narrative text was entered without editing, translation, or filtering. No task-specific fine-tuning was performed. All narratives were manually reviewed to ensure removal of personally identifiable information prior to model input, in accordance with institutional de-identification policies.\u003c/p\u003e \u003cdiv id=\"Sec5\" class=\"Section2\"\u003e \u003ch2\u003eStatistical Analysis\u003c/h2\u003e \u003cp\u003eDescriptive statistics were used to summarize patient characteristics, including age, sex distribution, and the proportions completing IPSS and OABSS. 
Continuous variables were reported as means with standard deviations, and categorical variables were summarized as frequencies and percentages.\u003c/p\u003e \u003cp\u003ePrediction accuracy for each model was evaluated using MAE and RMSE. Paired t-tests were applied to compare MAE and RMSE between models. Agreement with ground-truth scores was assessed using ICC and Bland\u0026ndash;Altman plots. Spearman\u0026rsquo;s correlation coefficients were calculated to evaluate monotonic associations between predicted and true scores.\u003c/p\u003e \u003cp\u003eFor classification analyses, clinically significant LUTS (IPSS\u0026thinsp;\u0026ge;\u0026thinsp;8) and OAB (OABSS\u0026thinsp;\u0026ge;\u0026thinsp;3 with urgency\u0026thinsp;\u0026ge;\u0026thinsp;2) were used as binary outcomes. Model-derived classifications were compared with ground-truth classifications using confusion matrices, accuracy, Cohen\u0026rsquo;s kappa, and McNemar\u0026rsquo;s test. Discriminative performance was evaluated using receiver operating characteristic (ROC) curves and the AUC, with comparisons performed using DeLong\u0026rsquo;s test. Item-level agreement for individual questionnaire components was assessed using weighted kappa coefficients. Statistical significance was defined as two-sided p\u0026thinsp;\u0026lt;\u0026thinsp;0.05. All analyses were performed using R version 4.4.3.\u003c/p\u003e \u003c/div\u003e\n\u003ch3\u003eEthical Considerations\u003c/h3\u003e\n\u003cp\u003eThis study was conducted in accordance with the principles of the Declaration of Helsinki and was approved by the Institutional Review Board of Seoul National University Boramae Medical Center (IRB No. 30-2025-38). Given the retrospective nature of the study and the use of de-identified clinical data, the requirement for informed consent was waived by the IRB. 
All patient symptom narratives and questionnaire responses were extracted only after removal of personally identifiable information, and no data used in this study were linked to patient identities. The AI models were applied solely for the retrospective analysis and did not influence clinical decision-making or patient care.\u003c/p\u003e"},{"header":"Results","content":"\u003cdiv id=\"Sec8\" class=\"Section2\"\u003e \u003ch2\u003eStudy Population\u003c/h2\u003e \u003cp\u003eA total of 91 patients were included in the analysis, comprising 53 men (58.2%) and 38 women (41.8%), with a mean age of 71.9\u0026thinsp;\u0026plusmn;\u0026thinsp;10.3 years. Among them, 49 patients completed the IPSS questionnaire and 78 completed the OABSS.\u003c/p\u003e \u003cp\u003eThe mean ground-truth total score was 15.6\u0026thinsp;\u0026plusmn;\u0026thinsp;6.7 for IPSS (range, 3\u0026ndash;32) and 6.9\u0026thinsp;\u0026plusmn;\u0026thinsp;3.6 for OABSS (range, 0\u0026ndash;14). Based on standard thresholds, 57.1% of IPSS respondents were classified as having moderate symptoms and 28.6% as severe symptoms, while 64.1% of OABSS respondents met criteria for OAB. Detailed baseline characteristics are presented in Table\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab1\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003e\u003cb\u003eBaseline characteristics of study population\u003c/b\u003e Demographic features and questionnaire completion rates are presented for all included patients (N\u0026thinsp;=\u0026thinsp;91). 
Distributions of IPSS and OABSS total scores, including severity categories and OAB classification status, are shown for patients who completed each questionnaire.\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"2\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eVariables\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eValue (N\u0026thinsp;=\u0026thinsp;91)\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eAge (years)\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e71.9\u0026thinsp;\u0026plusmn;\u0026thinsp;10.3\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eSex\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e\u0026nbsp;\u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eMale\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e53 (58.2%)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eFemale\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e38 (41.8%)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eQuestionnaire completion\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e\u0026nbsp;\u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" 
colname=\"c1\"\u003e \u003cp\u003eCompleted IPSS\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e49 (53.8%)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eCompleted OABSS\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e78 (85.7%)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eIPSS (n\u0026thinsp;=\u0026thinsp;49)\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e\u0026nbsp;\u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eMean\u0026thinsp;\u0026plusmn;\u0026thinsp;SD\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e15.6\u0026thinsp;\u0026plusmn;\u0026thinsp;6.7\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eRange\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e3\u0026ndash;32\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eMild (0\u0026ndash;7)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e7 (14.3%)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eModerate (8\u0026ndash;19)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e28 (57.1%)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eSevere (20\u0026ndash;35)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e14 (28.6%)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e 
\u003cp\u003e\u003cb\u003eOABSS (n\u0026thinsp;=\u0026thinsp;78)\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e\u0026nbsp;\u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eMean\u0026thinsp;\u0026plusmn;\u0026thinsp;SD\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e6.9\u0026thinsp;\u0026plusmn;\u0026thinsp;3.6\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eRange\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e0\u0026ndash;14\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eOAB positive\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e50 (64.1%)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eOAB negative\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e28 (35.9%)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003c/div\u003e\n\u003ch3\u003eContinuous Score Prediction Performance\u003c/h3\u003e\n\u003cp\u003eFor IPSS total score prediction, ChatGPT-4o achieved an MAE of 4.88 and RMSE of 6.34, whereas ChatGPT-5.1 yielded an MAE of 4.92 and RMSE of 7.01. There were no statistically significant differences between models for either MAE (p\u0026thinsp;=\u0026thinsp;0.952) or RMSE (p\u0026thinsp;=\u0026thinsp;0.485).\u003c/p\u003e \u003cp\u003eFor OABSS prediction, ChatGPT-4o demonstrated an MAE of 1.88 and RMSE of 2.76, compared with 2.03 and 2.93 for ChatGPT-5.1, respectively. 
Between-model differences again did not reach statistical significance (MAE: p\u0026thinsp;=\u0026thinsp;0.430; RMSE: p\u0026thinsp;=\u0026thinsp;0.303). Detailed performance metrics are summarized in Table\u0026nbsp;\u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e2\u003c/span\u003e.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab2\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 2\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003e\u003cb\u003eComparative performance of ChatGPT-4o and ChatGPT-5.1 for prediction of IPSS and OABSS total scores\u003c/b\u003e Performance metrics for continuous score prediction are shown for each model. Accuracy was assessed using MAE and RMSE, agreement using ICC (two-way mixed-effects, absolute agreement), and monotonic association using Spearman\u0026rsquo;s correlation coefficient. Between-model comparisons for MAE and RMSE were performed using paired t-tests.\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"6\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eScore\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eModel\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eMAE\u003c/p\u003e 
\u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eRMSE\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003eICC\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c6\"\u003e \u003cp\u003eSpearman\u0026rsquo;s ρ\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"2\" rowspan=\"3\"\u003e \u003cp\u003e\u003cb\u003eIPSS\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eChatGPT-4o\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e4.88\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e6.34\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e0.561\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e0.586\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eChatGPT-5.1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e4.92\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e7.01\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e0.509\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e0.593\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u003cem\u003ep-value\u003c/em\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.952\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.485\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e─\u003c/p\u003e 
\u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e─\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"2\" rowspan=\"3\"\u003e \u003cp\u003e\u003cb\u003eOABSS\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eChatGPT-4o\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e1.88\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e2.76\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e0.704\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e0.711\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eChatGPT-5.1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e2.03\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e2.93\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e0.658\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e0.660\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u003cem\u003ep-value\u003c/em\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.430\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.303\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e─\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e─\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e\n\u003ch3\u003eAgreement and 
Correlation Analysis\u003c/h3\u003e\n\u003cp\u003eIntraclass correlation coefficients indicated moderate agreement between AI-generated and ground-truth scores for IPSS (ICC: 0.561 for ChatGPT-4o and 0.509 for ChatGPT-5.1; both p\u0026thinsp;\u0026lt;\u0026thinsp;0.001). Agreement was higher for OABSS, with ICC values of 0.704 for ChatGPT-4o and 0.658 for ChatGPT-5.1 (both p\u0026thinsp;\u0026lt;\u0026thinsp;0.001).\u003c/p\u003e \u003cp\u003eSpearman\u0026rsquo;s correlation analyses demonstrated significant positive associations between predicted and true scores. For IPSS, correlation coefficients were 0.586 for ChatGPT-4o and 0.593 for ChatGPT-5.1 (both p\u0026thinsp;\u0026lt;\u0026thinsp;0.001). For OABSS, correlations were stronger, at 0.711 and 0.660, respectively (both p\u0026thinsp;\u0026lt;\u0026thinsp;0.001).\u003c/p\u003e \u003cp\u003eBland\u0026ndash;Altman analyses showed no evidence of systematic bias across either scoring system or model (Supplementary Figure \u003cspan refid=\"MOESM1\" class=\"InternalRef\"\u003eS1\u003c/span\u003e).\u003c/p\u003e \u003cdiv id=\"Sec11\" class=\"Section2\"\u003e \u003ch2\u003eClassification Performance\u003c/h2\u003e \u003cp\u003eWhen classifying moderate-to-severe LUTS (IPSS\u0026thinsp;\u0026ge;\u0026thinsp;8), ChatGPT-4o achieved an accuracy of 0.912 and ChatGPT-5.1 an accuracy of 0.923. Cohen\u0026rsquo;s kappa values were 0.823 and 0.846, respectively (both p\u0026thinsp;\u0026lt;\u0026thinsp;0.001), with no significant between-model difference on McNemar\u0026rsquo;s test (p\u0026thinsp;=\u0026thinsp;1.00).\u003c/p\u003e \u003cp\u003eFor OAB classification (OABSS\u0026thinsp;\u0026ge;\u0026thinsp;3 with urgency\u0026thinsp;\u0026ge;\u0026thinsp;2), accuracy was 0.879 for ChatGPT-4o and 0.857 for ChatGPT-5.1. 
Corresponding kappa values were 0.755 and 0.708 (both p\u0026thinsp;\u0026lt;\u0026thinsp;0.001), and McNemar\u0026rsquo;s test indicated no significant difference between models (p\u0026thinsp;=\u0026thinsp;0.752). Classification metrics are presented in Table\u0026nbsp;\u003cspan refid=\"Tab3\" class=\"InternalRef\"\u003e3\u003c/span\u003e.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab3\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 3\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003e\u003cb\u003eClassification performance of ChatGPT-4o and ChatGPT-5.1 for clinically significant LUTS and OAB\u003c/b\u003e Classification performance metrics for moderate-to-severe LUTS (IPSS\u0026thinsp;\u0026ge;\u0026thinsp;8) and OAB (OABSS\u0026thinsp;\u0026ge;\u0026thinsp;3 with urgency\u0026thinsp;\u0026ge;\u0026thinsp;2). Accuracy, Cohen\u0026rsquo;s kappa (κ), and area under the receiver operating characteristic curve (AUC) are shown for each model. 
Between-model comparisons were performed using McNemar\u0026rsquo;s test for accuracy and DeLong\u0026rsquo;s test for AUC.\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"5\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eModel\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eAccuracy\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eCohen\u0026rsquo;s κ\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003eAUC\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"2\" rowspan=\"3\"\u003e \u003cp\u003e\u003cb\u003eLUTS\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eChatGPT-4o\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0.912\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0.823\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e0.744\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eChatGPT-5.1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0.923\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e 
\u003cp\u003e0.846\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e0.747\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u003cem\u003ep-value\u003c/em\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e1.00\u003csup\u003e\u0026dagger;\u003c/sup\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e─\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e0.957\u003csup\u003e#\u003c/sup\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"2\" rowspan=\"3\"\u003e \u003cp\u003e\u003cb\u003eOAB\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eChatGPT-4o\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0.879\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0.755\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e0.880\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eChatGPT-5.1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0.857\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0.708\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e0.823\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e\u003cem\u003ep-value\u003c/em\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0.752\u003csup\u003e\u0026dagger;\u003c/sup\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e─\u003c/p\u003e 
\u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e0.068\u003csup\u003e#\u003c/sup\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003ctfoot\u003e \u003ctr\u003e\u003ctd colspan=\"5\"\u003e\u0026dagger; McNemar test\u003c/td\u003e\u003c/tr\u003e \u003ctr\u003e\u003ctd colspan=\"5\"\u003e# DeLong test\u003c/td\u003e\u003c/tr\u003e \u003c/tfoot\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003eReceiver operating characteristic analysis demonstrated comparable discriminative performance. For LUTS classification, AUC values were 0.744 for ChatGPT-4o and 0.747 for ChatGPT-5.1 (DeLong p\u0026thinsp;=\u0026thinsp;0.957). For OAB classification, AUC values were 0.880 and 0.823, respectively (DeLong p\u0026thinsp;=\u0026thinsp;0.068). ROC curves are shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec12\" class=\"Section2\"\u003e \u003ch2\u003eItem-Level Agreement\u003c/h2\u003e \u003cp\u003eWeighted kappa analyses revealed variable item-level agreement across IPSS components. For ChatGPT-4o, kappa values ranged from 0.21 to 0.51, whereas for ChatGPT-5.1, values ranged from 0.14 to 0.69. Several items\u0026mdash;particularly Q3, Q5, and Q7\u0026mdash;demonstrated relatively higher agreement with ChatGPT-5.1.\u003c/p\u003e \u003cp\u003eFor OABSS items, both models showed moderate to substantial agreement. Weighted kappa values ranged from 0.41 to 0.73 for ChatGPT-4o and from 0.40 to 0.61 for ChatGPT-5.1, with the highest agreement consistently observed for urgency and daytime frequency items (Q3 and Q4). 
Detailed item-level results are provided in Supplementary Table \u003cspan refid=\"MOESM1\" class=\"InternalRef\"\u003eS1\u003c/span\u003e.\u003c/p\u003e \u003c/div\u003e"},{"header":"Discussion","content":"\u003cp\u003eThis study evaluated the feasibility of automated LUTS scoring derived from clinician-documented patient narratives using two generative AI models, ChatGPT-4o and ChatGPT-5.1. Both models demonstrated moderate predictive accuracy for IPSS and OABSS total scores, with comparable MAE and RMSE values and no statistically significant between-model differences. Agreement metrics indicated moderate concordance with ground-truth scores, while binary classification of clinically significant LUTS and OAB achieved consistently high accuracy across models. Taken together, these findings suggest that contemporary LLMs demonstrate moderate yet consistent performance in translating unstructured clinical narratives into standardized symptom scores, supporting their potential role as adjunctive tools in urological practice.\u003c/p\u003e \u003cp\u003eValidated questionnaires such as the IPSS and OABSS remain the reference standard for LUTS assessment. 
However, real-world outpatient practice often relies on narrative documentation rather than structured instruments.\u003csup\u003e\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e,\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e\u003c/sup\u003e The capacity of LLMs to operationalize narrative text into structured scores may therefore offer pragmatic value, especially when questionnaires are unavailable.\u003c/p\u003e \u003cp\u003eMost prior clinical NLP research has focused on entity recognition, concept extraction, or classification tasks rather than direct quantitative score generation.\u003csup\u003e\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e,\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e\u003c/sup\u003e Even large clinical language models such as GatorTron primarily demonstrated improvements on benchmark NLP tasks rather than structured symptom score generation.\u003csup\u003e\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e\u003c/sup\u003e In this context, the present study contributes evidence that generative LLMs can approximate validated symptom scores directly from routine documentation.\u003c/p\u003e \u003cp\u003eImportantly, the narratives analyzed in this study were not verbatim patient-generated text but clinician-documented summaries of patient-reported symptoms. Although physicians aim to accurately capture patients\u0026rsquo; words, documentation inevitably reflects selective emphasis, paraphrasing, and clinical interpretation. 
Prior work has demonstrated variability in clinical note composition across providers and institutions, which can introduce systematic documentation bias.\u003csup\u003e\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e\u003c/sup\u003e Thus, AI-based score prediction from clinician-authored narratives is inherently dependent on the completeness and framing of documentation. The moderate ICC values observed here may therefore reflect not only model limitations but also upstream variability in symptom capture.\u003c/p\u003e \u003cp\u003eThe stronger performance observed for OABSS compared with IPSS is noteworthy. OABSS items─particularly urgency and frequency─tend to correspond more directly to explicit linguistic expressions in clinical notes. In contrast, IPSS encompasses a broader range of storage and voiding symptoms, including intermittency and straining, which may be described less explicitly or inferred contextually. Linguistic variability in symptom reporting has been previously recognized as a challenge for automated extraction systems.\u003csup\u003e\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e\u003c/sup\u003e These differences may partially explain the higher ICC and Spearman correlations observed for OABSS.\u003c/p\u003e \u003cp\u003eAt the item level, notable differences in weighted kappa values between ChatGPT-4o and ChatGPT-5.1 were observed for specific IPSS components, particularly Q3 (intermittency), Q6 (straining), and Q7 (nocturia). These items require nuanced interpretation of temporal patterns, effort, or symptom frequency that may not always be explicitly quantified in narrative documentation. The higher variability between models in these domains may reflect differences in contextual inference strategies or threshold interpretation when converting qualitative descriptions into ordinal response categories. 
Such discrepancies highlight that improvements across model generations do not necessarily translate uniformly across all symptom dimensions.\u003c/p\u003e \u003cp\u003eFrom a clinical standpoint, automated symptom scoring from narrative text may offer several practical applications. It could reduce clinician workload associated with manual score calculation, support retrospective research using existing EMR data, and enhance telemedicine workflows where structured questionnaires are not systematically administered. However, the moderate agreement levels observed indicate that LLM-derived scores should currently be considered supportive tools rather than substitutes for validated questionnaires, particularly in contexts requiring precise longitudinal monitoring, such as therapeutic response assessment or clinical trials.\u003c/p\u003e \u003cp\u003eThis study has several strengths, including the use of real-world clinician-authored narratives, comparison of two generative AI models, and comprehensive evaluation across accuracy, agreement, correlation, and classification metrics. Nonetheless, limitations should be acknowledged. The single-center design may limit generalizability. Neither model was fine-tuned on urology-specific corpora, and domain adaptation has been shown to substantially improve clinical NLP performance.\u003csup\u003e\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e\u003c/sup\u003e Additionally, documentation style and completeness vary across providers, potentially influencing model performance.\u003c/p\u003e \u003cp\u003eFuture research should include multi-center validation, evaluation using patient-generated narratives, and exploration of domain-specific fine-tuning strategies. 
Prospective integration within EMR systems will be essential to determine real-world clinical impact.\u003c/p\u003e"},{"header":"Conclusions","content":"\u003cp\u003eIn conclusion, ChatGPT-4o and ChatGPT-5.1 approximated LUTS-related symptom scores from unstructured clinical narratives with moderate agreement and acceptable predictive accuracy. These findings provide preliminary evidence for the feasibility of generative AI\u0026ndash;based automated symptom scoring and suggest potential roles as supportive tools within real-world urology workflows.\u003c/p\u003e"},{"header":"Abbreviations","content":"\u003cp\u003eAUC, Area under the receiver operating characteristic curve;\u003c/p\u003e\n\u003cp\u003eICC, Intraclass correlation coefficient;\u003c/p\u003e\n\u003cp\u003eIPSS, International Prostate Symptom Score;\u003c/p\u003e\n\u003cp\u003eLLM, Large language model;\u003c/p\u003e\n\u003cp\u003eLUTS, Lower urinary tract symptoms;\u003c/p\u003e\n\u003cp\u003eMAE, Mean absolute error;\u003c/p\u003e\n\u003cp\u003eNLP, Natural language processing;\u003c/p\u003e\n\u003cp\u003eOAB, Overactive bladder;\u003c/p\u003e\n\u003cp\u003eOABSS, Overactive Bladder Symptom Score;\u003c/p\u003e\n\u003cp\u003eQoL, Quality of life;\u003c/p\u003e\n\u003cp\u003eRMSE, Root mean squared error;\u003c/p\u003e\n\u003cp\u003eROC, Receiver operating characteristic\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eEthics approval and consent to participate\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThis study was conducted in accordance with the Declaration of Helsinki. Ethical approval was obtained from the Institutional Review Board of Seoul National University Boramae Medical Center (IRB No. 30-2025-38). 
The requirement for informed consent was waived by the IRB because this was a retrospective medical record review that involved no direct patient contact, posed minimal risk, and used fully de-identified data.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eConsent for publication\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eNot applicable.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eClinical trial number\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eNot applicable.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAvailability of data and materials\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe datasets generated and/or analyzed during the current study are not publicly available due to institutional data protection regulations but are available from the corresponding author on reasonable request.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCompeting interests\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe authors declare that they have no competing interests.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eFunding\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThis research received no external funding.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAuthors’ contributions\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eResearch conception and design: Jiehyeon Lee and Hoyoung Bae\u003c/p\u003e\n\u003cp\u003eData acquisition: Jiehyeon Lee\u003c/p\u003e\n\u003cp\u003eData analysis and interpretation: Jiehyeon Lee and Hoyoung Bae\u003c/p\u003e\n\u003cp\u003eStatistical analysis: Jiehyeon Lee and Sangjun Yoo\u003c/p\u003e\n\u003cp\u003eDrafting of the manuscript: Jiehyeon Lee and Min Chul Cho\u003c/p\u003e\n\u003cp\u003eCritical revision of the manuscript: Min Chul Cho and Hwancheol Son\u003c/p\u003e\n\u003cp\u003eAdministrative, technical, or material support: Jung Hoon Lee\u003c/p\u003e\n\u003cp\u003eSupervision: Hyeon Jeong\u003c/p\u003e\n\u003cp\u003eApproval of the final manuscript: all 
authors\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAcknowledgements\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe authors thank the clinical staff of the Department of Urology at Seoul National University Boramae Medical Center for their support in data acquisition. No external writing or editorial assistance was used.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eIrwin DE, Milsom I, Hunskaar S, Reilly K, Kopp Z, Herschorn S, et al. Population-based survey of urinary incontinence, overactive bladder, and other lower urinary tract symptoms in five countries: results of the EPIC study. Eur Urol. 2006;50(6):1306\u0026ndash;15.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ePrzydacz M, Gasowski J, Grodzicki T, Chlosta P. Lower urinary tract symptoms and overactive bladder in a large cohort of older Poles\u0026mdash;A representative tele-survey. J Clin Med. 2023;12(8):2859.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLee YS, Lee KS, Jung JH, Han DH, Oh SJ, Seo JT, et al. Prevalence of overactive bladder, urinary incontinence, and lower urinary tract symptoms: results of Korean EPIC study. World J Urol. 2011;29(2):185\u0026ndash;90.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eYao MW, Green JS. How international is the International Prostate Symptom Score? A literature review of validated translations of the IPSS, the most widely used self-administered patient questionnaire for male lower urinary tract symptoms. LUTS. 2022;14(2):92\u0026ndash;101.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHomma Y, Yoshida M, Seki N, Yokoyama O, Kakizaki H, Gotoh M, et al. Symptom assessment tool for overactive bladder syndrome\u0026mdash;Overactive Bladder Symptom Score (OABSS). Urology. 2006;68(2):318\u0026ndash;23.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHuang K, Altosaar J, Ranganath R. 
ClinicalBERT: modeling clinical notes and predicting hospital readmission. arXiv [Preprint]. 2019; arXiv:1904.05342.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eNazi ZA, Peng W. Large language models in healthcare and medical domain: a review. Informatics. 2024;11(3):57.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMeystre SM, Savova GK, Kipper-Schuler KC, Hurdle JF. Extracting information from textual documents in the electronic health record: a review of recent research. Yearb Med Inf. 2008;17(1):128\u0026ndash;44.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBae H, Lee GM, Lee J, et al. Estimation of IPSS and OABSS scores using ChatGPT-4o: a comparative validation study in Korea. BMC Urol. 2026. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1186/s12894-026-02054-z\u003c/span\u003e\u003cspan address=\"10.1186/s12894-026-02054-z\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKung TH, Cheatham M, Medenilla A, Sillos C, De Leon L, Elepa\u0026ntilde;o C, et al. Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models. PLOS Digit Health. 2023;2(2):e0000198.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSinghal K, Azizi S, Tu T, Mahdavi SS, Wei J, Chung HW, et al. Large language models encode clinical knowledge. Nature. 2023;620(7972):172\u0026ndash;80.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBarry MJ, Fowler FJ Jr, O'Leary MP, Bruskewitz RC, Holtgrewe HL, Mebust WK, et al. The American Urological Association symptom index for benign prostatic hyperplasia. J Urol. 1992;148(5):1549\u0026ndash;57.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eYang X, Chen A, PourNejatian N, Shin HC, Smith KE, Parisien C, et al. A large language model for electronic health records. NPJ Digit Med. 
2022;5(1):194.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eCohen GR, Friedman CP, Ryan AM, Richardson CR, Adler-Milstein J. Variation in physicians\u0026rsquo; electronic health record documentation and potential patient harm from that variation. J Gen Intern Med. 2019;34(11):2355\u0026ndash;67.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMeystre S, Haug PJ. Natural language processing to extract medical problems from electronic clinical documents: performance evaluation. J Biomed Inf. 2006;39(6):589\u0026ndash;99.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAlsentzer E, Murphy J, Boag W, Weng WH, Jindi D, Naumann T et al. Publicly available clinical BERT embeddings. Proc 2nd Clin Nat Lang Process Workshop. 2019:72\u0026ndash;8.\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":false,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"bmc-urology","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"buro","sideBox":"Learn more about [BMC Urology](http://bmcurol.biomedcentral.com/)","snPcode":"","submissionUrl":"https://www.editorialmanager.com/buro/default.aspx","title":"BMC Urology","twitterHandle":"BMC_series","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"em","reportingPortfolio":"BMC Series","inReviewEnabled":true,"inReviewRevisionsEnabled":true},"keywords":"Artificial intelligence, Large language model, ChatGPT, Lower urinary tract symptom, International Prostate Symptom Score, Overactive Bladder Symptom Score, Natural language 
processing","lastPublishedDoi":"10.21203/rs.3.rs-8939728/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-8939728/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003ch2\u003eBackground\u003c/h2\u003e \u003cp\u003eLower urinary tract symptoms (LUTS) are commonly quantified using validated instruments such as the International Prostate Symptom Score (IPSS) and the Overactive Bladder Symptom Score (OABSS). However, structured questionnaires are not consistently administered in routine outpatient practice, where clinicians frequently rely on narrative documentation. While generative large language models (LLMs) have demonstrated potential in clinical natural language processing, their performance stability across successive model generations for automated symptom scoring remains unclear. We conducted a head-to-head comparison of ChatGPT-4o and ChatGPT-5.1 to evaluate generational performance differences in automated LUTS scoring.\u003c/p\u003e\u003ch2\u003eMaterials and Methods\u003c/h2\u003e \u003cp\u003eThis retrospective single-center study included 91 patients presenting with LUTS or overactive bladder (OAB)-related complaints between April and June 2025. Free-text symptom narratives documented in electronic medical records were directly input into each model using identical prompts. Ground-truth IPSS and OABSS total scores were obtained from completed questionnaires. Predictive performance was assessed using mean absolute error (MAE), root mean squared error (RMSE), intraclass correlation coefficient (ICC), and Spearman\u0026rsquo;s correlation. 
Classification of clinically significant LUTS (IPSS\u0026thinsp;\u0026ge;\u0026thinsp;8) and OAB (OABSS\u0026thinsp;\u0026ge;\u0026thinsp;3 with urgency score\u0026thinsp;\u0026ge;\u0026thinsp;2) was evaluated using accuracy, Cohen\u0026rsquo;s kappa (κ), and area under the receiver operating characteristic curve (AUC).\u003c/p\u003e\u003ch2\u003eResults\u003c/h2\u003e \u003cp\u003eAmong 91 patients (mean age 71.9\u0026thinsp;\u0026plusmn;\u0026thinsp;10.3 years; 58.2% male), 49 completed IPSS and 78 completed OABSS. For IPSS prediction, MAE/RMSE were 4.88/6.34 for ChatGPT-4o and 4.92/7.01 for ChatGPT-5.1, with no significant differences between models. For OABSS prediction, MAE/RMSE were 1.88/2.76 and 2.03/2.93, respectively, again without significant differences. Agreement was moderate for IPSS (ICC 0.561\u0026ndash;0.509) and higher for OABSS (ICC 0.704\u0026ndash;0.658). Classification accuracy exceeded 0.85 for both outcomes, and ROC analysis demonstrated comparable discriminative performance across model versions.\u003c/p\u003e\u003ch2\u003eConclusions\u003c/h2\u003e \u003cp\u003eChatGPT-4o and ChatGPT-5.1 demonstrated comparable and clinically acceptable performance in approximating LUTS-related symptom scores from unstructured clinical narratives. 
These findings suggest that generative AI models may serve as supportive tools for automated symptom quantification and indicate relative performance stability across model generations within real-world urological practice.\u003c/p\u003e","manuscriptTitle":"Comparative evaluation of ChatGPT-4o and ChatGPT-5.1 for automated LUTS symptom scoring from clinician-documented narratives","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2026-04-19 07:48:02","doi":"10.21203/rs.3.rs-8939728/v1","editorialEvents":[{"type":"communityComments","content":0},{"type":"reviewerAgreed","content":"193952353073250095303426682426662040460","date":"2026-05-13T06:51:27+00:00","index":"hide","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2026-04-15T23:34:26+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"186046368366645900846544573997417834739","date":"2026-04-08T13:45:57+00:00","index":"hide","fulltext":""},{"type":"reviewersInvited","content":"","date":"2026-04-08T05:52:58+00:00","index":"","fulltext":""},{"type":"editorInvited","content":"","date":"2026-02-26T12:52:36+00:00","index":"","fulltext":""},{"type":"editorAssigned","content":"","date":"2026-02-26T07:45:25+00:00","index":"","fulltext":""},{"type":"checksComplete","content":"","date":"2026-02-26T07:38:21+00:00","index":"","fulltext":""},{"type":"submitted","content":"BMC Urology","date":"2026-02-22T14:08:11+00:00","index":"","fulltext":""}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"bmc-urology","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"buro","sideBox":"Learn more about [BMC Urology](http://bmcurol.biomedcentral.com/)","snPcode":"","submissionUrl":"https://www.editorialmanager.com/buro/default.aspx","title":"BMC Urology","twitterHandle":"BMC_series","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"em","reportingPortfolio":"BMC 
Series","inReviewEnabled":true,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"076e33ab-8db3-4a61-a254-b15975a5a3a6","owner":[],"postedDate":"April 19th, 2026","published":true,"recentEditorialEvents":[{"type":"reviewerAgreed","content":"193952353073250095303426682426662040460","date":"2026-05-13T06:51:27+00:00","index":61,"fulltext":""}],"rejectedJournal":[],"revision":"","amendment":"","status":"under-review","subjectAreas":[],"tags":[],"updatedAt":"2026-04-19T07:48:02+00:00","versionOfRecord":[],"versionCreatedAt":"2026-04-19 07:48:02","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-8939728","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-8939728","identity":"rs-8939728","version":["v1"]},"buildId":"XKTyCvWXoU3ODBz1xrDgd","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}
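
The error, classification, and agreement metrics named in the abstract (MAE, RMSE, accuracy, and Cohen's κ, with clinically significant LUTS defined as IPSS ≥ 8 and OAB as OABSS ≥ 3 with urgency score ≥ 2) can be illustrated with a minimal sketch. This is not the authors' analysis code: the function names and the toy scores below are hypothetical and serve only to show how each quantity is computed.

```python
import math

def mae(truth, pred):
    """Mean absolute error between questionnaire and model total scores."""
    return sum(abs(t - p) for t, p in zip(truth, pred)) / len(truth)

def rmse(truth, pred):
    """Root mean squared error between questionnaire and model total scores."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(truth, pred)) / len(truth))

def significant_luts(ipss_total):
    """Threshold stated in the abstract: IPSS >= 8."""
    return ipss_total >= 8

def has_oab(oabss_total, urgency_score):
    """Threshold stated in the abstract: OABSS >= 3 with urgency score >= 2."""
    return oabss_total >= 3 and urgency_score >= 2

def accuracy(truth, pred):
    """Fraction of cases where the binary labels agree."""
    return sum(t == p for t, p in zip(truth, pred)) / len(truth)

def cohen_kappa(truth, pred):
    """Cohen's kappa for two binary label lists (undefined if chance agreement is 1)."""
    n = len(truth)
    p_observed = accuracy(truth, pred)
    p_truth_pos = sum(truth) / n
    p_pred_pos = sum(pred) / n
    p_chance = p_truth_pos * p_pred_pos + (1 - p_truth_pos) * (1 - p_pred_pos)
    return (p_observed - p_chance) / (1 - p_chance)

# Hypothetical toy data, not taken from the study.
truth_ipss = [5, 12, 20, 7]
pred_ipss = [6, 10, 15, 9]

truth_labels = [significant_luts(s) for s in truth_ipss]
pred_labels = [significant_luts(s) for s in pred_ipss]

print(mae(truth_ipss, pred_ipss))
print(rmse(truth_ipss, pred_ipss))
print(accuracy(truth_labels, pred_labels))
print(cohen_kappa(truth_labels, pred_labels))
```

The between-model tests reported in the paper (McNemar and DeLong) and the ICC are omitted here; in practice they would typically come from a statistics package rather than hand-written code.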
