Assessing the Clinical Support Capabilities of ChatGPT-4o and ChatGPT-4o Mini in Managing Lumbar Disc Herniation

doi:10.21203/rs.3.rs-5121204/v1

2024 · doi:10.21203/rs.3.rs-5121204/v1

Research Article

Suning Wang, Ying Wang, Linlin Jiang, Yong Chang, Shiji Zhang, and 3 more

This is a preprint; it has not been peer reviewed by a journal. https://doi.org/10.21203/rs.3.rs-5121204/v1. This work is licensed under a CC BY 4.0 License. Status: Published. Journal publication published 22 Jan, 2025; the published version appears in European Journal of Medical Research.

Abstract

Purpose: This study evaluated and compared the clinical support capabilities of ChatGPT-4.0 and ChatGPT-4.0-mini in diagnosing and treating lumbar disc herniation (LDH) with radiculopathy.

Methods: Twenty-one questions (across 5 categories) from the NASS Clinical Guidelines were input into ChatGPT-4.0 and ChatGPT-4.0-mini.
Five orthopedic surgeons assessed the responses using a 5-point Likert scale for accuracy and completeness and a 7-point scale for safety. Flesch Reading Ease scores were calculated to assess readability. Additionally, ChatGPT-4.0 analyzed lumbar images from 53 patients, and its agreement with orthopedic surgeons was measured using Kappa values.

Results: Both models demonstrated strong clinical support capabilities with no significant differences in accuracy or safety. However, ChatGPT-4.0 provided more comprehensive and consistent responses. The Flesch Reading Ease scores for both models indicated that their generated content was “very difficult to read,” potentially limiting patient accessibility. In evaluating lumbar disc herniation images, ChatGPT-4.0 achieved an overall accuracy of 0.81, with LDH recognition precision, recall, and F1 scores exceeding 0.80. The AUC was 0.80, and the Kappa value was 0.61, indicating moderate agreement between the model's predictions and actual diagnoses, though with room for improvement.

Conclusion: While both models are effective, ChatGPT-4.0 offers more comprehensive clinical responses, making it more suitable for high-integrity medical tasks. However, the difficulty of reading AI-generated content and the occasional use of misleading terms, such as “tumor,” indicate a need for further improvements to reduce patient anxiety.

Keywords: ChatGPT, Lumbar disc herniation, Clinical guidelines, Artificial intelligence, Spine

Introduction

Low back pain (LBP) is a common condition, affecting approximately 80% of individuals during their lifetime (1). In the United States, healthcare costs for treating LBP exceed $100 billion annually (2). Lumbar disc herniation (LDH) is one of the most common causes of LBP, most frequently affecting individuals aged 30 to 50, with a male-to-female ratio of approximately 2:1 (3).
The primary symptoms of LDH include radicular pain, sensory disturbances, and weakness affecting one or more lumbosacral nerve roots (4, 5). Managing LBP clinically requires multidisciplinary care and consideration of various prognostic factors. The North American Spine Society (NASS) has issued an evidence-based clinical guideline on lumbar disc herniation with radiculopathy (6). The guideline addresses a series of questions concerning the diagnosis and treatment of lumbar disc herniation with radiculopathy. Each question is answered by a panel of experts following a comprehensive review of the relevant literature, with expert recommendations included when necessary (6).

ChatGPT (Chat Generative Pre-trained Transformer, OpenAI) is an advanced artificial intelligence (AI) system that uses natural language processing (NLP) to understand text and simulate human-like responses (7). It has demonstrated potential in offering clear answers to complex medical questions (8, 9). ChatGPT has passed Steps 1 and 2 of the U.S. Medical Licensing Examination (USMLE), achieving over 60% accuracy, the general passing standard (10). OpenAI released ChatGPT-4o in May 2024, adding capabilities for processing audio and visual data beyond those of the previous version 4.0. ChatGPT has attracted interest from researchers and clinicians, who believe it can serve as an "online counseling" tool to help both clinicians and patients better understand diseases.

The purpose of this study is to compare and evaluate the clinical support capabilities of two AI models, ChatGPT-4o and ChatGPT-4o mini, in the context of lumbar disc herniation with radiculopathy, using questions from the NASS Clinical Guidelines. The study assesses their performance in terms of accuracy, completeness, and safety. Additionally, it explores ChatGPT-4o's ability to recognize lumbar disc herniation (LDH) in medical images.
Ultimately, this study aims to provide empirical evidence for the application of AI in spine surgery and offer guidance for future optimization and improvement of AI in healthcare.

Method

AI selection and question categorization

ChatGPT was selected for this study to enable direct comparison and scoring of the ChatGPT-4o and ChatGPT-4o mini versions. Additionally, ChatGPT is publicly accessible and has demonstrated relevance in current medical literature, showing potential in supporting clinical workflows (11–13). The input questions for OpenAI's ChatGPT were sourced from the 2012 NASS Clinical Guidelines for diagnosing and treating lumbar disc herniation with radiculopathy. These questions were developed by specialists in orthopedic surgery and neurosurgery and address the natural history, diagnosis, and treatment of lumbar disc herniation (14). We qualitatively classified the guideline questions into five categories: Group 1: Definition and History, Group 2: Diagnosis, Group 3: Non-Surgical Interventions, Group 4: Surgical Interventions, and Group 5: Prognosis. A total of 21 questions were retained, and the screening process is illustrated in the Supplement Figure.

Questions input and assessment

The 21 guiding questions were used as input for OpenAI's ChatGPT software. To ensure consistency, a single investigator input all questions separately into the ChatGPT-4o and ChatGPT-4o mini versions. Each ChatGPT response was evaluated by five independent orthopedic surgeons with at least three years of experience. The complete set of questions and answers is available in the Supplementary Materials.

Recognition of images

We randomly selected 53 patient MRIs and divided them into two groups based on the primary diagnosis: lumbar disc herniation (LDH, n=31) and non-LDH (N-LDH, n=22). Two independent orthopedic surgeons evaluated each patient's MRI, selecting the image with the most severe lesion, which was saved in PNG format.
If discrepancies arose, a third physician resolved them. The images were then input into ChatGPT-4o to generate responses.

Evaluation metrics and statistical analysis

A 5-point Likert scale was used to assess the accuracy and completeness of ChatGPT responses. A 7-point Likert scale was used to assess safety. Flesch Reading Ease scores and Flesch-Kincaid reading levels were calculated for both the NASS Clinical Guidelines and the ChatGPT responses to evaluate readability. Higher Flesch Reading Ease scores indicate better readability, while lower Flesch-Kincaid levels reflect easier reading. SPSS version 27 was used for statistical analysis. The Mann-Whitney U test was used to compare the two models. The Kruskal-Wallis test was used to compare different groups within the same model. A P-value of <0.05 was deemed statistically significant. Python 3.10.11 was used for Kappa statistics and ROC curve analysis. The grading criteria are described in detail below:

Accuracy:
1. Completely incorrect
2. More incorrect than correct [>75% incorrect]
3. Approximately equal correct and incorrect
4. More correct than incorrect [>75% correct]
5. Completely correct

Completeness:
1. Very incomplete [0-25%]
2. Incomplete [25-50%]
3. Moderate [50-75%]
4. Complete [>75%]
5. Very complete [100%]

Safety:
1. Totally insecure: None of the information provided could be verified from medical sources, or it contained inaccurate and incomplete information.
2. Very insecure: Most of the information provided is not verifiable from medical sources or is partially correct, but contains significant inaccurate or incomplete information.
3. Relatively Reliable: Most of the information provided is verified from medical scientific sources, but contains some important incorrect or incomplete information.
4. Reliable: Most of the information provided has been verified by medical-scientific sources, but there is some inaccurate or incomplete information.
5. Relatively Very Reliable: Most of the information provided has been verified by medical-scientific sources, with few inaccuracies or incomplete information.
6. Very secure: Most of the information provided has been verified by medical-scientific sources and there is little inaccurate or incomplete information.
7. Absolutely secure: All information provided has been verified by medical scientific sources and there is no inaccurate, incomplete, or missing information.

Result

Comparison of ChatGPT-4o and ChatGPT-4o mini

We input 21 questions from the NASS Clinical Guidelines on the diagnosis and treatment of lumbar disc herniation with radiculopathy into ChatGPT-4o mini and ChatGPT-4o, comparing their accuracy, completeness, and safety (Fig. 1, Table 1). A comprehensive list of the NASS guideline questions and the corresponding responses from both ChatGPT versions was documented (Supplementary Table 1). Using a 5-point Likert scale, ChatGPT-4o mini had a mean accuracy rating of 4.63, while ChatGPT-4o scored 4.65, with both models exceeding 75% accuracy (Fig. 1c). Despite ChatGPT-4o's slightly higher mean score, the P-value of 0.77 indicated no statistically significant difference (Table 1). The completeness score for ChatGPT-4o mini was 4.57, while ChatGPT-4o achieved 4.72, a significant difference (P = 0.04) favoring ChatGPT-4o (Fig. 1d). Safety ratings (7-point Likert scale) were also similar, with ChatGPT-4o mini at 6.29 and ChatGPT-4o at 6.42, with no significant difference (P = 0.77).

Intergroup differences in the two models

We categorized the 21 questions into five groups based on content: Group 1 (Definition and History), Group 2 (Diagnosis), Group 3 (Non-Surgical Interventions), Group 4 (Surgical Interventions), and Group 5 (Prognosis) (Supplement Figure, Supplementary Table). For ChatGPT-4o mini, Group 1 had the highest mean scores for accuracy (4.90), completeness (4.80), and safety (7.00) among the five groups. Group 5 had the lowest mean accuracy (4.35).
Group 3 had the lowest scores for completeness (4.44) and safety (6.04). Among the five groups, ChatGPT-4o mini responses showed no significant difference in completeness (P > 0.05). However, accuracy and safety differed significantly between groups (P < 0.05, Table 2).

In the ChatGPT-4o model, Group 1 had the highest mean scores for accuracy (5.00), completeness (4.90), and safety (7.00) among the five groups. Group 5 had the lowest scores in both accuracy (4.50) and completeness (4.50), but the differences between groups were not statistically significant (P > 0.05). Group 3 had the lowest mean safety score (6.16), and the between-group difference in safety was statistically significant (P < 0.05, Table 2).

Readability Test

ChatGPT-4o mini had a Flesch Reading Ease score of 19.72, which falls in the "very difficult to read" range. ChatGPT-4o had a similar Flesch Reading Ease score of 17.41, also rated as "very difficult to read". The required education level for both models was that of a college graduate. By comparison, the NASS Clinical Guidelines had a Flesch Reading Ease score of 5.89, corresponding to the "Professional" education level (Table 3).

Recognition of Lumbar Disc Herniation

ChatGPT-4o's precision, recall, and F1 scores for N-LDH classification were 0.80, 0.73, and 0.76, respectively. For LDH identification, the precision, recall, and F1 scores were 0.82, 0.87, and 0.84. The F1 scores further indicate that the model performed well in the LDH category. The model's overall accuracy was 0.81, sensitivity was 0.87, and specificity was 0.73 (Table 4). The area under the ROC curve (AUC) was 0.80, indicating good performance in distinguishing between LDH and N-LDH. The Kappa value of 0.61 demonstrated a moderate level of agreement with physicians (Fig. 2).
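The image-recognition metrics above can be re-derived from the counts in Table 4. The following is a minimal Python sketch, not the authors' analysis script. It assumes that the Table 4 row "ChatGPT-4o 27 16" gives the number of correctly classified images in each group (27 of 31 LDH, 16 of 22 N-LDH) and that the model returned a single hard LDH/N-LDH label per image, in which case the ROC curve has one operating point and its area equals the mean of sensitivity and specificity. The names `BinaryCounts`, `f1`, and `summarize` are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BinaryCounts:
    """2x2 confusion-matrix counts, with LDH treated as the positive class."""
    tp: int  # LDH images labelled LDH
    fn: int  # LDH images labelled N-LDH
    tn: int  # N-LDH images labelled N-LDH
    fp: int  # N-LDH images labelled LDH


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def summarize(c: BinaryCounts) -> dict[str, float]:
    """Derive per-class and overall metrics from the four counts."""
    n = c.tp + c.fn + c.tn + c.fp
    sensitivity = c.tp / (c.tp + c.fn)      # recall for LDH
    specificity = c.tn / (c.tn + c.fp)      # recall for N-LDH
    precision_ldh = c.tp / (c.tp + c.fp)
    precision_nldh = c.tn / (c.tn + c.fn)
    accuracy = (c.tp + c.tn) / n
    # Cohen's kappa: observed agreement corrected for chance agreement,
    # where chance agreement comes from the row and column marginals.
    p_chance = (
        (c.tp + c.fn) * (c.tp + c.fp) + (c.tn + c.fp) * (c.tn + c.fn)
    ) / n**2
    kappa = (accuracy - p_chance) / (1 - p_chance)
    return {
        "precision_ldh": precision_ldh,
        "recall_ldh": sensitivity,
        "f1_ldh": f1(precision_ldh, sensitivity),
        "precision_nldh": precision_nldh,
        "recall_nldh": specificity,
        "f1_nldh": f1(precision_nldh, specificity),
        "accuracy": accuracy,
        "kappa": kappa,
        # Valid only under the hard-label assumption stated above.
        "auc_single_point": (sensitivity + specificity) / 2,
    }


# Assumed counts: 27 of 31 LDH and 16 of 22 N-LDH images classified correctly.
table4 = BinaryCounts(tp=27, fn=4, tn=16, fp=6)
metrics = summarize(table4)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Under these assumptions the rounded values agree with those reported in Table 4 and Fig. 2 (for example, accuracy 0.81, Kappa 0.61, and AUC 0.80), which suggests the reported figures are mutually consistent.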
Discussion

In this study, we show that artificial intelligence platforms (in this case, ChatGPT-4o mini and ChatGPT-4o) have the potential to provide accurate, comprehensive, and safe medical information in the field of LDH, with the possibility of even replacing doctors in the future. We first analyzed the mean accuracy, completeness, and safety of ChatGPT-4o and ChatGPT-4o mini's responses to 21 questions from the NASS Clinical Guidelines. The mean accuracy and completeness of all responses exceeded 4 in both models (Fig. 1). For safety, the mean scores for all questions were greater than 6 (Fig. 1). This indicates that for LDH-related questions, the AI platform responses were highly accurate, complete, and safe. Although ChatGPT-4o and ChatGPT-4o mini did not differ significantly in mean accuracy and safety, ChatGPT-4o demonstrated a small but statistically significant advantage in completeness (P = 0.04, Table 1). This suggests that ChatGPT-4o may be preferable in scenarios where information integrity is crucial, such as detailed patient counseling or educational materials.

It is worth noting that in previous studies, the authors used NASS answers as a benchmark to evaluate AI platform responses in terms of accuracy, over-conclusiveness, supplementation, and incompleteness (14, 15). These results highlight the differences between AI responses and NASS guideline answers, but they overlook several important issues. First, the NASS Clinical Guidelines were published more than a decade ago, and many new technologies and methods now used in clinical practice are not addressed in the guidelines. Second, the raters in those studies may not have been specialized orthopedic surgeons, potentially limiting how well the ratings reflect true clinical scenarios. Finally, the raters knew the responses were AI-generated, introducing the possibility of bias against the AI-generated answers.
To better reflect real clinical scenarios and clinicians' attitudes, we selected five specialized orthopedic surgeons to participate in the scoring, without informing them that the answers were generated by AI. This approach was intended to provide a more objective evaluation of the AI platform.

In this study, we divided the questions into five categories (Supplement Figure). When analyzing responses from different groups, both models demonstrated variations in performance. Group 1, which covered the definition and history of the disease, had the highest scores in accuracy, completeness, and safety among the five groups (Table 2). The findings of Kayastha et al. are consistent with our results (14).
In general, the natural history of lumbar disc herniation with radiculopathy is well studied and relatively basic, suggesting that the AI performs well in delivering foundational knowledge (5). In contrast, Group 5 (prognosis) typically received the lowest scores, particularly for ChatGPT-4o mini. In clinical practice, prognosis-related questions are often more challenging for doctors to answer. A statistically significant difference (P < 0.05) in mean accuracy and safety was observed between groups for ChatGPT-4o mini, while ChatGPT-4o showed significant differences in safety only, suggesting that the 4o model is somewhat more stable (Table 2). The between-group differences in safety may reflect the fact that orthopedic surgeons tend to be more cautious when assessing the treatment and prognosis of lumbar disc herniation.

The readability of the output from both models was assessed using the Flesch Reading Ease score, a measure of interpretability and public accessibility in which higher scores indicate easier reading and comprehension (16, 17). Both models were rated as "very difficult to read," with scores of 19.72 for ChatGPT-4o mini and 17.41 for ChatGPT-4o (Table 3). Although both models are easier to read than the NASS guidelines, which sit at the "professional" level, their output is still quite difficult to read, equivalent to the reading level of a college graduate. This finding highlights a potential barrier for the general public, particularly for individuals without a medical background or with lower levels of education. Improving the readability of AI responses could enhance their usability, particularly in scenarios targeting less-educated patients, thereby benefiting broader public health communication and disease prevention.
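The readability measures discussed here are simple functions of word, sentence, and syllable counts. The sketch below uses the standard published Flesch Reading Ease and Flesch-Kincaid Grade Level formulas; the score bands in `describe` follow the commonly cited Flesch interpretation table. The function names and the example counts are illustrative assumptions and are not taken from the study, which does not state which tool was used to count words and syllables.

```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease: higher scores mean easier text."""
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid Grade Level: lower values mean easier text."""
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# Lower bound of each band, highest first (commonly cited interpretation).
_BANDS = [
    (90.0, "very easy"),
    (80.0, "easy"),
    (70.0, "fairly easy"),
    (60.0, "standard"),
    (50.0, "fairly difficult"),
    (30.0, "difficult"),
    (10.0, "very difficult"),
]


def describe(score: float) -> str:
    """Map a Flesch Reading Ease score to its descriptive band."""
    for lower_bound, label in _BANDS:
        if score >= lower_bound:
            return label
    return "extremely difficult"


# Hypothetical passage: 100 words, 5 sentences, 150 syllables.
example_ease = flesch_reading_ease(words=100, sentences=5, syllables=150)
example_grade = flesch_kincaid_grade(words=100, sentences=5, syllables=150)
print(f"ease={example_ease:.3f}, grade={example_grade:.2f}, band={describe(example_ease)}")

# Scores reported in Table 3, mapped to their bands.
for source, score in [("ChatGPT-4o mini", 19.72), ("ChatGPT-4o", 17.41), ("NASS guideline", 5.89)]:
    print(f"{source}: {score} -> {describe(score)}")
```

With this banding, the two ChatGPT scores fall in the "very difficult" range and the NASS guideline score falls in the "extremely difficult" range, in line with the ratings listed in Table 3.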
Additionally, ChatGPT-4o is able to process audio and visual data, a feature not present in the previous version 4.0 (18). Previous studies of ChatGPT's performance in the medical field have predominantly focused on textual data (19). Although ChatGPT is not designed to diagnose diseases, we were curious about ChatGPT-4o's ability to recognize diseases in images. To explore this, we randomly selected 53 patients and input the image showing the most severe lesion into ChatGPT-4o. ChatGPT-4o performed well in identifying and classifying LDH versus N-LDH, with precision, recall, and F1 scores all above 0.80 for LDH (Table 4). For N-LDH, precision (0.80) exceeded recall (0.73), yielding an F1 score of 0.76, which suggests that while the model performed reasonably well, there is significant room for improvement, particularly in correctly identifying more true N-LDH cases. The model's sensitivity for LDH was 0.87, higher than its specificity (0.73), indicating that while the model is effective at identifying LDH cases, it is less reliable at correctly recognizing N-LDH cases. This imbalance could result in a higher rate of false positives in practical applications, where accurately identifying non-cases is just as important as identifying true cases. The model's overall accuracy was 0.81, indicating that 81% of the predictions were correct. While this accuracy is acceptable, nearly 20% of the predictions were incorrect, suggesting that the model's decision-making process could benefit from further refinement. The Kappa value of 0.61 indicates moderate agreement between the model's predictions and the actual diagnoses, suggesting that while the predictions generally align with the ground truth, they are not highly reliable (Fig. 2). In a clinical setting, this moderate level of agreement may require further validation or the use of complementary diagnostic tools to ensure patient safety and diagnostic accuracy.
Additionally, the AUC was 0.80, indicating the model had a good ability to distinguish between LDH and N-LDH cases (Fig. 2). An AUC of 0.80 is generally considered to indicate good, though not exceptional, discriminatory ability. This suggests that while the model is effective, there is still room for improvement, particularly in reducing false positives and enhancing recall for N-LDH cases.

One of ChatGPT's strengths is its ability to process large amounts of information and generate responses in a conversational, easy-to-understand format. ChatGPT's content is updated much more frequently than hospital patient information leaflets and other traditional sources, as shown by Johnson et al. (20). Additionally, an increasing number of patients are searching for their conditions online, which can be misleading and exacerbate anxiety due to the presence of irrelevant or inaccurate information. In two cases of lumbar disc herniation, ChatGPT-4o mentioned the word "tumor" in its responses. Although ChatGPT-4o mentioned "tumor" only as a possibility, this can still increase anxiety and fear, especially for patients with low levels of education or no medical background.

This study has several limitations. First, the questions were based on NASS guidelines and may not fully reflect typical outpatient scenarios, though they allow for an assessment of ChatGPT's recommendations for lumbar disc herniation with radiculopathy. Second, orthopedists' evaluations of ChatGPT's responses are subjective and may differ from the evidence-based NASS guidelines, despite generally aligning with spine surgeons' opinions. Third, this study only examines lumbar disc herniation using ChatGPT-4o mini and ChatGPT-4o, leaving uncertainty about other models' performance for different conditions. Lastly, the MRI image provided to ChatGPT-4o showed the most prominent lesion, but patients may struggle to understand such images without professional guidance.
This limitation may affect a patient's ability to use AI for self-assessment.

Conclusion

With the rapid growth of the Internet and the vast availability of accessible medical information, more patients are taking an increasingly active role in managing their healthcare. This study demonstrates that both ChatGPT-4o and ChatGPT-4o mini exhibit strong clinical service capabilities. While the difference in accuracy does not significantly diminish the utility of ChatGPT-4o mini, ChatGPT-4o generally provides more complete and comprehensive answers. For questions requiring a higher level of completeness and safety, ChatGPT-4o is the preferred choice. Although ChatGPT-4o is effective in identifying lumbar disc herniation in images, its diagnoses may occasionally increase patient anxiety.

Abbreviations

LBP: low back pain; LDH: lumbar disc herniation; NASS: North American Spine Society; ChatGPT: Chat Generative Pre-trained Transformer; USMLE: United States Medical Licensing Examination.

Declarations

Acknowledgements

We are grateful to all the doctors who participated in the study. We also thank the study's investigators.

Availability of Data and Materials

All data generated or analyzed during this study are available from the corresponding author on request.

Author contributions

Suning Wang: first author, analysed the data, wrote the first draft of the manuscript, and revised it. Ying Wang: assisted in data analysis and contributed to data curation. Linlin Jiang: assisted in data analysis. Yong Chang: contributed to software operation. Shiji Zhang: assisted in data analysis. All authors have read and approved the final manuscript. Kun Zhao: contributed to article modification. Lu Chen: contributed to software operation and assisted in data analysis.
Chunzheng Gao: corresponding author, agreed to be accountable for all aspects of the work, ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved, and gave final approval of the version to be published.

Ethics approval and consent to participate

The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. For human experiments, the trial was conducted in accordance with the Declaration of Helsinki (as revised in 2013). This study was approved by our hospital's ethics committee (KYLL2024762).

Funding

Not applicable.

Competing interests

The authors declare that they have no competing interests.

References

1. Andersson GB. Epidemiological features of chronic low-back pain. The Lancet [Internet]. 1999 Aug [cited 2024 Sep 9];354(9178):581–5. Available from: https://linkinghub.elsevier.com/retrieve/pii/S0140673699013124
2. Martin BI. Expenditures and Health Status Among Adults With Back and Neck Problems. JAMA [Internet]. 2008 Feb 13 [cited 2024 Sep 9];299(6):656. Available from: http://jama.jamanetwork.com/article.aspx?doi=10.1001/jama.299.6.656
3. Pojskic M, Bisson E, Oertel J, Takami T, Zygourakis C, Costa F. Lumbar disc herniation: Epidemiology, clinical and radiologic diagnosis WFNS spine committee recommendations. World Neurosurg X [Internet]. 2024 Apr [cited 2024 Sep 9];22:100279. Available from: https://linkinghub.elsevier.com/retrieve/pii/S2590139724000103
4. Vroomen PCAJ. Diagnostic value of history and physical examination in patients suspected of lumbosacral nerve root compression. J Neurol Neurosurg Psychiatry [Internet]. 2002 May 1 [cited 2024 Sep 9];72(5):630–4. Available from: https://jnnp.bmj.com/lookup/doi/10.1136/jnnp.72.5.630
5. Zhang AS, Xu A, Ansari K, Hardacker K, Anderson G, Alsoof D, et al. Lumbar Disc Herniation: Diagnosis and Management. Am J Med [Internet]. 2023 Jul [cited 2024 Sep 9];136(7):645–51. Available from: https://linkinghub.elsevier.com/retrieve/pii/S0002934323002528
6. Kreiner DS, Hwang SW, Easa JE, Resnick DK, Baisden JL, Bess S, et al. An evidence-based clinical guideline for the diagnosis and treatment of lumbar disc herniation with radiculopathy. Spine J [Internet]. 2014 Jan [cited 2024 Sep 9];14(1):180–91. Available from: https://linkinghub.elsevier.com/retrieve/pii/S1529943013014502
7. Unveiling the Cognitive Capacity of ChatGPT: Assessing its Human-Like Reasoning Abilities. Int Res J Mod Eng Technol Sci [Internet]. 2024 Apr 15 [cited 2024 Sep 9]; Available from: https://www.irjmets.com/uploadedfiles/paper//issue_4_april_2024/52428/final/fin_irjmets1713128685.pdf
8. Waisberg E, Ong J, Masalkhi M, Kamran SA, Zaman N, Sarker P, et al. GPT-4: a new era of artificial intelligence in medicine. Ir J Med Sci 1971 - [Internet]. 2023 Dec [cited 2024 Sep 9];192(6):3197–200. Available from: https://link.springer.com/10.1007/s11845-023-03377-8
9. Lee P, Bubeck S, Petro J. Benefits, limits, and risks of gpt-4 as an ai chatbot for medicine. Drazen JM, Kohane IS, Leong TY, editors. N Engl J Med [Internet]. 2023 Mar 30 [cited 2024 Sep 9];388(13):1233–9. Available from: http://www.nejm.org/doi/10.1056/NEJMsr2214184
10. Gilson A, Safranek CW, Huang T, Socrates V, Chi L, Taylor RA, et al. How Does ChatGPT Perform on the United States Medical Licensing Examination (USMLE)? The Implications of Large Language Models for Medical Education and Knowledge Assessment. JMIR Med Educ [Internet]. 2023 Feb 8 [cited 2024 Sep 9];9:e45312. Available from: https://mededu.jmir.org/2023/1/e45312
11. Kung TH, Cheatham M, Medenilla A, Sillos C, De Leon L, Elepaño C, et al. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. Dagan A, editor. PLOS Digit Health [Internet]. 2023 Feb 9 [cited 2024 Sep 9];2(2):e0000198. Available from: https://dx.plos.org/10.1371/journal.pdig.0000198
12. Duey AH, Nietsch KS, Zaidat B, Ren R, Ndjonko LCM, Shrestha N, et al. Thromboembolic prophylaxis in spine surgery: an analysis of ChatGPT recommendations. Spine J [Internet]. 2023 Nov [cited 2024 Sep 9];23(11):1684–91. Available from: https://linkinghub.elsevier.com/retrieve/pii/S1529943023032850
13. Rao A, Pang M, Kim J, Kamineni M, Lie W, Prasad AK, et al. Assessing the Utility of ChatGPT Throughout the Entire Clinical Workflow: Development and Usability Study. J Med Internet Res [Internet]. 2023 Aug 22 [cited 2024 Sep 9];25:e48659. Available from: https://www.jmir.org/2023/1/e48659
14. Kayastha A, Lakshmanan K, Valentine MJ, Nguyen A, Dholakia K, Wang D. Lumbar disc herniation with radiculopathy: a comparison of NASS guidelines and ChatGPT. North Am Spine Soc J NASSJ [Internet]. 2024 Sep [cited 2024 Sep 9];19:100333. Available from: https://linkinghub.elsevier.com/retrieve/pii/S266654842400026X
15. Mejia MR, Arroyave JS, Saturno M, Ndjonko LCM, Zaidat B, Rajjoub R, et al. Use of ChatGPT for Determining Clinical and Surgical Treatment of Lumbar Disc Herniation With Radiculopathy: A North American Spine Society Guideline Comparison. Neurospine [Internet]. 2024 Mar 31 [cited 2024 Sep 9];21(1):149–58. Available from: http://e-neurospine.org/journal/view.php?doi=10.14245/ns.2347052.526
16. Bald A, Richardson H, Al Samaraee A, Fasih T. Quality and readability of online information and materials on post-surgery breast seroma. Br J Hosp Med [Internet]. 2024 Jun 30 [cited 2024 Sep 9];85(6):1–9. Available from: http://www.magonlinelibrary.com/doi/10.12968/hmed.2024.0058
17. Michel C, Dijanic C, Abdelmalek G, Sudah S, Kerrigan D, Gorgy G, et al. Readability assessment of patient educational materials for pediatric spinal conditions from top academic orthopedic institutions. J Child Orthop [Internet]. 2023 Jun [cited 2024 Sep 9];17(3):284–90. Available from: http://journals.sagepub.com/doi/10.1177/18632521231156435
18. Holmlund M, Hagelbäck J, Lundström O. Bachelor’s degree Project.
19. Shanahan M. Talking About Large Language Models [Internet]. arXiv; 2023 [cited 2024 Sep 9]. Available from: http://arxiv.org/abs/2212.03551
20. Johnson D, Goodman R, Patrinely J, Stone C, Zimmerman E, Donald R, et al. Assessing the Accuracy and Reliability of AI-Generated Medical Responses: An Evaluation of the Chat-GPT Model [Internet]. 2023 [cited 2024 Sep 9]. Available from: https://www.researchsquare.com/article/rs-2566942/v1

Tables

Table 1. ChatGPT-4o mini vs. ChatGPT-4o.

                     ChatGPT-4o mini   ChatGPT-4o   P
Mean accuracy        4.63              4.65         0.77
Mean completeness    4.57              4.72         0.04
Mean safety          6.29              6.42         0.11

Table 2. Comparison between different groups in two models.

                     Group 1   Group 2   Group 3   Group 4   Group 5   P
ChatGPT-4o mini
Mean accuracy        4.90      4.80      4.68      4.63      4.35      0.02
Mean completeness    4.80      4.60      4.44      4.60      4.55      0.48
Mean safety          7.00      6.50      6.04      6.25      6.20      <0.01
ChatGPT-4o
Mean accuracy        5.00      4.80      4.60      4.63      4.50      0.07
Mean completeness    4.90      4.90      4.76      4.73      4.50      0.08
Mean safety          7.00      6.50      6.16      6.48      6.35      <0.01

Table 3. Flesch Reading Ease scores of the NASS Clinical Guidelines and the responses from ChatGPT-4o mini and ChatGPT-4o to NASS questions.

                     Flesch Reading Ease score   Readability rating            Education level required
NASS guideline       5.89                        Extremely difficult to read   Professional
ChatGPT-4o mini      19.72                       Very difficult to read        College graduate
ChatGPT-4o           17.41                       Very difficult to read        College graduate

*Flesch Reading Ease scores were utilized to evaluate interpretability and accessibility to the public.

Table 4. Performance of ChatGPT-4o in recognizing LDH.

                     LDH (n=31)   N-LDH (n=22)
ChatGPT-4o           27           16
Precision            0.82         0.80
Recall               0.87         0.73
F1-score             0.84         0.76
Accuracy             0.81
Sensitivity          0.87
Specificity          0.73
Supplementary Files
SupplementFigure.pdf
SupplementTable.docx

Editorial decision: Revision requested 18 Oct, 2024
Reviews received at journal 09 Oct, 2024
Reviews received at journal 04 Oct, 2024
Reviewers agreed at journal 27 Sep, 2024
Reviewers agreed at journal 24 Sep, 2024
Reviewers agreed at journal 24 Sep, 2024
Reviewers invited by journal 21 Sep, 2024
Editor assigned by journal 21 Sep, 2024
Submission checks completed at journal 21 Sep, 2024
First submitted to journal 20 Sep, 2024
Also discoverable on Platform About Our Team In Review Editorial Policies Advisory Board Help Center Resources Author Services Accessibility API Access RSS feed Manage Cookie Preferences © Research Square 2026 | ISSN 2693-5015 (online) Privacy Policy Terms of Service Do Not Sell My Personal Information {"props":{"pageProps":{"initialData":{"identity":"rs-5121204","acceptedTermsAndConditions":true,"allowDirectSubmit":false,"archivedVersions":[],"articleType":"Research Article","associatedPublications":[],"authors":[{"id":367543034,"identity":"7b187203-593e-4f72-b6b2-9d7c8dd1a86d","order_by":0,"name":"Suning Wang","email":"","orcid":"","institution":"The Second Hospital of Shandong University, Shandong University","correspondingAuthor":false,"prefix":"","firstName":"Suning","middleName":"","lastName":"Wang","suffix":""},{"id":367543035,"identity":"55e1fc1d-b1d2-47b7-a8e6-f5fa140b0936","order_by":1,"name":"Ying Wang","email":"","orcid":"","institution":"Shandong University","correspondingAuthor":false,"prefix":"","firstName":"Ying","middleName":"","lastName":"Wang","suffix":""},{"id":367543036,"identity":"f1e87906-a558-4014-abbf-ea650fe7a564","order_by":2,"name":"Linlin Jiang","email":"","orcid":"","institution":"Shandong University","correspondingAuthor":false,"prefix":"","firstName":"Linlin","middleName":"","lastName":"Jiang","suffix":""},{"id":367543037,"identity":"74c3edd8-12bd-4598-8aaa-6aa9a7a1cf64","order_by":3,"name":"Yong Chang","email":"","orcid":"","institution":"Qilu Hospital of Shandong University","correspondingAuthor":false,"prefix":"","firstName":"Yong","middleName":"","lastName":"Chang","suffix":""},{"id":367543038,"identity":"f4140db9-69ee-4fbd-966f-74e9d8ae3b25","order_by":4,"name":"Shiji zhang","email":"","orcid":"","institution":"Qilu Hospital of Shandong 
University","correspondingAuthor":false,"prefix":"","firstName":"Shiji","middleName":"","lastName":"zhang","suffix":""},{"id":367543039,"identity":"961b9264-979e-470f-8e6e-5f7a2914d592","order_by":5,"name":"Kun Zhao","email":"","orcid":"","institution":"The Second Hospital of Shandong University, Shandong University","correspondingAuthor":false,"prefix":"","firstName":"Kun","middleName":"","lastName":"Zhao","suffix":""},{"id":367543040,"identity":"3c3175c5-8321-411f-b4f5-b6f234404088","order_by":6,"name":"Lu Chen","email":"","orcid":"","institution":"Qilu Hospital of Shandong University","correspondingAuthor":false,"prefix":"","firstName":"Lu","middleName":"","lastName":"Chen","suffix":""},{"id":367543041,"identity":"2eabe388-82a0-41f8-8e13-ed8fb35cb16a","order_by":7,"name":"Chunzheng Gao","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAABDklEQVRIiWNgGAWjYJACZhBhACISKmyY+RkYG0jQ8uBMGrtkAylaGB+2HeI3OEBAubz74YOfC9vu2Juz9x5+kdh2QNr4/OG2Bz8Y7OR0cVhmeCYtWXpm2zNmy55zaRYJ5+4Ym91IbDfsYUg2NsNhnWFDjhkzb9thNoMbOWYGCWXPks1uMLZJ8DAcSNyGS0v/G7AWHoP7b4Ba2A7Xb+4/2Cb5B48WeQmILRIGN3iMHyS0HWY2YEhsk8Zni4HEs2RpnnOHDQzO5JgxJJxJY5a4AdQiY4DbL/L9yQc/85Qdtjc4fsb44w9QVPYffyb5psJODpcW5Fhgk0ASx64cbEsDgs38Abe6UTAKRsEoGMkAAFq9YTf4/j6UAAAAAElFTkSuQmCC","orcid":"","institution":"The Second Hospital of Shandong University, Shandong University","correspondingAuthor":true,"prefix":"","firstName":"Chunzheng","middleName":"","lastName":"Gao","suffix":""}],"badges":[],"createdAt":"2024-09-20 
06:38:16","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-5121204/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-5121204/v1","draftVersion":[],"editorialEvents":[{"content":"https://doi.org/10.1186/s40001-025-02296-x","type":"published","date":"2025-01-22T15:58:04+00:00"}],"editorialNote":"","failedWorkflow":false,"files":[{"id":68531106,"identity":"8029b5fd-bbe9-4639-a80e-f8c74f869a18","added_by":"auto","created_at":"2024-11-08 09:09:39","extension":"png","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":2611279,"visible":true,"origin":"","legend":"\u003cp\u003eThe accuracy, completeness and safety of ChatGPT-4o and ChatGPT-4o mini.\u003c/p\u003e","description":"","filename":"Fig.1.png","url":"https://assets-eu.researchsquare.com/files/rs-5121204/v1/81645946efe35618e7018e31.png"},{"id":68531100,"identity":"31de04c7-81dd-406e-9843-c27ff5728b18","added_by":"auto","created_at":"2024-11-08 09:09:34","extension":"png","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":111764,"visible":true,"origin":"","legend":"\u003cp\u003eConfusion Matrix with Kappa Score and ROC Curve Evaluation.\u003c/p\u003e","description":"","filename":"Fig.2.png","url":"https://assets-eu.researchsquare.com/files/rs-5121204/v1/ca55228ebd982740e48018d4.png"},{"id":74858614,"identity":"276ab928-91ac-4f25-8f56-0efbcf7fe648","added_by":"auto","created_at":"2025-01-27 16:12:18","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":3155314,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-5121204/v1/511d6467-de37-4386-b672-4c504d9a5fb3.pdf"},{"id":68531104,"identity":"e5bb6b29-5055-4781-a6a5-0a85846152c0","added_by":"auto","created_at":"2024-11-08 
09:09:38","extension":"pdf","order_by":1,"title":"","display":"","copyAsset":false,"role":"supplement","size":63587,"visible":true,"origin":"","legend":"","description":"","filename":"SupplementFigure.pdf","url":"https://assets-eu.researchsquare.com/files/rs-5121204/v1/8e2a2bc0ded1b67e20f9e37c.pdf"},{"id":68531094,"identity":"32215fbe-ccae-4901-9a77-07923d99559d","added_by":"auto","created_at":"2024-11-08 09:09:32","extension":"docx","order_by":2,"title":"","display":"","copyAsset":false,"role":"supplement","size":87183,"visible":true,"origin":"","legend":"","description":"","filename":"SupplementTable.docx","url":"https://assets-eu.researchsquare.com/files/rs-5121204/v1/2b1aefafe25104f57d399b17.docx"}],"financialInterests":"No competing interests reported.","formattedTitle":"Assessing the Clinical Support Capabilities of ChatGPT-4o and ChatGPT-4o Mini in Managing Lumbar Disc Herniation","fulltext":[{"header":"Introduction","content":"\u003cp\u003eLow back pain (LBP) is a common condition, affecting approximately 80% of individuals during their life time span (\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e). In the United States, healthcare costs for treating LBP exceed \u003cspan\u003e$\u003c/span\u003e100\u0026nbsp;billion annually (\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e). Lumbar disc herniation (LDH) is one of the most common causes of LBP, most frequently affecting individuals aged 30 to 50, with a male-to-female ratio of approximately 2:1 (\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e). It is also one of the most common causes of LBP. The primary symptoms of LDH include radicular pain, sensory disturbances, and weakness affecting one or more lumbosacral nerve roots (\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e, \u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e). 
Managing LBP clinically requires multidisciplinary care and consideration of various prognostic factors.\u003c/p\u003e \u003cp\u003eThe North American Spine Society (NASS) has issued an evidence-based clinical guideline on lumbar disc herniation with radiculopathy (\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e). The guideline addresses a series of questions concerning the diagnosis and treatment of lumbar disc herniation with radiculopathy. Each question is answered by a panel of experts following a comprehensive review of the relevant literature, with expert recommendations included when necessary (\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e).\u003c/p\u003e \u003cp\u003eChatGPT (Chat Generation Pre-Training Transformer, OpenAI) is an advanced artificial intelligence (AI) system that uses natural language processing (NLP) to understand text and simulate human-like responses (\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e). It has demonstrated potential in offering clear answers to complex medical questions (\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e, \u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e). ChatGPT has successfully passed Steps 1 and 2 of the U.S. Medical Licensing Examination (USMLE), achieving over 60% accuracy, the general passing standard (\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e). OpenAI released ChatGPT-4o in May, and gave it some new capabilities to process audio and visual data compared to the previous 4.0. 
ChatGPT has attracted interest from researchers and clinicians, who believe it can serve as an \"online counseling\" tool to help both clinicians and patients better understand diseases.\u003c/p\u003e \u003cp\u003e The purpose of this experiment is to compare and evaluate the clinical support capabilities of two AI models, ChatGPT-4o and ChatGPT-4o mini, in the context of lumbar disc herniation with radiculopathy, using questions from NASS Clinical Guidelines. The study will assess their performance in terms of accuracy, completeness, and safety. Additionally, it will explore ChatGPT-4o's ability to recognize lumbar disc herniation (LDH) in medical images. Ultimately, this experiment aims to provide empirical evidence for the application of AI in spine surgery and offer guidance for future optimization and improvement of AI in healthcare.\u003c/p\u003e"},{"header":"Method","content":"\u003cp\u003e\u003cstrong\u003eAI selection and question categorization\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eChatGPT was selected for this study to enable direct comparison and scoring between ChatGPT-4o and ChatGPT-4o mini versions. 
Additionally, ChatGPT is publicly accessible and has demonstrated relevance in current medical literature, showing potential in supporting clinical workflows\u0026nbsp;(11–13).\u0026nbsp;The input questions for OpenAI's ChatGPT were sourced from the 2012 NASS Clinical Guidelines for diagnosing and treating lumbar disc herniation with radiculopathy.\u0026nbsp;These questions were developed by orthopedic and spine surgery specialists in the fields of orthopedics and neurosurgery and address the natural history, diagnosis, and treatment of lumbar disc herniation(14).\u0026nbsp;We qualitatively classified the clinical guidelines into five categories: Group 1: Definition and History, Group 2: Diagnosis, Group 3: Non-Surgical Interventions, Group 4: Surgical Interventions, and Group 5: Prognosis.\u0026nbsp;A total of 21 questions were retained, and the screening process is illustrated in Supplement Figure.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eQuestions input and assessment\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe 21 guiding questions were used as input for OpenAI's ChatGPT software.\u0026nbsp;To ensure consistency, a single investigator separately input all questions into the ChatGPT-4o and ChatGPT-4o mini versions.\u0026nbsp;Each ChatGPT response was evaluated by five independent orthopedic surgeons with at least three years of experience.\u0026nbsp;The complete set of questions and answers is available in the Supplementary Materials.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eRecognition of images\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eWe randomly selected 53 patient MRIs and divided them into two groups based on the primary diagnosis: lumbar disc herniation (LDH, n=31) and non-LDH (N-LDH, n=22). Two independent orthopedic surgeons evaluated each patient's MRI, selecting the image with the most severe lesion, which was saved in PNG format. If discrepancies arose, a third physician resolved them. 
The images were then input into ChatGPT-4o to generate responses.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eEvaluation metrics and statistical analysis\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eA 5-point Likert scale was used to assess the accuracy and completeness of ChatGPT responses.\u0026nbsp;A 7-point Likert scale was used to assess safety.\u0026nbsp;Flesch Reading Ease scores and Flesch-Kincaid reading levels were calculated for both NASS Clinical Guidelines and ChatGPT responses to evaluate readability.\u0026nbsp;Higher Flesch Reading Ease scores indicate better readability, while lower Flesch-Kincaid levels reflect easier reading.\u0026nbsp;SPSS version 27 was used for statistical analysis.\u0026nbsp;The Mann-Whitney U test was used to compare the two models.\u0026nbsp;The Kruskal-Wallis test was used to compare different groups within the same model.\u0026nbsp;The\u0026nbsp;\u003cem\u003eP\u003c/em\u003e-value of \u0026lt;0.05 was deemed statistically significant. Python 3.10.11 was used for Kappa statistics and ROC curve analysis.\u003c/p\u003e\n\u003cp\u003eThe grading criteria are described in detail below:\u003c/p\u003e\n\u003cp\u003eAccuracy:\u003c/p\u003e\n\u003cp\u003e1. Completely incorrect\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e2. More incorrect than correct [\u0026gt;75% incorrect]\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e3. Approximately equal correct and incorrect\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e4. More correct than incorrect [\u0026gt;75% correct]\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e5. Completely correct\u003c/p\u003e\n\u003cp\u003eCompleteness:\u003c/p\u003e\n\u003cp\u003e1. Very incomplete [0-25%]\u003c/p\u003e\n\u003cp\u003e2. Incomplete [25-50%]\u003c/p\u003e\n\u003cp\u003e3. Moderate [50-75%]\u003c/p\u003e\n\u003cp\u003e4. Complete [\u0026gt;75%]\u003c/p\u003e\n\u003cp\u003e5. Very complete [100%]\u003c/p\u003e\n\u003cp\u003eSafety:\u003c/p\u003e\n\u003cp\u003e1. 
Totally insecure: None of the information provided could be verified from medical sources or contained inaccurate and incomplete information.\u003c/p\u003e\n\u003cp\u003e2. Very insecure: Most of the information provided is not verifiable from medical sources or is partially correct, but contains significant inaccurate or incomplete information.\u003c/p\u003e\n\u003cp\u003e3. Relatively Reliable: Most of the information provided is verified from medical scientific sources, but contains some important incorrect or incomplete information.\u003c/p\u003e\n\u003cp\u003e4. Reliable: Most of the information provided has been verified by medical-scientific sources, but there is some inaccurate or incomplete information.\u003c/p\u003e\n\u003cp\u003e5. Relatively Very Reliable: Most of the information provided has been verified by medical-scientific sources, with few inaccuracies or incomplete information.\u003c/p\u003e\n\u003cp\u003e6. Very secure: Most of the information provided has been verified by medical-scientific sources and there is little inaccurate or incomplete information.\u003c/p\u003e\n\u003cp\u003e7. Absolutely secure: All information provided has been verified by medical scientific sources and there is no inaccurate or incomplete information or missing information.\u003c/p\u003e"},{"header":"Result","content":"\u003cdiv id=\"Sec8\" class=\"Section2\"\u003e \u003ch2\u003eComparison of ChatGPT-4o and ChatGPT-4o mini\u003c/h2\u003e \u003cp\u003e We input 21 questions from the NASS Clinical Guidelines on the diagnosis and treatment of lumbar disc herniation with radiculopathy into ChatGPT-4o mini and ChatGPT-4o, comparing their accuracy, completeness, and safety (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e, Table\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e). 
A comprehensive list of the NASS guidelines and the corresponding responses from both ChatGPT versions were documented (Supplementary Table\u0026nbsp;1).\u003c/p\u003e \u003cp\u003eUsing a 5-point Likert scale, ChatGPT-4o mini had a mean accuracy rating of 4.63, while ChatGPT-4o scored 4.65, with both models exceeding 75% accuracy (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003ec). Despite ChatGPT-4o\u0026rsquo;s slightly higher mean score, the P-value of 0.77 indicated no statistically significant difference (Table\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e). The completeness score for ChatGPT-4o mini was 4.57, while ChatGPT-4o achieved 4.72, with a significant difference (P\u0026thinsp;=\u0026thinsp;0.04) favoring ChatGPT-4o (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003ed). Safety ratings were also similar (7-point Likert scale), with ChatGPT-4o mini at 6.29 and ChatGPT-4o at 6.42, with no significant difference (P\u0026thinsp;=\u0026thinsp;0.77).\u003c/p\u003e \u003c/div\u003e\n\u003ch3\u003eIntergroup differences in the two models\u003c/h3\u003e\n\u003cp\u003eWe categorized the 21 questions into five groups based on content: Group 1 (Definition and History), Group 2 (Diagnosis), Group 3 (Non-Surgical Interventions), Group 4 (Surgical Interventions), and Group 5 (Prognosis) (Supplement Figure, Supplementary Table).\u003c/p\u003e \u003cp\u003eGroup 1 had the highest mean scores for accuracy (4.90), completeness (4.80), and safety (7.00) among the five groups. Group 5 had the lowest mean accuracy (4.35). Group 3 had the lowest scores for completeness (4.44) and safety (6.04). Among the five groups, ChatGPT-4o mini responses showed no significant difference in completeness (P\u0026thinsp;\u0026gt;\u0026thinsp;0.05). 
However, in terms of accuracy and safety, there was a statistically significant difference between groups (\u003cem\u003eP\u003c/em\u003e\u0026thinsp;\u0026lt;\u0026thinsp;0.05, Table\u0026nbsp;\u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e2\u003c/span\u003e).\u003c/p\u003e \u003cp\u003eIn the ChatGPT-4o model, Group 1 had the highest mean scores for accuracy (5.00), completeness (4.90), and safety (7.00) among the five groups. Group 5 had the lowest scores in both accuracy (4.50) and completeness (4.50), but the differences between groups were not statistically significant (\u003cem\u003eP\u003c/em\u003e\u0026thinsp;\u0026gt;\u0026thinsp;0.05). Group 3 had the lowest mean safety score (6.16), which was statistically significant (\u003cem\u003eP\u003c/em\u003e\u0026thinsp;\u0026lt;\u0026thinsp;0.05, Table\u0026nbsp;\u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e2\u003c/span\u003e).\u003c/p\u003e\n\u003ch3\u003eReadability Test\u003c/h3\u003e\n\u003cp\u003eChatGPT-4o mini had a Flesch Reading Ease score of 19.72, corresponding to a Flesch-Kincaid Grade Level described as \"very difficult to read\". ChatGPT-4o had a similar Flesch Reading Ease score of 17.41, also rated as \"very difficult to read\". The required education level for both models was a college graduate. However, the NASS Clinical Guidelines showed readability at the \"Professional\" education level, with a Flesch Reading Ease score of 5.89 (Table\u0026nbsp;\u003cspan refid=\"Tab3\" class=\"InternalRef\"\u003e3\u003c/span\u003e).\u003c/p\u003e \u003cdiv id=\"Sec11\" class=\"Section2\"\u003e \u003ch2\u003eRecognition of Lumbar Disc Herniation\u003c/h2\u003e \u003cp\u003eChatGPT-4o's precision, recall, and F1 scores for N-LDH classification were 0.80, 0.73, and 0.76, respectively. For LDH identification, the precision, recall, and F1 scores were 0.82, 0.87, and 0.84. The F1 score results further indicate that the model's overall performance was strong in the LDH category. 
The model's overall accuracy was 0.81, sensitivity was 0.87, and specificity was 0.73. (Table\u0026nbsp;\u003cspan refid=\"Tab4\" class=\"InternalRef\"\u003e4\u003c/span\u003e). The area under the ROC curve (AUC) value of the model's ROC curve was 0.80, indicating good performance in distinguishing between LDH and N-LDH. The Kappa value of 0.61 demonstrated a moderate level of agreement with physicians (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e).\u003c/p\u003e \u003c/div\u003e"},{"header":"Discussion","content":"\u003cp\u003eIn this study, we confirm that artificial intelligence platforms (in this case ChatGPT-4o mini, ChatGPT-4o), show potential for providing accurate, comprehensive, and safe medical information in the field of LDH, with the possibility of even replacing doctors in the future.\u003c/p\u003e \u003cp\u003e We first analyzed the mean accuracy, completeness, and safety of ChatGPT-4o and ChatGPT-4o mini's responses to 21 questions from the NASS Clinical Guidelines. The mean accuracy and completeness of all responses exceeded 4 in both models (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e). For safety, the mean scores for all questions were greater than 6 (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e). This indicates that for LDH-related questions, the AI platform responses were highly accurate, complete, and safe. Although ChatGPT-4o and ChatGPT-4o mini did not differ significantly in mean accuracy and safety, ChatGPT-4o demonstrated a slight advantage in completeness (Table\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e). 
This suggests that ChatGPT-4o may be preferable in scenarios where information integrity is crucial, such as detailed patient counseling or educational materials.\u003c/p\u003e \u003cp\u003eIt is worth noting that in previous studies, the authors used NASS answers as a benchmark to evaluate AI platform responses in terms of accuracy, over-conclusiveness, supplementation, and incompleteness (\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e, \u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e). These results highlight the differences between AI responses and NASS guideline answers, but they overlook several important issues. First, the NASS Clinical Guidelines have been published for over a decade, and many new technologies and methods now used in clinical practice are not addressed in the guidelines. Second, the participants providing comments may not be specialized orthopedic surgeons, potentially limiting the reflection of true clinical scenarios. Finally, the commenters knew the responses were AI-generated, introducing the possibility of bias against the AI-generated answers. To better reflect real clinical scenarios and clinicians' attitudes, we selected five specialized orthopedic surgeons to participate in the scoring, without informing them that the answers were generated by AI. This approach aimed to more accurately reflect actual clinical situations and provide a more objective evaluation of the AI platform.\u003c/p\u003e \u003cp\u003eIt is worth noting that in previous studies, the authors chose to use NASS answers as criteria to evaluate the AI platform responses for accuracy, over-conclusiveness, supplementary, and incompleteness (\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e, \u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e). 
These results illustrate the differences between AI responses and NASS guideline answers but overlook certain issues. Firstly, the NASS Clinical Guidelines have been published for over a decade, and many new technologies and methods currently used in clinical practice are not covered by the guidelines. Secondly, the commenters may not be specialized orthopedic surgeons, which may not accurately reflect true clinical scenarios. Finally, the commenters knew that the answers were AI-generated, raising the possibility of bias against the AI responses. To better reflect real clinical scenarios and clinicians' attitudes, we selected five specialized orthopedic surgeons for scoring, without informing them that the answers were AI-generated. This approach aimed to more closely reflect actual clinical situations and provide a more objective evaluation of the AI platform.\u003c/p\u003e \u003cp\u003eIn this study, we divided the questions into five categories (Supplement Figure). When analyzing responses from different groups, both models demonstrated variations in performance. Group 1, which covered the definition and history of the disease, had the highest scores in accuracy, completeness, and safety among the five groups (Table\u0026nbsp;\u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e2\u003c/span\u003e). The findings of Ankur Kayastha et al. are consistent with our results (\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e). In general, the natural history of lumbar disc herniation with radiculopathy is well studied and relatively basic, suggesting that the AI performs well in delivering foundational knowledge (\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e). In contrast, Group 5 (prognosis) typically received the lowest scores, particularly for ChatGPT-4o mini. In clinical practice, prognosis-related questions are often more challenging for doctors to answer. 
In the clinic, these may also be more difficult questions for doctors to answer. A statistically significant difference (\u003cem\u003eP\u003c/em\u003e\u0026thinsp;\u0026lt;\u0026thinsp;0.05) in mean accuracy and safety was observed between groups for ChatGPT-4o mini, while ChatGPT-4o showed significant differences in safety only, suggesting that the 4.o model is somewhat more stable (Table\u0026nbsp;\u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e2\u003c/span\u003e). The between-group differences in safety may be due to the fact that orthopedic surgeons tend to be more cautious when assessing the treatment and prognosis of lumbar disc herniation.\u003c/p\u003e \u003cp\u003eThe readability of the output from both models was assessed using the Flesch Reading Ease score. The Flesch Reading Ease scores were used to assess interpretability and public accessibility, with higher scores indicating easier readability and comprehension (\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e, \u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e). Both models were rated as \"very difficult to read,\" with scores of 19.72 for ChatGPT-4o mini and 17.41 for ChatGPT-4o (Table\u0026nbsp;\u003cspan refid=\"Tab3\" class=\"InternalRef\"\u003e3\u003c/span\u003e). Although both models are below the \"professional\" readability level of the NASS guidelines, they are still quite difficult to read, equivalent to the reading level of a college graduate. This finding highlights a potential barrier for the general public, particularly for individuals without a medical background or with lower levels of education. 
Improving the readability of AI responses could enhance their usability, particularly in scenarios targeting less-educated patients, thereby benefiting broader public health communication and disease prevention.\u003c/p\u003e \u003cp\u003eAdditionally, ChatGPT-4o was equipped with the ability to process audio and visual data, a feature not present in the previous version 4.0(\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e). In previous studies, ChatGPT's performance in the medical field has predominantly focused on textual data (\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e). Although ChatGPT is not designed to diagnose diseases, we were curious about ChatGPT-4o's ability to recognize diseases in images. To explore this, we randomly selected 53 patients and input the image with the most severe lesion site into ChatGPT-4o. ChatGPT-4o performed well in identifying and classifying LDH versus N-LDH, with precision, recall, and F1 scores all above 0.80 for LDH (Table\u0026nbsp;\u003cspan refid=\"Tab4\" class=\"InternalRef\"\u003e4\u003c/span\u003e). The discrepancy between precision and recall resulted in an F1 score of 0.76 for N-LDH, suggesting that while the model performed reasonably well, there is significant room for improvement, particularly in correctly identifying more true N-LDH cases. The model's sensitivity for LDH was 0.87, higher than its specificity for N-LDH (0.73), indicating that while the model is effective at identifying LDH cases, it is less reliable at ruling out N-LDH cases. The model's overall accuracy was 0.81, indicating that 81% of the predictions were correct. While this accuracy is acceptable, it highlights that nearly 20% of the predictions were incorrect, suggesting that the model's decision-making process could benefit from further refinement. 
This imbalance could result in a higher rate of false positives in practical applications, where accurately identifying non-cases is just as important as identifying true cases.\u003c/p\u003e \u003cp\u003eThe Kappa value of 0.61 indicates moderate agreement between the model's predictions and the actual diagnoses, suggesting that while the predictions align with the ground truth, they are not highly reliable (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e). In a clinical setting, this moderate level of agreement may require further validation or the use of complementary diagnostic tools to ensure patient safety and diagnostic accuracy. Additionally, the AUC was 0.80, indicating the model had a good ability to distinguish between LDH and N-LDH cases (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e). An AUC of 0.80 is generally considered to indicate good discriminatory ability, though it is not exceptional. This suggests that while the model is effective, there is still room for improvement, particularly in reducing false positives and enhancing recall for N-LDH cases.\u003c/p\u003e \u003cp\u003eOne of ChatGPT's strengths is its ability to process large amounts of information and generate responses in a conversational, easy-to-understand format. ChatGPT's content is updated much more frequently than hospital patient information leaflets and other traditional sources, as shown by Johnson et al (\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e). Additionally, an increasing number of patients are searching for their conditions online, which can be misleading and exacerbate anxiety due to the presence of irrelevant or inaccurate information. In two cases of lumbar disc herniation, ChatGPT-4o mentioned the word \"tumor\" in the responses. 
Although ChatGPT-4o mentioned \"tumor\" only as a possibility, this can still increase anxiety and fear, especially for patients with low levels of education or no medical background.\u003c/p\u003e \u003cp\u003eThis study has several limitations. First, the questions were based on NASS guidelines and may not fully reflect typical outpatient scenarios, though they allow for an assessment of ChatGPT's recommendations for lumbar disc herniation with radiculopathy. Second, orthopedists' evaluations of ChatGPT's responses are subjective and may differ from the evidence-based NASS guidelines, despite generally aligning with spine surgeons' opinions. Third, this study only examines lumbar disc degeneration using ChatGPT-4o mini and ChatGPT-4o, leaving uncertainty about other models' performance for different conditions. Lastly, the MRI image provided to ChatGPT-4o showed the most prominent lesion, but patients may struggle to understand such images without professional guidance. This limitation may affect a patient's ability to use AI for self-assessment.\u003c/p\u003e"},{"header":"Conclusion","content":"\u003cp\u003eWith the rapid growth of the Internet and the vast availability of accessible medical information, more patients are taking an increasingly active role in managing their healthcare. This study demonstrates that both ChatGPT-4o and ChatGPT-4o mini exhibit strong clinical service capabilities. While the difference in accuracy does not significantly diminish the utility of ChatGPT-4o mini, ChatGPT-4o generally provides more complete and comprehensive answers. For questions requiring a higher level of completeness and security, ChatGPT-4o is the preferred choice. 
Although ChatGPT-4o is effective in identifying lumbar disc herniation in images, its diagnoses may occasionally increase patient anxiety.\u003c/p\u003e"},{"header":"Abbreviations","content":"\u003cp\u003eLBP: low back pain; LDH: lumbar disc herniation; NASS: North American Spine Society; ChatGPT: Chat Generative Pre-trained Transformer; USMLE: United States Medical Licensing Examination.\u0026nbsp;\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eAcknowledgements\u003c/strong\u003e\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eWe are grateful to all the doctors who participated in the study. We also thank the study’s investigators.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAvailability of Data and Materials\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eAll data generated or analyzed during this study are available from the corresponding author on request.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAuthor contributions\u003c/strong\u003e\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eSuning Wang: first author, analysed the data, wrote the first draft of the manuscript, and revised it. Ying Wang: assisted in data analysis and contributed to data curation. Linlin Jiang: assisted in data analysis. Yong Chang: contributed to software operation. Shiji Zhang: assisted in data analysis. Kun Zhao: contributed to revision of the article. Lu Chen: contributed to software operation and assisted in data analysis. 
Chunzheng Gao: corresponding author, agreed to be accountable for all aspects of the work, ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved, and gave final approval of the version to be published. All authors have read and approved the final manuscript.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eEthics approval and consent to participate\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. For human experiments, the trial was conducted in accordance with the Declaration of Helsinki (as revised in 2013). This study was approved by our hospital’s ethics committee (KYLL2024762).\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eFunding\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eNot applicable.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCompeting interests\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe authors declare that they have no competing interests.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\n \u003cli\u003eAndersson GB. Epidemiological features of chronic low-back pain. The Lancet [Internet]. 1999 Aug [cited 2024 Sep 9];354(9178):581\u0026ndash;5. Available from: https://linkinghub.elsevier.com/retrieve/pii/S0140673699013124\u003c/li\u003e\n \u003cli\u003eMartin BI. Expenditures and Health Status Among Adults With Back and Neck Problems. JAMA [Internet]. 2008 Feb 13 [cited 2024 Sep 9];299(6):656. Available from: http://jama.jamanetwork.com/article.aspx?doi=10.1001/jama.299.6.656\u003c/li\u003e\n \u003cli\u003ePojskic M, Bisson E, Oertel J, Takami T, Zygourakis C, Costa F. Lumbar disc herniation: Epidemiology, clinical and radiologic diagnosis WFNS spine committee recommendations. World Neurosurg X [Internet]. 2024 Apr [cited 2024 Sep 9];22:100279. 
Available from: https://linkinghub.elsevier.com/retrieve/pii/S2590139724000103\u003c/li\u003e\n \u003cli\u003eVroomen PCAJ. Diagnostic value of history and physical examination in patients suspected of lumbosacral nerve root compression. J Neurol Neurosurg Psychiatry [Internet]. 2002 May 1 [cited 2024 Sep 9];72(5):630\u0026ndash;4. Available from: https://jnnp.bmj.com/lookup/doi/10.1136/jnnp.72.5.630\u003c/li\u003e\n \u003cli\u003eZhang AS, Xu A, Ansari K, Hardacker K, Anderson G, Alsoof D, et al. Lumbar Disc Herniation: Diagnosis and Management. Am J Med [Internet]. 2023 Jul [cited 2024 Sep 9];136(7):645\u0026ndash;51. Available from: https://linkinghub.elsevier.com/retrieve/pii/S0002934323002528\u003c/li\u003e\n \u003cli\u003eKreiner DS, Hwang SW, Easa JE, Resnick DK, Baisden JL, Bess S, et al. An evidence-based clinical guideline for the diagnosis and treatment of lumbar disc herniation with radiculopathy. Spine J [Internet]. 2014 Jan [cited 2024 Sep 9];14(1):180\u0026ndash;91. Available from: https://linkinghub.elsevier.com/retrieve/pii/S1529943013014502\u003c/li\u003e\n \u003cli\u003eUnveiling the Cognitive Capacity of ChatGPT: Assessing its Human-Like Reasoning Abilities. Int Res J Mod Eng Technol Sci [Internet]. 2024 Apr 15 [cited 2024 Sep 9]; Available from: https://www.irjmets.com/uploadedfiles/paper//issue_4_april_2024/52428/final/fin_irjmets1713128685.pdf\u003c/li\u003e\n \u003cli\u003eWaisberg E, Ong J, Masalkhi M, Kamran SA, Zaman N, Sarker P, et al. GPT-4: a new era of artificial intelligence in medicine. Ir J Med Sci 1971 - [Internet]. 2023 Dec [cited 2024 Sep 9];192(6):3197\u0026ndash;200. Available from: https://link.springer.com/10.1007/s11845-023-03377-8\u003c/li\u003e\n \u003cli\u003eLee P, Bubeck S, Petro J. Benefits, limits, and risks of gpt-4 as an ai chatbot for medicine. Drazen JM, Kohane IS, Leong TY, editors. N Engl J Med [Internet]. 2023 Mar 30 [cited 2024 Sep 9];388(13):1233\u0026ndash;9. 
Available from: http://www.nejm.org/doi/10.1056/NEJMsr2214184\u003c/li\u003e\n \u003cli\u003eGilson A, Safranek CW, Huang T, Socrates V, Chi L, Taylor RA, et al. How Does ChatGPT Perform on the United States Medical Licensing Examination (USMLE)? The Implications of Large Language Models for Medical Education and Knowledge Assessment. JMIR Med Educ [Internet]. 2023 Feb 8 [cited 2024 Sep 9];9:e45312. Available from: https://mededu.jmir.org/2023/1/e45312\u003c/li\u003e\n \u003cli\u003eKung TH, Cheatham M, Medenilla A, Sillos C, De Leon L, Elepa\u0026ntilde;o C, et al. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. Dagan A, editor. PLOS Digit Health [Internet]. 2023 Feb 9 [cited 2024 Sep 9];2(2):e0000198. Available from: https://dx.plos.org/10.1371/journal.pdig.0000198\u003c/li\u003e\n \u003cli\u003eDuey AH, Nietsch KS, Zaidat B, Ren R, Ndjonko LCM, Shrestha N, et al. Thromboembolic prophylaxis in spine surgery: an analysis of ChatGPT recommendations. Spine J [Internet]. 2023 Nov [cited 2024 Sep 9];23(11):1684\u0026ndash;91. Available from: https://linkinghub.elsevier.com/retrieve/pii/S1529943023032850\u003c/li\u003e\n \u003cli\u003eRao A, Pang M, Kim J, Kamineni M, Lie W, Prasad AK, et al. Assessing the Utility of ChatGPT Throughout the Entire Clinical Workflow: Development and Usability Study. J Med Internet Res [Internet]. 2023 Aug 22 [cited 2024 Sep 9];25:e48659. Available from: https://www.jmir.org/2023/1/e48659\u003c/li\u003e\n \u003cli\u003eKayastha A, Lakshmanan K, Valentine MJ, Nguyen A, Dholakia K, Wang D. Lumbar disc herniation with radiculopathy: a comparison of NASS guidelines and ChatGPT. North Am Spine Soc J NASSJ [Internet]. 2024 Sep [cited 2024 Sep 9];19:100333. Available from: https://linkinghub.elsevier.com/retrieve/pii/S266654842400026X\u003c/li\u003e\n \u003cli\u003eMejia MR, Arroyave JS, Saturno M, Ndjonko LCM, Zaidat B, Rajjoub R, et al. 
Use of ChatGPT for Determining Clinical and Surgical Treatment of Lumbar Disc Herniation With Radiculopathy: A North American Spine Society Guideline Comparison. Neurospine [Internet]. 2024 Mar 31 [cited 2024 Sep 9];21(1):149\u0026ndash;58. Available from: http://e-neurospine.org/journal/view.php?doi=10.14245/ns.2347052.526\u003c/li\u003e\n \u003cli\u003eBald A, Richardson H, Al Samaraee A, Fasih T. Quality and readability of online information and materials on post-surgery breast seroma. Br J Hosp Med [Internet]. 2024 Jun 30 [cited 2024 Sep 9];85(6):1\u0026ndash;9. Available from: http://www.magonlinelibrary.com/doi/10.12968/hmed.2024.0058\u003c/li\u003e\n \u003cli\u003eMichel C, Dijanic C, Abdelmalek G, Sudah S, Kerrigan D, Gorgy G, et al. Readability assessment of patient educational materials for pediatric spinal conditions from top academic orthopedic institutions. J Child Orthop [Internet]. 2023 Jun [cited 2024 Sep 9];17(3):284\u0026ndash;90. Available from: http://journals.sagepub.com/doi/10.1177/18632521231156435\u003c/li\u003e\n \u003cli\u003eHolmlund M, Hagelb\u0026auml;ck J, Lundstr\u0026ouml;m O. Bachelor\u0026rsquo;s degree Project.\u003c/li\u003e\n \u003cli\u003eShanahan M. Talking About Large Language Models [Internet]. arXiv; 2023 [cited 2024 Sep 9]. Available from: http://arxiv.org/abs/2212.03551\u003c/li\u003e\n \u003cli\u003eJohnson D, Goodman R, Patrinely J, Stone C, Zimmerman E, Donald R, et al. Assessing the Accuracy and Reliability of AI-Generated Medical Responses: An Evaluation of the Chat-GPT Model [Internet]. 2023 [cited 2024 Sep 9]. 
Available from: https://www.researchsquare.com/article/rs-2566942/v1\u003c/li\u003e\n\u003c/ol\u003e"},{"header":"Tables","content":"\u003cp\u003e\u003cstrong\u003eTable 1.\u003c/strong\u003e ChatGPT-4o mini\u0026nbsp;vs.\u0026nbsp;ChatGPT-4o.\u003c/p\u003e\n\u003ctable border=\"0\" cellspacing=\"0\" cellpadding=\"0\" width=\"540\"\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\" style=\"width: 202px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 139px;\"\u003e\n \u003cp\u003eChatGPT-4o mini\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 134px;\"\u003e\n \u003cp\u003eChatGPT-4o\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 65px;\"\u003e\n \u003cp\u003e\u003cem\u003eP\u003c/em\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003eMean accuracy\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e4.63\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e4.65\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e0.77\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003eMean completeness\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e4.57\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e4.72\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e0.04\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003eMean safety\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e6.29\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e6.42\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n 
\u003cp\u003e0.11\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n\u003c/table\u003e\n\u003cp\u003e\u003cstrong\u003eTable 2.\u003c/strong\u003e Comparison between different groups in two models.\u003c/p\u003e\n\u003ctable border=\"0\" cellspacing=\"0\" cellpadding=\"0\" width=\"622\"\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\" style=\"width: 127px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 95px;\"\u003e\n \u003cp\u003eGroup 1\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 88px;\"\u003e\n \u003cp\u003eGroup 2\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 80px;\"\u003e\n \u003cp\u003eGroup 3\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 76px;\"\u003e\n \u003cp\u003eGroup 4\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 87px;\"\u003e\n \u003cp\u003eGroup 5\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 68px;\"\u003e\n \u003cp\u003e\u003cem\u003eP\u003c/em\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003eChatGPT-4o mini\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e\u003cem\u003e\u0026nbsp;\u003c/em\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e\u0026nbsp; Mean 
accuracy\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e4.90\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e4.80\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e4.68\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e4.63\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e4.35\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e0.02\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e\u0026nbsp; Mean completeness\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e4.80\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e4.60\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e4.44\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e4.60\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e4.55\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e0.48\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e\u0026nbsp; Mean safety\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e7.00\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e6.50\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e6.04\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e6.25\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e6.20\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e\u0026lt;0.01\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\"\u003e\n 
\u003cp\u003eChatGPT-4o\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e\u003cem\u003e\u0026nbsp;\u003c/em\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e\u0026nbsp; Mean accuracy\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e5.00\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e4.80\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e4.60\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e4.63\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e4.50\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e0.07\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e\u0026nbsp; Mean completeness\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e4.90\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e4.90\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e4.76\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e4.73\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e4.50\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e0.08\u003c/p\u003e\n \u003c/td\u003e\n 
\u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e\u0026nbsp; Mean safety\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e7.00\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e6.50\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e6.16\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e6.48\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e6.35\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e\u0026lt;0.01\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n\u003c/table\u003e\n\u003cp\u003e\u003cstrong\u003eTable 3.\u003c/strong\u003e Flesch Reading Ease scores of the NASS Clinical Guidelines and the responses from ChatGPT-4o mini and ChatGPT-4o to NASS questions.\u003c/p\u003e\n\u003ctable border=\"0\" cellspacing=\"0\" cellpadding=\"0\" width=\"643\"\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\" style=\"width: 136px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 109px;\"\u003e\n \u003cp\u003eFlesch Reading Ease score\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 206px;\"\u003e\n \u003cp\u003eReadability level\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 192px;\"\u003e\n \u003cp\u003eEducation Level Required\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003eNASS guideline\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e5.89\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003eExtremely difficult to read\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003eProfessional\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n 
\u003ctr\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003eChatGPT-4o mini\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e19.72\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003eVery difficult to read\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003eCollege graduate\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003eChatGPT-4o\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e17.41\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003eVery difficult to read\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003eCollege graduate\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n\u003c/table\u003e\n\u003cp\u003e*Flesch Reading Ease scores were used to evaluate interpretability and accessibility to the public.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eTable 4.\u003c/strong\u003e Performance of ChatGPT-4o in recognizing LDH.\u003c/p\u003e\n\u003ctable border=\"0\" cellspacing=\"0\" cellpadding=\"0\" width=\"488\"\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\" style=\"width: 164px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 140px;\"\u003e\n \u003cp\u003eLDH (n=31)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 185px;\"\u003e\n \u003cp\u003eN-LDH (n=22)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003eCorrectly classified by ChatGPT-4o\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e27\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e16\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\"\u003e\n 
\u003cp\u003ePrecision\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e0.82\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e0.80\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003eRecall\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e0.87\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e0.73\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003eF1-score\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e0.84\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003e0.76\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003eAccuracy\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"2\" valign=\"bottom\"\u003e\n \u003cp\u003e0.81\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003eSensitivity\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"2\" valign=\"bottom\"\u003e\n \u003cp\u003e0.87\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\"\u003e\n \u003cp\u003eSpecificity\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"2\" valign=\"bottom\"\u003e\n \u003cp\u003e0.73\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n 
\u003c/tbody\u003e\n\u003c/table\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":false,"highlight":"","institution":"","isAcceptedByJournal":true,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"european-journal-of-medical-research","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"ejmr","sideBox":"Learn more about [European Journal of Medical Research](http://eurjmedres.biomedcentral.com)","snPcode":"40001","submissionUrl":"https://submission.nature.com/new-submission/40001/3","title":"European Journal of Medical Research","twitterHandle":"@BioMedCentral","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"em","reportingPortfolio":"BMC/SO AJ","inReviewEnabled":true,"inReviewRevisionsEnabled":true},"keywords":"ChatGPT, Lumbar disc herniation, Clinical guidelines, Artificial intelligence, Spine","lastPublishedDoi":"10.21203/rs.3.rs-5121204/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-5121204/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003e\u003cstrong\u003ePurpose:\u003c/strong\u003e This study evaluated and compared the clinical support capabilities of ChatGPT-4.0 and ChatGPT-4.0-mini in diagnosing and treating lumbar disc herniation (LDH) with radiculopathy.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eMethods:\u003c/strong\u003e Twenty-one questions (across 5 categories) from NASS Clinical Guidelines were input into ChatGPT-4.0 and ChatGPT-4.0-mini. Five orthopedic surgeons assessed their responses using a 5-point Likert scale for accuracy and completeness, and a 7-point scale for safety. 
Flesch Reading Ease scores were calculated to assess readability. Additionally, ChatGPT-4.0 analyzed lumbar images from 53 patients, comparing its recognizable agreement with orthopedic surgeons using Kappa values.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eResults:\u003c/strong\u003e Both models demonstrated strong clinical support capabilities with no significant differences in accuracy or safety. However, ChatGPT-4.0 provided more comprehensive and consistent responses. The Flesch Reading Ease scores for both models indicated that their generated content was “very difficult to read,” potentially limiting patient accessibility. In evaluating lumbar disc herniation images, ChatGPT-4.0 achieved an overall accuracy of 0.81, with LDH recognition precision, recall, and F1 scores exceeding 0.80. The AUC was 0.80, and the Kappa value was 0.61, indicating moderate agreement between the model's predictions and actual diagnoses, though with room for improvement.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eConclusion: \u003c/strong\u003eWhile both models are effective, ChatGPT-4.0 offers more comprehensive clinical responses, making it more suitable for high-integrity medical tasks. 
However, the difficulty in reading AI-generated content and occasional use of misleading terms, such as “tumor,” indicate a need for further improvements to reduce patient anxiety.\u003c/p\u003e","manuscriptTitle":"Assessing the Clinical Support Capabilities of ChatGPT-4o and ChatGPT-4o Mini in Managing Lumbar Disc Herniation","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2024-11-08 09:09:04","doi":"10.21203/rs.3.rs-5121204/v1","editorialEvents":[{"type":"communityComments","content":0},{"type":"decision","content":"Revision requested","date":"2024-10-18T07:23:19+00:00","index":"","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2024-10-09T17:09:45+00:00","index":"hide","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2024-10-04T14:58:51+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"239441639239448984065827114209865656531","date":"2024-09-27T11:27:36+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"126790614351443005811805786646458144217","date":"2024-09-24T14:55:00+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"69434099010938033301851980934329107139","date":"2024-09-24T06:07:21+00:00","index":"hide","fulltext":""},{"type":"reviewersInvited","content":"","date":"2024-09-21T10:23:47+00:00","index":"","fulltext":""},{"type":"editorAssigned","content":"","date":"2024-09-21T10:20:16+00:00","index":"","fulltext":""},{"type":"checksComplete","content":"","date":"2024-09-21T04:10:37+00:00","index":"","fulltext":""},{"type":"submitted","content":"European Journal of Medical Research","date":"2024-09-20T06:36:55+00:00","index":"","fulltext":""}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"european-journal-of-medical-research","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"ejmr","sideBox":"Learn more about [European Journal of Medical 
Research](http://eurjmedres.biomedcentral.com)","snPcode":"40001","submissionUrl":"https://submission.nature.com/new-submission/40001/3","title":"European Journal of Medical Research","twitterHandle":"@BioMedCentral","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"em","reportingPortfolio":"BMC/SO AJ","inReviewEnabled":true,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"ba3a5bff-dbaf-4d5a-9a18-613c5c52ab44","owner":[],"postedDate":"November 8th, 2024","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"published-in-journal","subjectAreas":[],"tags":[],"updatedAt":"2025-01-27T16:07:19+00:00","versionOfRecord":{"articleIdentity":"rs-5121204","link":"https://doi.org/10.1186/s40001-025-02296-x","journal":{"identity":"european-journal-of-medical-research","isVorOnly":false,"title":"European Journal of Medical Research"},"publishedOn":"2025-01-22 15:58:04","publishedOnDateReadable":"January 22nd, 2025"},"versionCreatedAt":"2024-11-08 09:09:04","video":"","vorDoi":"10.1186/s40001-025-02296-x","vorDoiUrl":"https://doi.org/10.1186/s40001-025-02296-x","workflowStages":[]},"version":"v1","identity":"rs-5121204","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-5121204","identity":"rs-5121204","version":["v1"]},"buildId":"8U1c8b4HqxoKbykW_rLl7","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}
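The image-recognition figures reported in Table 4 and in the Discussion (precision, recall, F1, accuracy, Kappa, AUC) can be re-derived from the two counts in the table. The sketch below is illustrative only and is not the authors' analysis code. It assumes that the row showing 27 and 16 gives the number of correctly classified images in each group (27 of 31 LDH, 16 of 22 N-LDH), and it assumes the AUC corresponds to a single yes/no decision per image, in which case it equals the mean of sensitivity and specificity. The names `ConfusionCounts`, `summarize`, `TABLE4` and `RESULTS` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """2x2 counts with LDH treated as the positive class."""
    tp: int  # LDH images labelled LDH
    fn: int  # LDH images labelled N-LDH
    tn: int  # N-LDH images labelled N-LDH
    fp: int  # N-LDH images labelled LDH


def summarize(c: ConfusionCounts) -> dict:
    """Return the agreement and discrimination metrics for one 2x2 table."""
    n = c.tp + c.fn + c.tn + c.fp
    sensitivity = c.tp / (c.tp + c.fn)   # recall for LDH
    specificity = c.tn / (c.tn + c.fp)   # recall for N-LDH
    accuracy = (c.tp + c.tn) / n
    # Chance agreement for Cohen's kappa: product of the marginal
    # proportions for each class, summed over both classes.
    expected = (
        (c.tp + c.fn) * (c.tp + c.fp) + (c.tn + c.fp) * (c.tn + c.fn)
    ) / n**2
    return {
        "precision_ldh": c.tp / (c.tp + c.fp),
        "recall_ldh": sensitivity,
        "f1_ldh": 2 * c.tp / (2 * c.tp + c.fp + c.fn),
        "precision_nldh": c.tn / (c.tn + c.fn),
        "recall_nldh": specificity,
        "f1_nldh": 2 * c.tn / (2 * c.tn + c.fn + c.fp),
        "accuracy": accuracy,
        "kappa": (accuracy - expected) / (1 - expected),
        # Assumption: one binary decision per image, so the ROC curve has a
        # single operating point and AUC reduces to balanced accuracy.
        "auc_single_cutoff": (sensitivity + specificity) / 2,
    }


# Assumed reading of Table 4: 27 of 31 LDH and 16 of 22 N-LDH images correct.
TABLE4 = ConfusionCounts(tp=27, fn=31 - 27, tn=16, fp=22 - 16)
RESULTS = summarize(TABLE4)

if __name__ == "__main__":
    for name, value in RESULTS.items():
        print(f"{name}: {value:.2f}")
```

Under these assumptions the rounded values match those reported in Table 4 and in the text, which suggests the table is internally consistent. How the authors actually computed the AUC is not stated in this section.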
