ChatGPT Performance in a questionnaire on rheumatological diseases: A Comparison with Specialist’s Opinion

doi:10.21203/rs.3.rs-6484816/v1

2025 · doi:10.21203/rs.3.rs-6484816/v1

preprint OA: closed CC-BY-4.0

Research Article

Lucas Gonçalves, Carlos Antonio Moura

This is a preprint; it has not been peer reviewed by a journal. https://doi.org/10.21203/rs.3.rs-6484816/v1

This work is licensed under a CC BY 4.0 License. Status: Under Review. Version 1.

Abstract

Background

This study aims to compare the performance of Generative Pretrained Transformer Chat 4.0 (ChatGPT 4.0) with rheumatologists of varying experience levels in a questionnaire on systemic lupus erythematosus (SLE), rheumatoid arthritis (RA), ankylosing spondylitis (AS), psoriatic arthritis (PsA), and fibromyalgia (FM).
Methods

In this cross-sectional study, a 25-question questionnaire (five questions per disease) was administered to ChatGPT 4.0 and four pairs of rheumatologists with different experience levels (less than 5 years, 5–10 years, 11–20 years, and 21–30 years). Two rheumatologists with more than 30 years of experience and linked to academic services blindly evaluated the responses as "agree" or "disagree". Where the two evaluators disagreed, a third rheumatologist resolved the dispute.

Results

The group with 5–10 years of experience had the best overall performance, with a 70% agreement probability with the evaluators, followed by ChatGPT 4.0 at 68%. The group with 21–30 years of experience had the worst performance (58%). ChatGPT 4.0 outperformed all other groups in questions regarding the first treatment option and the most effective imaging exams for investigation (100% in both). However, it had the poorest performance in identifying the most useful sign or symptom for diagnosing each disease.

Conclusions

ChatGPT 4.0 excelled in areas requiring less practical knowledge, such as treatment choices and diagnostic imaging exams. Conversely, it performed poorly in questions necessitating experience-based knowledge, particularly in identifying key diagnostic signs and symptoms.

Keywords: ChatGPT, Rheumatology, Artificial Intelligence, Diagnosis, Informatics

Background

Since ancient Greece, the concept of intelligence has been closely associated with wisdom and its authentic connection to reality [1]. In recent times, however, the notion of intelligence has increasingly assumed a quantitative character, driven by the progressive mathematization of thought throughout the 20th century. This trend is reflected in the growing replacement of clinical judgment with statistical and actuarial approaches [2, 3], in which quantity has become the primary expression of reality.
As a result, science, including medicine, has progressively adopted mathematical and statistical tools for risk assessment and decision-making in real-life contexts.

With technological advancement, computational systems have acquired elements of "intelligence," characterized by their ability to estimate probabilities and suggest courses of action based on input data, giving rise to what is now known as artificial intelligence (AI). Since then, various levels of AI have been developed [4], ranging from early systems such as the chess engine Deep Blue to more advanced models like the Generative Pretrained Transformer Chat (ChatGPT), a sophisticated architecture based on deep learning.

ChatGPT refers to a family of AI models developed by OpenAI, grounded in a large language model (LLM) that has undergone a process called reinforcement learning from human feedback (RLHF), which is fundamentally dependent on human input [5]. This development allows ChatGPT to produce coherent, context-aware responses in natural language when prompted on a broad range of topics.

In medicine, rheumatology is among the most complex specialties, as it primarily deals with subjective symptoms and clinical signs, while diagnostic tests often lack definitive specificity or sensitivity [6]. Consequently, several imprecise, score-based tools have been developed to reduce variability in clinical decision-making, such as the Systemic Lupus Erythematosus Disease Activity Index 2000 (SLEDAI-2K) [7], the Clinical Disease Activity Index (CDAI) [8], and the Disease Activity Score 28-joint count (DAS-28) [9].

AI-based tools have the potential to assist physicians in navigating complex clinical scenarios by providing structured and reproducible outputs. While previous studies have evaluated AI in contexts such as diagnostic imaging and medical education, its application to routine clinical decision-making in rheumatology remains underexplored.
Given that ChatGPT enables interaction through structured, logical discourse akin to the dialectical method, we aimed to compare the answers provided by ChatGPT-4.0 with those of rheumatologists at different levels of experience, using a standardized questionnaire addressing systemic lupus erythematosus (SLE), rheumatoid arthritis (RA), ankylosing spondylitis (AS), psoriatic arthritis (PsA), and fibromyalgia (FM).

Methods

This was a cross-sectional study designed to compare the performance of ChatGPT-4.0 with that of rheumatologists at varying levels of clinical experience. All participants were affiliated with academic medical services in Salvador, Bahia, Brazil, and addressed clinical, diagnostic, and therapeutic questions related to SLE, RA, AS, PsA, and FM. All participating physicians signed informed consent forms. The study received full approval from the relevant institutional ethics committees (CAEE 71172023.7.0000.5027) and was conducted entirely in Brazilian Portuguese.

Questionnaire Administration

A standardized questionnaire consisting of 25 questions (five for each of the five diseases) was developed (Table 1). The questionnaire was administered to four distinct groups of rheumatologists, each group comprising two professionals at the same level of clinical experience: Group A (< 5 years), Group B (5–10 years), Group C (11–20 years), and Group D (21–30 years). The questionnaires were administered in person during February and March 2024. Participants were instructed in writing to “answer objectively.” No interventions or clarifications were provided by the researchers during the response process.

The same set of questions was submitted to ChatGPT-4.0 in April 2024.
To maintain consistency, each question was asked in a separate chat session and preceded by the instruction: “Answer without explanations.” The responses were not edited, except for the removal of explanatory phrases and clarifying parentheses; for instance, expressions such as “malar rash (butterfly-shaped rash)” or “anti-CCP antibody (cyclic citrullinated peptide)” were stripped of their parenthetical components.

Table 1. Questions applied to ChatGPT 4.0 and rheumatologists

Question type 1: Name which sign or symptom, for you, is most useful in diagnosing [SLE, RA, AS, PsA and FM] in clinical practice.
Question type 2: Name the two diagnoses that you consider most closely mimic [SLE, RA, AS, PsA and FM].
Question type 3: Name, if any, the radiological examination that you think best contributes to the diagnosis of [SLE, RA, AS, PsA and FM].
Question type 4: Which laboratory test do you consider to have the greatest specificity for diagnosing [SLE, RA, AS, PsA and FM]?
Question type 5: Considering pharmacological treatment, name your first therapeutic option for [SLE, RA, AS, PsA and FM].

Each of the five question types was asked for each of the five diseases, giving a questionnaire of twenty-five questions.
AS: Ankylosing spondylitis; FM: Fibromyalgia; PsA: Psoriatic arthritis; RA: Rheumatoid arthritis; SLE: Systemic lupus erythematosus.
* All questions were submitted in Portuguese.

Assessment of Responses

All questionnaire responses were independently and blindly evaluated by two senior rheumatologists, each with over 30 years of clinical and academic experience. Each evaluator assigned a binary score to each response: 0 (disagree) or 1 (agree). In cases of disagreement between the two evaluators, a third rheumatologist served as an adjudicator to determine the final score. Thus, the total score for each completed questionnaire ranged from 0 (no agreement) to 25 (full agreement).
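The scoring rule described above (two blinded binary ratings, with a third rater deciding only where they differ) can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' procedure: the function names `final_score` and `questionnaire_total` and the ratings are hypothetical, and the five example responses are not data from the study.

```python
from typing import Optional, Sequence


def final_score(rater1: int, rater2: int, adjudicator: Optional[int] = None) -> int:
    """Return the final 0/1 score for a single response."""
    if rater1 == rater2:
        # Both evaluators agree, so their shared score stands.
        return rater1
    if adjudicator is None:
        raise ValueError("evaluators disagree, so an adjudicator score is required")
    # The third rheumatologist's score settles the disagreement.
    return adjudicator


def questionnaire_total(
    rater1: Sequence[int],
    rater2: Sequence[int],
    adjudicator: Sequence[Optional[int]],
) -> int:
    """Sum the final scores over all responses of one questionnaire."""
    return sum(final_score(a, b, c) for a, b, c in zip(rater1, rater2, adjudicator))


# Hypothetical ratings for five responses (not data from the study).
first = [1, 1, 0, 1, 0]
second = [1, 0, 0, 1, 1]
third = [None, 1, None, None, 0]  # only filled in where the first two disagree

total = questionnaire_total(first, second, third)
print(total)  # 3
```

With 25 responses per questionnaire, the same function yields the 0 to 25 total described in the text.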
All questions posed to both ChatGPT-4.0 and the rheumatologists were presented in Portuguese.

Statistical Analysis

Descriptive statistics included the calculation of frequencies and percentages. Mean scores were computed for each group. For example, if two rheumatologists in Group A received scores of 3 and 4 on the SLE questions, their mean score of 3.5 was used for comparison with the other groups and with the score obtained by ChatGPT-4.0. An overall mean score for all rheumatologist responses was also calculated. Cohen’s kappa coefficient was used to assess inter-rater agreement between the expert reviewers on the responses provided by rheumatologists across different levels of experience. Additionally, probability analyses were performed to evaluate relative performance between groups. All statistical analyses were conducted using SPSS software (version 25), and graphical representations were generated using Microsoft Excel.

Results

Overall data

All participants answered the 25 questions. Initially, when the questions were put to ChatGPT 4.0 using only the “answer objectively” command (the same command given to the rheumatologists), the answers were verbose, making it necessary to switch to a more specific command, “answer without explanations,” which produced the answers used for the study (Supplementary File 1).

Of the 225 responses evaluated (25 questions for each of the 9 respondents), 67 (30%) required evaluation by a third rheumatologist because the first two evaluators disagreed. The final scores attributed to each question evaluated in this study are available in Supplementary File 2. Disagreements were analyzed according to disease, question type, and participant. They were most prevalent for AS (44%), Group C (34%), and type 2 questions (56%). More details are available in Table 2.
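The agreement statistics named in the Statistical Analysis section can be illustrated with a short Python sketch. The group mean (3.5) and the observed agreement (158 of 225 responses, since 67 needed a third evaluator) follow directly from the figures in the text. Cohen's kappa additionally needs each evaluator's individual ratings, which are not reported here, so the two rating lists below are hypothetical and the function `cohens_kappa` is a generic textbook implementation, not the SPSS procedure the authors used.

```python
from collections import Counter
from statistics import mean


def cohens_kappa(rater1, rater2):
    """Cohen's kappa for two raters who scored the same items."""
    if len(rater1) != len(rater2) or not rater1:
        raise ValueError("both raters must score the same non-empty set of items")
    n = len(rater1)
    # Proportion of items on which the two raters gave the same score.
    observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    # Agreement expected by chance, from each rater's marginal frequencies.
    counts1, counts2 = Counter(rater1), Counter(rater2)
    expected = sum(counts1[k] * counts2[k] for k in set(counts1) | set(counts2)) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1 - expected)


# Group mean from the example in the text: scores of 3 and 4 give 3.5.
group_a_sle_mean = mean([3, 4])

# Observed agreement implied by the text: 67 of 225 responses were disputed.
observed_agreement = (225 - 67) / 225

# Hypothetical ratings for ten responses (not data from the study).
evaluator_1 = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
evaluator_2 = [1, 0, 0, 1, 1, 1, 1, 0, 0, 1]
kappa = cohens_kappa(evaluator_1, evaluator_2)

print(group_a_sle_mean)              # 3.5
print(round(observed_agreement, 3))  # 0.702
print(round(kappa, 3))               # 0.348
```

Kappa is lower than the raw agreement because part of the raw agreement would be expected by chance alone when both raters score most responses as "agree".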
Table 2. Questions that required a third evaluator

| Grouping | Category | Sent to third evaluator | Total responses | % |
|---|---|---|---|---|
| By question type | Type 1 | 9 | 45 | 20 |
| | Type 2 | 25 | 45 | 56 |
| | Type 3 | 18 | 45 | 40 |
| | Type 4 | 9 | 45 | 20 |
| | Type 5 | 6 | 45 | 13 |
| By disease | SLE | 15 | 45 | 33 |
| | RA | 6 | 45 | 13 |
| | AS | 20 | 45 | 44 |
| | PsA | 15 | 45 | 33 |
| | FM | 11 | 45 | 24 |
| By participant | ChatGPT 4.0 | 8 | 25 | 32 |
| | Group A | 14 | 50 | 28 |
| | Group B | 12 | 50 | 24 |
| | Group C | 17 | 50 | 34 |
| | Group D | 16 | 50 | 32 |

* Type 1: “Name which sign or symptom, for you, is most useful in diagnosing … in clinical practice.”; Type 2: “Name the two diagnoses that you consider most closely mimic ….”; Type 3: “Name, if any, the radiological examination that you think best contributes to the diagnosis of ….”; Type 4: “Which laboratory test do you consider to have the greatest specificity for diagnosing …?”; Type 5: “Considering pharmacological treatment, name your first therapeutic option for ….”.
* AS: Ankylosing spondylitis; FM: Fibromyalgia; PsA: Psoriatic arthritis; RA: Rheumatoid arthritis; SLE: Systemic lupus erythematosus.
* ChatGPT = Generative Pretrained Transformer Chat; Group A = Pair of Rheumatologists with less than 5 years of experience; Group B = Pair of Rheumatologists with 5 to 10 years of experience; Group C = Pair of Rheumatologists with 11 to 20 years of experience; Group D = Pair of Rheumatologists with 21 to 30 years of experience.

The best overall performance (Table 3) was observed in Group B (70% probability of agreement with the evaluators), followed by ChatGPT 4.0 (68%). The worst overall performance was from Group D (58%).

By disease data

In the analysis of answer performance across different diseases (Table 3), ChatGPT 4.0 outperformed the experts' average (calculated as the simple mean of the scores from the 8 rheumatologists) on questions about SLE, PsA, and FM. For AS, there was no difference between ChatGPT 4.0 and the experts' average. For RA, the experts' average was slightly higher.
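The by-disease comparison can be reproduced from the group percentages reported in Table 3. The sketch below assumes that the experts' average is the equal-weight mean of the four group percentages, which matches the mean over 8 rheumatologists because every group has two members. The variable names are illustrative, and the unrounded means (for example 67.5 for SLE) appear in Table 3 rounded to whole numbers.

```python
# Group percentages by disease, as reported in Table 3.
group_scores = {
    "SLE": {"A": 80, "B": 70, "C": 60, "D": 60},
    "RA": {"A": 60, "B": 60, "C": 50, "D": 80},
    "AS": {"A": 60, "B": 70, "C": 60, "D": 50},
    "PsA": {"A": 60, "B": 70, "C": 50, "D": 30},
    "FM": {"A": 70, "B": 80, "C": 80, "D": 70},
}
chatgpt_scores = {"SLE": 80, "RA": 60, "AS": 60, "PsA": 60, "FM": 80}


def experts_average(groups):
    """Equal-weight mean of the group percentages (groups are the same size)."""
    return sum(groups.values()) / len(groups)


averages = {disease: experts_average(g) for disease, g in group_scores.items()}

comparison = {}
for disease, avg in averages.items():
    diff = chatgpt_scores[disease] - avg
    if diff > 0:
        comparison[disease] = "above"
    elif diff < 0:
        comparison[disease] = "below"
    else:
        comparison[disease] = "equal"

for disease in group_scores:
    print(disease, averages[disease], comparison[disease])
```

Run as written, this classifies ChatGPT 4.0 as above the experts' average for SLE, PsA, and FM, equal for AS, and below for RA, in line with the text.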
Among the rheumatologist groups, those with 5–10 years of experience (Group B) generally performed best, with the highest or joint-highest score for AS, PsA, and FM (Table 3).

By question type data

In the analysis by question type (Table 3), ChatGPT 4.0 had its worst performance on type 1 questions, scoring 40% compared with 78% for the average of all rheumatologists (range across rheumatologist groups: 70% to 90%). ChatGPT 4.0 also performed below the experts’ average on type 4 questions. However, ChatGPT 4.0 had the best performance among all groups on type 3 and type 5 questions, with 100% agreement probability with the evaluators.

Table 3. ChatGPT 4.0 and specialists' performance

| | ChatGPT 4.0 | Group A | Group B | Group C | Group D | Experts’ Avg |
|---|---|---|---|---|---|---|
| Overall (%) | 68 | 66 | 70 | 60 | 58 | 63.5 |
| By disease score (%) | | | | | | |
| SLE | 80 | 80 | 70 | 60 | 60 | 68 |
| RA | 60 | 60 | 60 | 50 | 80 | 63 |
| AS | 60 | 60 | 70 | 60 | 50 | 60 |
| PsA | 60 | 60 | 70 | 50 | 30 | 53 |
| FM | 80 | 70 | 80 | 80 | 70 | 75 |
| By question type score (%) | | | | | | |
| Type 1 | 40 | 80 | 90 | 70 | 70 | 78 |
| Type 2 | 60 | 30 | 70 | 40 | 40 | 45 |
| Type 3 | 100 | 80 | 60 | 60 | 70 | 68 |
| Type 4 | 40 | 50 | 60 | 50 | 40 | 50 |
| Type 5 | 100 | 90 | 70 | 80 | 70 | 78 |

* Type 1: “Name which sign or symptom, for you, is most useful in diagnosing … in clinical practice.”; Type 2: “Name the two diagnoses that you consider most closely mimic ….”; Type 3: “Name, if any, the radiological examination that you think best contributes to the diagnosis of ….”; Type 4: “Which laboratory test do you consider to have the greatest specificity for diagnosing …?”; Type 5: “Considering pharmacological treatment, name your first therapeutic option for ….”.
* AS: Ankylosing spondylitis; FM: Fibromyalgia; PsA: Psoriatic arthritis; RA: Rheumatoid arthritis; SLE: Systemic lupus erythematosus.
* Avg = Average (the simple mean of the scores from the 8 rheumatologists); ChatGPT = Generative Pretrained Transformer Chat; Group A = Pair of Rheumatologists with less than 5 years of experience; Group B = Pair of Rheumatologists with 5 to 10 years of experience; Group C = Pair of Rheumatologists with 11 to 20 years of experience; Group D = Pair of Rheumatologists with 21 to 30 years of experience.

Discussion

Technological advancements over the past century have significantly transformed medical knowledge and clinical practice. The complexity of modern medicine has introduced a growing number of variables beyond traditional clinical concerns, necessitating the incorporation of more quantitative approaches. As early as the second half of the 20th century, some authors proposed replacing clinical judgment with the use of mathematical formulas [10]. These formulas proved difficult to apply in practice, and clinical judgment remains a cornerstone of medical practice, especially when interpreting the often-mechanistic results provided by statistical methods [11].

In this context, ChatGPT-4.0 presents itself as a novel tool capable of facilitating doctor-machine interactions that resemble a dialectical process rather than a purely logical-formal one, delivering responses in a conversational, human-like manner. This positions the AI as a potential information source, where data is provided in a more organic and articulated fashion.

Previous studies have tested the performance of AI tools like ChatGPT in the context of medical examinations [12], and the potential applications of AI in medicine have been widely discussed [13]. However, the clinical use of AI tools, particularly in the field of rheumatology, remains underexplored, and further research is needed.
Some studies comparing AI models and specialists in practical rheumatology settings [14, 15, 16] have yielded results consistent with ours: AI demonstrates generally acceptable performance, but with notable areas for improvement.

Our findings highlighted a particular weakness of ChatGPT-4.0 when responding to questions about the most useful signs or symptoms for diagnosis, questions that require an understanding grounded in clinical experience. This contrasts with its strong performance on other types of questions. This discrepancy is likely due to the subjective nature of these diagnostic questions, which depend heavily on the clinician’s bedside experience in differentiating between possible disease manifestations. In evidence-based medicine, the terms sensitivity and specificity attempt, albeit imperfectly, to address this distinction [17]. This is especially critical in rheumatology, given the broad spectrum of disease manifestations and the overlap of symptoms across various conditions.

Regarding treatment choices and imaging studies, it is unsurprising that ChatGPT-4.0 performed better than human experts, although it scored below the experts' average on laboratory tests (Table 3). These decisions often follow established protocols and guidelines, to which ChatGPT has direct access. Moreover, such decisions tend to be less subjective and less controversial than diagnostic judgments about which signs or symptoms are most useful. Therefore, it makes sense that the AI’s performance would be strong in these areas.

Another intriguing finding in our study was the lower performance of the most experienced group (Group D). We do not attribute this to the actual length of experience of these specialists, but rather to the nature of the questions being evaluated. As rheumatologists accumulate more experience, their perspectives become more nuanced and complex, making seemingly “simple questions” more challenging.
In many cases, this reflects a departure from basic, standardized concepts, as more experienced clinicians focus on complex or individualized approaches to rheumatology. Furthermore, this complexity may be compounded by reduced involvement in academic settings. This is underscored by the fact that the evaluating experts, who themselves had over 30 years of experience, did not align with the responses of Group D. The performance of the most experienced group therefore suggests a need for continuing medical education (CME) rather than a critique of their expertise.

Our study also presents several limitations. One major issue is the variability in ChatGPT-4.0’s responses to the same questions. We minimized this variability by employing input that promoted more objective responses, but when using the original “answer objectively” command (the same given to human experts), the AI produced responses that were too verbose for the purposes of our evaluation. This led to a change in the input instruction to “answer objectively and without explanations.” Another limitation is the small sample size, which restricts the generalizability of our findings and limits the ability to draw definitive statistical conclusions regarding the utility of ChatGPT-4.0 in rheumatology. Additionally, ChatGPT-4.0 is subject to continuous updates and modifications, meaning that our study provides a snapshot of its performance as of April 2024 in its 4.0 version.

Ultimately, the primary goal of our study was to describe the results obtained and highlight specific findings that could not only address existing questions but also serve as a springboard for future research. The performance of ChatGPT-4.0 in this study provides valuable insights and paves the way for more comprehensive evaluations of AI tools in the field of rheumatology.
Conclusion

ChatGPT-4.0 demonstrated a strong overall performance, surpassing some groups of rheumatologists in certain areas. The AI particularly excelled in questions related to treatment options and other topics requiring less clinical experience and practical knowledge. However, it showed weaker performance in questions concerning signs and symptoms, which are more dependent on clinical experience. Despite these findings, this study represents only a preliminary assessment of ChatGPT's potential applications in the field of rheumatology. Further research is needed to explore its full capabilities and limitations in clinical settings.

Declarations

Consent for publication
Not applicable.

Availability of data and materials
The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.

Ethics approval and consent to participate
Not applicable.

Competing interests
The authors declare that they have no competing interests.

Funding
The authors have not declared a specific grant for this research from any funding agency in the public, commercial or non-profit sectors.

Authors' contributions
Lucas Gonçalves: data collection, data analysis and article writing. Carlos Antônio Moura: article writing and article review.

Acknowledgements
Not applicable.

References

1. Aristotle. De Anima (On the Soul). Penguin UK; 1986.
2. Sternberg RJ. Intelligence. Dialogues Clin Neurosci. 2012;14(1):19–27.
3. Jaime-Ramírez-Barba E, Vargas-Salado E, Domínguez-Garibaldi F, et al. Medicina, matemáticas y estadística. Aspectos prácticos. Bol Med Hosp Infant Mex. 1990;47(10):689–93.
4. SITNFlash. The History of Artificial Intelligence. Science in the News. 2017. [accessed 07 August] Available from: https://sitn.hms.harvard.edu/flash/2017/history-artificial-intelligence/
5. Introducing ChatGPT. OpenAI. [accessed 07 August] Available from: https://openai.com/index/chatgpt/
6. Moura CA, Moura C. Geral. What is the essence of being a rheumatologist? The Rheumatologist. 2023 May 9.
7. Gladman DD, Ibanez D, Urowitz MB. Systemic lupus erythematosus disease activity index 2000. J Rheumatol. 2002;29(2):288–91.
8. Aletaha D, Nell VP, Stamm T, et al. Acute phase reactants add little to composite disease activity indices for rheumatoid arthritis: validation of a clinical activity score. Arthritis Res Therapy. 2005;7(4):R796–806.
9. Van der Heijde DM, Jacobs JW. The original DAS and the DAS28 are not interchangeable: comment on the articles by Prevoo. Arthritis Rheum. 1998;41(5):942–5. 10.1002/1529-0131.
10. Meehl PE. When shall we use our heads instead of the formula? J Couns Psychol. 1957;4(4):268–73.
11. Faust D, et al. Response: Clinical and Actuarial Judgment. Science. 1990;247:146–147. 10.1126/science.247.4939.146.b
12. Kung TH, Cheatham M, Medenilla A, et al. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLOS Digit Health. 2023;2(2):e0000198.
13. Lee P, Bubeck S, Petro J. Benefits, limits, and risks of GPT-4 as an AI chatbot for medicine. N Engl J Med. 2023;388:1233–9. 10.1056/NEJMsr2214184.
14. Nicolaes J, Tselenti E, Aouad T, et al. Performance analysis of a deep-learning algorithm to detect the presence of inflammation in MRI of sacroiliac joints in patients with axial spondyloarthritis. Ann Rheum Dis. Published Online First: 02 Oct 2024. 10.1136/ard-2024-225862.
15. Mazzucchelli R, Turrado-Crespí P, Crespí-Villarías N. AB1462 Comparative assessment of the accuracy and satisfaction of responses to e-consultations in rheumatology. Chat-GPT vs specialists (CORE-RC study). Ann Rheum Dis. 2024;83:2094.
16. Xu D, Zhao J, Liu R, et al. ChatGPT4's proficiency in addressing patients' questions on systemic lupus erythematosus: a blinded comparative study with specialists. Rheumatology (Oxford). 2024;63(9):2450–6. 10.1093/rheumatology/keae238. PMID: 38648756; PMCID: PMC11371377.
17. Swift A, Heale R, Twycross A. What are sensitivity and specificity? Evid Based Nurs. 2019;23(1):2–4.
Additional Declarations

No competing interests reported.

Supplementary Files

Suplementary1.pdf
Suplementary2.pdf

Status: Under Review

Editorial decision: Revision requested, 18 Aug, 2025
Reviews received at journal: 17 Aug, 2025
Reviewers agreed at journal: 14 May, 2025
Reviews received at journal: 12 May, 2025
Reviewers agreed at journal: 12 May, 2025
Reviewers invited by journal: 09 May, 2025
Editor assigned by journal: 02 May, 2025
Submission checks completed at journal: 02 May, 2025
First submitted to journal: 19 Apr, 2025
Also discoverable on Platform About Our Team In Review Editorial Policies Advisory Board Help Center Resources Author Services Accessibility API Access RSS feed Manage Cookie Preferences © Research Square 2026 | ISSN 2693-5015 (online) Privacy Policy Terms of Service Do Not Sell My Personal Information {"props":{"pageProps":{"initialData":{"identity":"rs-6484816","acceptedTermsAndConditions":true,"allowDirectSubmit":false,"archivedVersions":[],"articleType":"Research Article","associatedPublications":[],"authors":[{"id":455505875,"identity":"10a9affc-dade-4f0d-a042-dd4fe5d6d8f0","order_by":0,"name":"Lucas Gonçalves","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAAA8UlEQVRIiWNgGAWjYBACPmSOxMcGEMXYeACfFjZkjuTMBgYJoJYG4rVI84K1MDDg1yKR/PDjj182eQxihw/ett1hU6fbfhhoS41NNG4tacbSvH1pxQzSacnWuWfSJMzOJAK1HEvLbcCpJYdBmrHncGKDdI6ZdG7bYQmzA0AtjA2H8Wlh/vkTrCX/m7QlSMv5hwS1sEnw/ADbwibNCNJyg5AtPM/MrHkb0hLbpNOMLXvb0iS33QDakoDHL/zsyY9v/vhjk9gvnfzwxs82G36z8+kPH3yoscGpBQwY29AiiCEBn3Iw+ENQxSgYBaNgFIxkAABaPFxKBcP5kgAAAABJRU5ErkJggg==","orcid":"","institution":"Escola Bahiana de Medicina e Saúde Pública","correspondingAuthor":true,"prefix":"","firstName":"Lucas","middleName":"","lastName":"Gonçalves","suffix":""},{"id":455505876,"identity":"6cabdece-55c5-44b4-8c1b-c4daf7c99740","order_by":1,"name":"Carlos Antonio Moura","email":"","orcid":"","institution":"Hospital Santo Antônio – Obras Sociais Irmã Dulce","correspondingAuthor":false,"prefix":"","firstName":"Carlos","middleName":"Antonio","lastName":"Moura","suffix":""}],"badges":[],"createdAt":"2025-04-19 12:53:22","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-6484816/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-6484816/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":82706847,"identity":"c9fbeb62-9705-44c0-b2df-649c2a343e42","added_by":"auto","created_at":"2025-05-14 
10:40:52","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":672978,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-6484816/v1/fedbc46d-fd90-480a-b2ee-b375a0e0cdf9.pdf"},{"id":82706154,"identity":"4935163c-d8b6-499d-8b55-94d319096b10","added_by":"auto","created_at":"2025-05-14 10:32:47","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"supplement","size":2210822,"visible":true,"origin":"","legend":"","description":"","filename":"Suplementary1.pdf","url":"https://assets-eu.researchsquare.com/files/rs-6484816/v1/dd98bfd3d01ad88a04f3481a.pdf"},{"id":82706841,"identity":"f354b517-974f-48d9-b50b-46164d79ebb5","added_by":"auto","created_at":"2025-05-14 10:40:47","extension":"pdf","order_by":1,"title":"","display":"","copyAsset":false,"role":"supplement","size":421874,"visible":true,"origin":"","legend":"","description":"","filename":"Suplementary2.pdf","url":"https://assets-eu.researchsquare.com/files/rs-6484816/v1/4b8684c45285d94fe404a875.pdf"}],"financialInterests":"No competing interests reported.","formattedTitle":"ChatGPT Performance in a questionnaire on rheumatological diseases: A Comparison with Specialist’s Opinion","fulltext":[{"header":"Background","content":"\u003cp\u003eSince ancient Greece, the concept of intelligence has been closely associated with wisdom and its authentic connection to reality [\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e]. In recent times, however, the notion of intelligence has increasingly assumed a quantitative character, driven by the progressive mathematization of thought throughout the 20th century. 
This trend is reflected in the growing replacement of clinical judgment with statistical and actuarial approaches [\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e, \u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e], in which quantity has become the primary expression of reality. As a result, science\u0026mdash;including medicine\u0026mdash;has progressively adopted mathematical and statistical tools for risk assessment and decision-making in real-life contexts.\u003c/p\u003e \u003cp\u003eWith technological advancement, computational systems have acquired elements of \"intelligence,\" characterized by their ability to estimate probabilities and suggest courses of action based on input data, giving rise to what is now known as artificial intelligence (AI). Since then, various levels of AI have been developed [\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e], ranging from early systems such as the chess engine Deep Blue to more advanced models like the Generative Pretrained Transformer Chat (ChatGPT), a sophisticated architecture based on deep learning.\u003c/p\u003e \u003cp\u003eChatGPT refers to a family of AI models developed by OpenAI, grounded in a large language model (LLM) that has undergone a process called reinforcement learning from human feedback (RLHF), which is fundamentally dependent on human input [\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e]. This development allows ChatGPT to produce coherent, context-aware responses in natural language when prompted on a broad range of topics.\u003c/p\u003e \u003cp\u003eIn medicine, rheumatology is among the most complex specialties, as it primarily deals with subjective symptoms and clinical signs, while diagnostic tests often lack definitive specificity or sensitivity [\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e]. 
Consequently, several imprecise, score-based tools have been developed to reduce variability in clinical decision-making, such as the Systemic Lupus Erythematosus Disease Activity Index 2000 (SLEDAI-2K) [\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e], the Clinical Disease Activity Index (CDAI) [\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e], and the Disease Activity Score 28-joint count (DAS-28) [\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eAI-based tools have the potential to assist physicians in navigating complex clinical scenarios by providing structured and reproducible outputs. While previous studies have evaluated AI in contexts such as diagnostic imaging and medical education, its application to routine clinical decision-making in rheumatology remains underexplored. Given that ChatGPT enables interaction through structured, logical discourse akin to the dialectical method, we aimed to compare the answers provided by ChatGPT-4.0 with those of rheumatologists at different levels of experience, using a standardized questionnaire addressing systemic lupus erythematosus (SLE), rheumatoid arthritis (RA), ankylosing spondylitis (AS), psoriatic arthritis (PsA), and fibromyalgia (FM).\u003c/p\u003e"},{"header":"Methods","content":"\u003cp\u003eThis was a cross-sectional study designed to compare the performance of ChatGPT-4.0 with that of rheumatologists at varying levels of clinical experience. All participants were affiliated with academic medical services in Salvador, Bahia, Brazil, and addressed clinical, diagnostic, and therapeutic questions related to systemic lupus erythematosus (SLE), rheumatoid arthritis (RA), ankylosing spondylitis (AS), psoriatic arthritis (PsA), and fibromyalgia (FM). All participating physicians signed informed consent forms. 
The study received full approval from the relevant institutional ethics committees (CAAE 71172023.7.0000.5027) and was conducted entirely in Brazilian Portuguese.\u003c/p\u003e \u003cdiv id=\"Sec3\" class=\"Section2\"\u003e \u003ch2\u003eQuestionnaire Administration\u003c/h2\u003e \u003cp\u003eA standardized questionnaire of 25 questions (five for each of the five diseases) was developed (Table\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e). The questionnaire was administered to four groups of two rheumatologists each, stratified by clinical experience: Group A (\u0026lt;\u0026thinsp;5 years), Group B (5\u0026ndash;10 years), Group C (11\u0026ndash;20 years), and Group D (21\u0026ndash;30 years). The questionnaires were administered in person during February and March 2024. Participants were instructed in writing to \u0026ldquo;answer objectively.\u0026rdquo; The researchers provided no interventions or clarifications during the response process.\u003c/p\u003e \u003cp\u003eThe same set of questions was submitted to ChatGPT-4.0 in April 2024. 
To maintain consistency, each question was asked in a separate chat session and preceded by the instruction: \u0026ldquo;Answer without explanations.\u0026rdquo; The responses were not edited, except for the removal of explanatory phrases and clarifying parentheses; for instance, expressions such as \u0026ldquo;malar rash (butterfly-shaped rash)\u0026rdquo; or \u0026ldquo;anti-CCP antibody (cyclic citrullinated peptide)\u0026rdquo; were stripped of their parenthetical components.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab1\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eQuestions posed to ChatGPT 4.0 and rheumatologists\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"1\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eQuestion type 1: Name which sign or symptom, for you, is most useful in diagnosing [ SLE, RA, AS, PsA and FM] in clinical practice.\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eQuestion type 2: Name the two diagnoses that you consider most closely mimic [ SLE, RA, AS, PsA and FM ].\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eQuestion type 3: Name, if any, the radiological examination that you think best contributes to the diagnosis of [ SLE, RA, AS, PsA and FM].\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eQuestion type 4: Which laboratory test do you consider to have the greatest specificity for diagnosing [ SLE, RA, AS, PsA and FM 
]?\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eQuestion type 5: Considering pharmacological treatment, name your first therapeutic option for [ SLE, RA, AS, PsA and FM].\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003eEach of the five question types was applied to each of the five diseases, yielding a questionnaire of twenty-five questions. AS: Ankylosing spondylitis; FM: Fibromyalgia; PsA: Psoriatic arthritis; RA: Rheumatoid arthritis; SLE: Systemic lupus erythematosus. \u003cb\u003e*All questions were submitted in Portuguese.\u003c/b\u003e\u003c/p\u003e \u003c/div\u003e\n\u003ch3\u003eAssessment of Responses\u003c/h3\u003e\n\u003cp\u003eAll questionnaire responses were independently and blindly evaluated by two senior rheumatologists, each with over 30 years of clinical and academic experience. Each evaluator assigned a binary score to each response: 0 (disagree) or 1 (agree). In cases of disagreement between the two evaluators, a third rheumatologist served as an adjudicator to determine the final score. Thus, the total score for each completed questionnaire ranged from 0 (no agreement) to 25 (full agreement). All questions posed to both ChatGPT-4.0 and the rheumatologists were presented in Portuguese.\u003c/p\u003e \u003cdiv id=\"Sec5\" class=\"Section2\"\u003e \u003ch2\u003eStatistical Analysis\u003c/h2\u003e \u003cp\u003eDescriptive statistics included the calculation of frequencies and percentages. Mean scores were computed for each group. For example, if two rheumatologists in Group A received scores of 3 and 4 on the SLE questions, their mean score of 3.5 was used for comparison with the other groups and with the score obtained by ChatGPT-4.0. 
An overall mean score for all rheumatologist responses was also calculated.\u003c/p\u003e \u003cp\u003eCohen\u0026rsquo;s kappa coefficient was employed to assess the level of inter-rater agreement between expert reviewers regarding the responses provided by rheumatologists across different levels of experience. Additionally, probability analyses were performed to evaluate relative performance between groups. All statistical analyses were conducted using SPSS software (version 25), and graphical representations were generated using Microsoft Excel.\u003c/p\u003e \u003c/div\u003e"},{"header":"Results","content":"\u003cdiv id=\"Sec7\" class=\"Section2\"\u003e \u003ch2\u003eOverall data\u003c/h2\u003e \u003cp\u003eAll participants answered the 25 questions. Initially, when the questions were posed to ChatGPT 4.0 using only the \u0026ldquo;answer objectively\u0026rdquo; command (the same command given to the rheumatologists), the answers were long and verbose, so a more specific command, \u0026ldquo;answer without explanations,\u0026rdquo; was adopted; the answers obtained with it were used in the study (Supplementary File 1). Of the 225 responses evaluated (25 questions answered by ChatGPT 4.0 and by each of the 8 rheumatologists), 67 required evaluation by a third rheumatologist due to disagreements between the first two evaluators. The final scores attributed to each response evaluated in this study are available in Supplementary File 2.\u003c/p\u003e \u003cp\u003eDisagreements were analyzed according to disease, question type, and participant. They were most prevalent for AS (44%), Group C (34%), and type 2 questions (56%). 
More details are available in Table\u0026nbsp;\u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e2\u003c/span\u003e.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab2\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 2\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eQuestions that required a third evaluator\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"6\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eBy Question Type\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003e9 out of 45\u0026thinsp;\u0026minus;\u0026thinsp;20%\u003c/p\u003e \u003cp\u003e\u003cem\u003e(Type 1)\u003c/em\u003e\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003e25 out of 45\u0026thinsp;\u0026minus;\u0026thinsp;56%\u003c/p\u003e \u003cp\u003e\u003cem\u003e(Type 2)\u003c/em\u003e\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003e18 out of 45\u0026thinsp;\u0026minus;\u0026thinsp;40%\u003c/p\u003e \u003cp\u003e\u003cem\u003e(Type 3)\u003c/em\u003e\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003e9 out of 45\u0026thinsp;\u0026minus;\u0026thinsp;20%\u003c/p\u003e \u003cp\u003e\u003cem\u003e(Type 4)\u003c/em\u003e\u003c/p\u003e 
\u003c/th\u003e \u003cth align=\"left\" colname=\"c6\"\u003e \u003cp\u003e6 out of 45\u0026thinsp;\u0026minus;\u0026thinsp;13%\u003c/p\u003e \u003cp\u003e\u003cem\u003e(Type 5)\u003c/em\u003e\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eBy \u003cb\u003eDisease\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e15 out of 45\u0026thinsp;\u0026minus;\u0026thinsp;\u003cb\u003e33%\u003c/b\u003e\u003c/p\u003e \u003cp\u003e\u003cem\u003e(SLE)\u003c/em\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e6 out of 45\u0026thinsp;\u0026minus;\u0026thinsp;\u003cb\u003e13%\u003c/b\u003e\u003c/p\u003e \u003cp\u003e\u003cem\u003e(RA)\u003c/em\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e20 out of 45\u0026thinsp;\u0026minus;\u0026thinsp;\u003cb\u003e44%\u003c/b\u003e\u003c/p\u003e \u003cp\u003e\u003cem\u003e(AS)\u003c/em\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e15 out of 45\u0026thinsp;\u0026minus;\u0026thinsp;\u003cb\u003e33%\u003c/b\u003e\u003c/p\u003e \u003cp\u003e\u003cem\u003e(PsA)\u003c/em\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e11 out of 45\u0026thinsp;\u0026minus;\u0026thinsp;\u003cb\u003e24%\u003c/b\u003e\u003c/p\u003e \u003cp\u003e\u003cem\u003e(FM)\u003c/em\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eBy \u003cb\u003eParticipant\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e8 out of 25\u0026thinsp;\u0026minus;\u0026thinsp;\u003cb\u003e32%\u003c/b\u003e\u003c/p\u003e \u003cp\u003e\u003cem\u003e(ChatGPT 4.0)\u003c/em\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e14 out of 
50\u0026thinsp;\u0026minus;\u0026thinsp;\u003cb\u003e28%\u003c/b\u003e\u003c/p\u003e \u003cp\u003e\u003cem\u003e(Group A)\u003c/em\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e12 out of 50\u0026thinsp;\u0026minus;\u0026thinsp;\u003cb\u003e24%\u003c/b\u003e\u003c/p\u003e \u003cp\u003e\u003cem\u003e(Group B)\u003c/em\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e17 out of 50\u0026thinsp;\u0026minus;\u0026thinsp;\u003cb\u003e34%\u003c/b\u003e\u003c/p\u003e \u003cp\u003e\u003cem\u003e(Group C)\u003c/em\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e16 out of 50\u0026thinsp;\u0026minus;\u0026thinsp;\u003cb\u003e32%\u003c/b\u003e\u003c/p\u003e \u003cp\u003e\u003cem\u003e(Group D)\u003c/em\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003e \u003cb\u003e*\u003c/b\u003eType 1: \u0026ldquo;Name which sign or symptom, for you, is most useful in diagnosing \u0026hellip; in clinical practice.\u0026rdquo;; Type 2: \u0026ldquo;Name the two diagnoses that you consider most closely mimic \u0026hellip;.\u0026rdquo;; Type 3: \u0026ldquo;Name, if any, the radiological examination that you think best contributes to the diagnosis of \u0026hellip;.\u0026rdquo;; Type 4: \u0026ldquo;Which laboratory test do you consider to have the greatest specificity for diagnosing \u0026hellip;?\u0026rdquo;; Type 5: \u0026ldquo;Considering pharmacological treatment, name your first therapeutic option for \u0026hellip;.\u0026rdquo;\u003c/p\u003e \u003cp\u003e \u003cb\u003e*\u003c/b\u003eAS: Ankylosing spondylitis; FM: Fibromyalgia; PsA: Psoriatic arthritis; RA: Rheumatoid arthritis; SLE: Systemic lupus erythematosus.\u003c/p\u003e \u003cp\u003e \u003cb\u003e*\u003c/b\u003eChatGPT\u0026thinsp;=\u0026thinsp;Generative Pretrained Transformer Chat; Group 
A\u0026thinsp;=\u0026thinsp;Pair of Rheumatologists with less than 5 years of experience; Group B\u0026thinsp;=\u0026thinsp;Pair of Rheumatologists with 5 to 10 years of experience; Group C\u0026thinsp;=\u0026thinsp;Pair of Rheumatologists with 11 to 20 years of experience; Group D\u0026thinsp;=\u0026thinsp;Pair of Rheumatologists with 21 to 30 years of experience.\u003c/p\u003e \u003cp\u003eThe best overall performance (Table\u0026nbsp;\u003cspan refid=\"Tab3\" class=\"InternalRef\"\u003e3\u003c/span\u003e) was observed in Group B (70% probability of agreement with the evaluators), followed by ChatGPT 4.0 (68%). The lowest overall performance was observed in Group D (58%).\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec8\" class=\"Section2\"\u003e \u003ch2\u003eBy disease data\u003c/h2\u003e \u003cp\u003eIn the analysis of answer performance across different diseases (Table\u0026nbsp;\u003cspan refid=\"Tab3\" class=\"InternalRef\"\u003e3\u003c/span\u003e), ChatGPT 4.0 outperformed the experts' average (calculated as the simple mean of the scores from the 8 rheumatologists) in questions about SLE, PsA, and FM. For AS, there was no difference between ChatGPT 4.0 and the experts' average (both 60%). For RA, the experts' average was slightly higher (63% vs. 60%). Among the rheumatologist groups, those with 5\u0026ndash;10 years of experience (Group B) performed best overall, with the highest or tied-highest score for AS, PsA, and FM.\u003c/p\u003e \u003c/div\u003e\n\u003ch3\u003eBy question type data\u003c/h3\u003e\n\u003cp\u003eIn the analysis by question type (Table\u0026nbsp;\u003cspan refid=\"Tab3\" class=\"InternalRef\"\u003e3\u003c/span\u003e), ChatGPT 4.0 performed worst relative to the rheumatologists on type 1 questions, scoring 40% compared with an average of 78% across all rheumatologists (range across rheumatologist groups: 70% to 90%). ChatGPT 4.0 also performed below the experts\u0026rsquo; average on type 4 questions (40% vs. 50%). 
However, ChatGPT 4.0 had the best performance among all groups in type 3 and type 5 questions, with 100% agreement probability with the evaluators.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab3\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 3\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eChatGPT 4.0 and Specialists performance\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"7\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c7\" colnum=\"7\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eChatGPT 4.0\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eGroup A\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eGroup B\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003eGroup C\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c6\"\u003e \u003cp\u003eGroup D\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c7\"\u003e \u003cp\u003eExperts\u0026rsquo; Avg\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" 
colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eOverall\u003c/b\u003e (%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e68\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e66\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e70\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e60\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e58\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e63.5\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colspan=\"7\" nameend=\"c7\" namest=\"c1\"\u003e \u003cp\u003e\u003cem\u003eBy Disease Score (%)\u003c/em\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eSLE\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e80\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e80\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e70\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e60\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e60\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e68\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eRA\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e60\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e60\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e60\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e50\u003c/p\u003e 
\u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e80\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e63\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eAS\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e60\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e60\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e70\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e60\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e50\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e60\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003ePsA\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e60\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e60\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e70\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e50\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e30\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e53\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eFM\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e80\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e70\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e80\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e 
\u003cp\u003e80\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e70\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e75\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colspan=\"7\" nameend=\"c7\" namest=\"c1\"\u003e \u003cp\u003e\u003cem\u003eBy Question Type Score (%)\u003c/em\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eType 1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e40\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e80\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e90\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e70\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e70\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e78\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eType 2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e60\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e30\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e70\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e40\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e40\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e45\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eType 3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e100\u003c/p\u003e \u003c/td\u003e 
\u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e80\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e60\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e60\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e70\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e68\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eType 4\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e40\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e50\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e60\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e50\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e40\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e50\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eType 5\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e100\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e90\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e70\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e80\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e70\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e78\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003e \u003cb\u003e*T\u003c/b\u003eype 1: \u0026ldquo;Name which sign or symptom, 
for you, is most useful in diagnosing \u0026hellip; in clinical practice.\u0026rdquo;; Type 2: \u0026ldquo;Name the two diagnoses that you consider most closely mimic \u0026hellip;.\u0026rdquo;; Type 3: \u0026ldquo;Name, if any, the radiological examination that you think best contributes to the diagnosis of \u0026hellip;.\u0026rdquo;; Type 4: \u0026ldquo;Which laboratory test do you consider to have the greatest specificity for diagnosing \u0026hellip;?\u0026rdquo;; Type 5: \u0026ldquo;Considering pharmacological treatment, name your first therapeutic option for \u0026hellip;.\u0026rdquo;\u003c/p\u003e \u003cp\u003e \u003cb\u003e*\u003c/b\u003e AS: Ankylosing spondylitis; FM: Fibromyalgia; PsA: Psoriatic arthritis; RA: Rheumatoid arthritis; SLE: Systemic lupus erythematosus.\u003c/p\u003e \u003cp\u003e \u003cb\u003e*\u003c/b\u003e Avg\u0026thinsp;=\u0026thinsp;Average (the simple mean of the scores from the 8 rheumatologists); ChatGPT\u0026thinsp;=\u0026thinsp;Generative Pretrained Transformer Chat; Group A\u0026thinsp;=\u0026thinsp;Pair of Rheumatologists with less than 5 years of experience; Group B\u0026thinsp;=\u0026thinsp;Pair of Rheumatologists with 5 to 10 years of experience; Group C\u0026thinsp;=\u0026thinsp;Pair of Rheumatologists with 11 to 20 years of experience; Group D\u0026thinsp;=\u0026thinsp;Pair of Rheumatologists with 21 to 30 years of experience.\u003c/p\u003e"},{"header":"Discussion","content":"\u003cp\u003eTechnological advancements over the past century have significantly transformed medical knowledge and clinical practice. The complexity of modern medicine has introduced a growing number of variables beyond traditional clinical concerns, necessitating the incorporation of more quantitative approaches. As early as the second half of the 20th century, some authors proposed replacing clinical judgment with the use of mathematical formulas [\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e]. 
However, given the difficulties encountered in applying these formulas in practice, clinical judgment remains a cornerstone of medical practice, especially when interpreting the often-mechanistic results provided by statistical methods [\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eIn this context, ChatGPT-4.0 presents itself as a novel tool capable of facilitating doctor-machine interactions that resemble a dialectical process rather than a purely logical-formal one, delivering responses in a conversational, human-like manner. This positions the AI as a potential information source, where data is provided in a more organic and articulated fashion.\u003c/p\u003e \u003cp\u003ePrevious studies have tested the performance of AI tools like ChatGPT in the context of medical examinations [\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e], and the potential applications of AI in medicine have been widely discussed [\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e]. However, the clinical use of AI tools, particularly in the field of rheumatology, remains underexplored, and further research is needed. Some studies comparing AI models and specialists in practical rheumatology settings [\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e, \u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e, \u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e] have yielded results consistent with ours: AI demonstrates generally acceptable performance, but with notable areas for improvement.\u003c/p\u003e \u003cp\u003eOur findings highlighted a particular weakness of ChatGPT-4.0 when responding to questions about the most useful signs or symptoms for diagnosis, which require an understanding grounded in clinical experience. 
This contrasts with its strong performance on imaging and treatment questions. This discrepancy is likely due to the subjective nature of these diagnostic questions, which depend heavily on the clinician\u0026rsquo;s bedside experience in differentiating between possible disease manifestations. In evidence-based medicine, the terms sensitivity and specificity attempt, albeit imperfectly, to address this distinction [\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e]. This is especially critical in rheumatology, given the broad spectrum of disease manifestations and the overlap of symptoms across various conditions.\u003c/p\u003e \u003cp\u003eRegarding treatment choices and imaging studies, it is unsurprising that ChatGPT-4.0 performed better than the human experts; on laboratory test questions, by contrast, it scored below the experts\u0026rsquo; average (40% vs. 50%). Treatment and imaging decisions often follow established protocols and guidelines, to which ChatGPT presumably has access through its training data. Moreover, such decisions tend to be less subjective and less controversial than diagnostic judgments about which signs or symptoms are most useful. Therefore, it makes sense that the AI\u0026rsquo;s performance would be strong in these areas.\u003c/p\u003e \u003cp\u003eAnother intriguing finding in our study was the lower performance of the most experienced group (Group D). We do not attribute this to the actual length of experience of these specialists, but rather to the nature of the questions being evaluated. As rheumatologists accumulate more experience, their perspectives become more nuanced and complex, making seemingly \u0026ldquo;simple questions\u0026rdquo; more challenging. In many cases, this reflects a departure from basic, standardized concepts, as more experienced clinicians focus on complex or individualized approaches to rheumatology. Furthermore, this complexity may be compounded by reduced involvement in academic settings. 
This is underscored by the fact that the evaluating experts, who themselves had over 30 years of experience, frequently disagreed with the responses of Group D. The performance of the most experienced group therefore suggests a need for continuing medical education (CME) rather than a deficit in expertise.\u003c/p\u003e \u003cp\u003eOur study also has several limitations. One major issue is the variability in ChatGPT-4.0\u0026rsquo;s responses to the same questions. We minimized this variability by using a prompt that promoted more objective responses: with the original \u0026ldquo;answer objectively\u0026rdquo; command (the same one given to the human experts), the AI produced responses that were too verbose for the purposes of our evaluation, so the instruction was changed to \u0026ldquo;answer without explanations.\u0026rdquo; Another limitation is the small sample size, which restricts the generalizability of our findings and limits the ability to draw definitive statistical conclusions regarding the utility of ChatGPT-4.0 in rheumatology. Additionally, ChatGPT is subject to continuous updates and modifications, meaning that our study provides a snapshot of the performance of version 4.0 as of April 2024.\u003c/p\u003e \u003cp\u003eUltimately, the primary goal of our study was to describe the results obtained and highlight specific findings that could not only address existing questions but also serve as a springboard for future research. 
The performance of ChatGPT-4.0 in this study provides valuable insights and paves the way for more comprehensive evaluations of AI tools in the field of rheumatology.\u003c/p\u003e"},{"header":"Conclusion","content":"\u003cp\u003eChatGPT-4.0 demonstrated a strong overall performance, surpassing some groups of rheumatologists in certain areas. The AI particularly excelled in questions related to treatment options and other topics requiring less clinical experience and practical knowledge. However, it showed weaker performance in questions concerning signs and symptoms, which are more dependent on clinical experience. Despite these findings, this study represents only a preliminary assessment of ChatGPT's potential applications in the field of rheumatology. Further research is needed to explore its full capabilities and limitations in clinical settings.\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eConsent for publication\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eNot applicable\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAvailability of data and materials\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eEthics approval and consent to participate\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eNot applicable.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCompeting interests\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe authors declare that they have no competing interests\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eFunding\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe authors have not declared a specific grant for this research from any funding agency in the public, commercial or non-profit sectors.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAuthors\u0026apos; contributions\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eLucas 
Gon\u0026ccedil;alves \u0026ndash; Data collection, data analysis and article writing. Carlos Ant\u0026ocirc;nio Moura \u0026ndash; Article writing and article review.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAcknowledgements\u0026nbsp;\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eNot applicable\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eAristotle. De Anima (On the Soul). Penguin UK; 1986.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSternberg RJ. Intelligence. Dialogues Clin Neurosci. 2012;14(1):19\u0026ndash;27.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eJaime-Ram\u0026iacute;rez-Barba E, Vargas-Salado E, Dom\u0026iacute;nguez-Garibaldi F, et al. Medicina, matem\u0026aacute;ticas y estad\u0026iacute;stica. Aspectos pr\u0026aacute;cticos. Bol Med Hosp Infant Mex. 1990;47(10):689\u0026ndash;93.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSITNFlash. The History of Artificial Intelligence. Science in the News. 2017. [accessed on 07 August]. Available from: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://sitn.hms.harvard.edu/flash/2017/history-artificial-intelligence/\u003c/span\u003e\u003cspan address=\"https://sitn.hms.harvard.edu/flash/2017/history-artificial-intelligence/\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eIntroducing ChatGPT. OpenAI. [accessed on 07 August]. Available from: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://openai.com/index/chatgpt/\u003c/span\u003e\u003cspan address=\"https://openai.com/index/chatgpt/\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMoura CA, Moura C. Geral. What is the essence of being a rheumatologist? The Rheumatologist. 
2023 May 9.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGladman DD, Ibanez D, Urowitz MB. Systemic lupus erythematosus disease activity index 2000. J Rheumatol. 2002;29(2):288\u0026ndash;91.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAletaha D, Nell VP, Stamm T, et al. Acute phase reactants add little to composite disease activity indices for rheumatoid arthritis: validation of a clinical activity score. Arthritis Res Ther. 2005;7(4):R796\u0026ndash;806.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eVan der Heijde DM, Jacobs JW. The original DAS and the DAS28 are not interchangeable: comment on the articles by Prevoo. Arthritis Rheum. 1998;41(5):942\u0026ndash;5. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1002/1529-0131\u003c/span\u003e\u003cspan address=\"10.1002/1529-0131\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMeehl PE. When shall we use our heads instead of the formula? J Couns Psychol. 1957;4(4):268\u0026ndash;73.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eFaust D, et al. Response: Clinical and Actuarial Judgment. Science. 1990;247(4939):146\u0026ndash;7. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1126/science.247.4939.146.b\u003c/span\u003e\u003cspan address=\"10.1126/science.247.4939.146.b\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKung TH, Cheatham M, Medenilla A, et al. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLOS Digit Health. 2023;2(2):e0000198.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLee P, Bubeck S, Petro J. Benefits, limits, and risks of GPT-4 as an AI chatbot for medicine. N Engl J Med. 2023;388:1233\u0026ndash;9. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1056/NEJMsr2214184\u003c/span\u003e\u003cspan address=\"10.1056/NEJMsr2214184\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eNicolaes J, Tselenti E, Aouad T, et al. Performance analysis of a deep-learning algorithm to detect the presence of inflammation in MRI of sacroiliac joints in patients with axial spondyloarthritis. Ann Rheum Dis. Published Online First: 02 Oct 2024. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1136/ard-2024-225862\u003c/span\u003e\u003cspan address=\"10.1136/ard-2024-225862\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMazzucchelli R, Turrado-Cresp\u0026iacute; P, Cresp\u0026iacute;-Villar\u0026iacute;as N. AB1462 Comparative assessment of the accuracy and satisfaction of responses to e-consultations in rheumatology. Chat-GPT vs specialists (CORE-RC study). Ann Rheum Dis. 2024;83:2094.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eXu D, Zhao J, Liu R, et al. ChatGPT4's proficiency in addressing patients' questions on systemic lupus erythematosus: a blinded comparative study with specialists. Rheumatology (Oxford). 2024;63(9):2450\u0026ndash;6. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1093/rheumatology/keae238\u003c/span\u003e\u003cspan address=\"10.1093/rheumatology/keae238\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e. PMID: 38648756; PMCID: PMC11371377.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSwift A, Heale R, Twycross A. What are sensitivity and specificity? Evid Based Nurs. 
2019;23(1):2\u0026ndash;4.\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":false,"highlight":"","institution":"","isAcceptedByJournal":true,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"advances-in-rheumatology","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"adrh","sideBox":"Learn more about [Advances in Rheumatology](https://advancesinrheumatology.biomedcentral.com/)","snPcode":"42358","submissionUrl":"https://submission.springernature.com/new-submission/42358/3","title":"Advances in Rheumatology","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"stoa","reportingPortfolio":"BMC/SO AJ","inReviewEnabled":true,"inReviewRevisionsEnabled":true},"keywords":"ChatGPT, Rheumatology, Artificial-Intelligence, Diagnosis, informatics","lastPublishedDoi":"10.21203/rs.3.rs-6484816/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-6484816/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003ch2\u003eBackground\u003c/h2\u003e \u003cp\u003eThis study aims to compare the performance of Generative Pretrained Transformer Chat 4.0 (ChatGPT 4.0) with rheumatologists of varying experience levels in a questionnaire on systemic lupus erythematosus (SLE), rheumatoid arthritis (RA), ankylosing spondylitis (AS), psoriatic arthritis (PsA), and fibromyalgia (FM).\u003c/p\u003e\u003ch2\u003eMethods\u003c/h2\u003e \u003cp\u003eIn this cross-sectional study, a 25-question questionnaire (five questions per disease) was administered to ChatGPT 4.0 and four pairs of rheumatologists 
with different experience levels (less than 5 years, 5\u0026ndash;10 years, 11\u0026ndash;20 years, and 21\u0026ndash;30 years). Two rheumatologists with more than 30 years of experience, both affiliated with academic services, blindly evaluated the responses as \"agree\" or \"disagree\". Where the evaluators disagreed on a question, a third rheumatologist resolved the dispute.\u003c/p\u003e\u003ch2\u003eResults\u003c/h2\u003e \u003cp\u003eThe group with 5\u0026ndash;10 years of experience had the best overall performance, with a 70% agreement probability with the evaluators, followed by ChatGPT 4.0 at 68%. The group with 21\u0026ndash;30 years of experience had the worst performance (58%). ChatGPT 4.0 outperformed all other groups in questions regarding the first treatment option and the most effective imaging exams for investigation (100% in both). However, it had the poorest performance in identifying the most useful sign or symptom for diagnosing each disease.\u003c/p\u003e\u003ch2\u003eConclusions\u003c/h2\u003e \u003cp\u003eChatGPT 4.0 excelled in areas requiring less practical knowledge, such as treatment choices and diagnostic imaging exams. 
Conversely, it performed poorly in questions necessitating experience-based knowledge, particularly in identifying key diagnostic signs and symptoms.\u003c/p\u003e","manuscriptTitle":"ChatGPT Performance in a questionnaire on rheumatological diseases: A Comparison with Specialist’s Opinion","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2025-05-14 10:32:42","doi":"10.21203/rs.3.rs-6484816/v1","editorialEvents":[{"type":"communityComments","content":0},{"type":"decision","content":"Revision requested","date":"2025-08-18T17:09:56+00:00","index":"","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2025-08-17T15:45:05+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"285513445577686513793545017554014963334","date":"2025-05-14T21:10:22+00:00","index":"hide","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2025-05-12T16:35:54+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"145128473629026555592855871951088027367","date":"2025-05-12T12:27:10+00:00","index":"hide","fulltext":""},{"type":"reviewersInvited","content":"","date":"2025-05-09T15:10:28+00:00","index":"","fulltext":""},{"type":"editorAssigned","content":"","date":"2025-05-02T07:20:03+00:00","index":"","fulltext":""},{"type":"checksComplete","content":"","date":"2025-05-02T07:17:30+00:00","index":"","fulltext":""},{"type":"submitted","content":"Advances in Rheumatology","date":"2025-04-19T12:46:20+00:00","index":"","fulltext":""}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"advances-in-rheumatology","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"adrh","sideBox":"Learn more about [Advances in Rheumatology](https://advancesinrheumatology.biomedcentral.com/)","snPcode":"42358","submissionUrl":"https://submission.springernature.com/new-submission/42358/3","title":"Advances in 
Rheumatology","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"stoa","reportingPortfolio":"BMC/SO AJ","inReviewEnabled":true,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"58afc03e-2eba-489d-8f3d-451e7a827226","owner":[],"postedDate":"May 14th, 2025","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"under-review","subjectAreas":[],"tags":[],"updatedAt":"2026-04-02T17:38:57+00:00","versionOfRecord":[],"versionCreatedAt":"2025-05-14 10:32:42","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-6484816","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-6484816","identity":"rs-6484816","version":["v1"]},"buildId":"8U1c8b4HqxoKbykW_rLl7","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}

Source provenance

europepmc: last seen: 2026-05-20T01:45:00.602351+00:00
unpaywall: last seen: 2026-05-23T02:00:01.238055+00:00

License: CC-BY-4.0