Validity and reliability of AI chatbots on the comparative diagnosis and definitive management of deep caries based on position statements evaluated from post-graduate students and clinicians' perspectives

doi:10.21203/rs.3.rs-8320702/v1


2026 · doi:10.21203/rs.3.rs-8320702/v1

preprint OA: closed


Research Article

Ashish J Johnson, Tarun Kumar Singh, R Periyasamy, Aakash Gupta, and 1 more

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-8320702/v1

This work is licensed under a CC BY 4.0 License

Status: Under Review

Version 1 posted 20

Abstract

Aim: This study aimed to evaluate the validity and reliability of prominent AI chatbots (ChatGPT, Perplexity, Claude, and Gemini) in the comparative diagnosis and definitive management of deep caries, guided by global position statements from endodontic organizations, as assessed by post-graduate students and clinicians.

Methods: Four AI chatbots (ChatGPT, Perplexity, Claude, and Gemini) were accessed through their respective APIs using pro versions. Ten short case histories representing a spectrum of deep caries scenarios, along with corresponding position statements from the European Society of Endodontology, American Association of Endodontists, Indian Endodontic Society and others, were provided to each chatbot. Chatbots were prompted to generate diagnostic and management responses, and each prompt was repeated thrice per case per chatbot. Responses were evaluated by two postgraduate students and three senior clinicians using a 5-point Likert scale and an adapted Global Quality Score (GQS) for validity, and Cronbach’s alpha for reliability. Statistical analysis included low- and high-threshold validity tests and intergroup reliability comparisons.

Conclusion: Perplexity exhibited the highest reliability and validity in deep caries diagnosis and management compared to ChatGPT, Claude, and Gemini. While Perplexity, Claude, and Gemini demonstrated perfect or near-perfect validity at low-threshold criteria, only Perplexity maintained moderate validity at high-stringency levels. Overall variability and reduced descriptive depth across all chatbot outputs highlight current limitations for clinical implementation. AI chatbots may serve as useful educational or adjunctive tools but cannot substitute professional judgment in endodontic diagnosis and treatment.
Future development should focus on enhancing performance mechanisms and regulatory oversight to support clinical accuracy and reliability.

Introduction

Deep caries management represents one of the most rapidly evolving areas of contemporary restorative and endodontic practice. Two primary factors have driven this progress: the development of enhanced biomaterials and a deeper understanding of pulpal biology. Together they have replaced conventional invasive treatments with biologically based, minimally invasive therapies that retain pulp vitality to the greatest extent possible [1]. Vital pulp therapy (VPT), previously seen as having limited applications in adult endodontics, is now a validated, evidence-based treatment in suitably selected cases. Global restorative and endodontic organisations have established clinical practice guidelines and position statements to guide practitioners in this domain. They differ in minor respects, but all articulate the same ultimate objective: to safeguard the pulp and maintain the vitality of the tooth [2]. The American Association of Endodontists (AAE) has developed a standardised diagnostic language and clinical guidelines that serve as the benchmark for care [3]. The Indian Endodontic Society (IES) advocates biologically oriented techniques and endorses treatment methods aimed at pulp preservation whenever possible [4]. The European Society of Endodontology (ESE) has established consensus standards that underscore pulp preservation techniques, including the incremental treatment of extensive caries [5], and standards on pulpal diagnosis [6, 7]. The Australian [8, 9] and German [10] position statements predominantly conform to these principles, while setting only minimal parameters for defining the diagnostic threshold, the timing of intervention, and the measurement of therapeutic outcomes.
Wolters' method introduces an additional level of evidence to the categorisation of pulpal and periapical disorders. It not only standardises diagnoses but also promotes uniformity across clinical and research settings globally [11]. Despite the numerous guidelines, a significant problem persists: individuals base their diagnosis on clinical experience rather than on the evidence. Different clinicians and students may approach identical situations differently depending on the weight they give to radiographic findings, case history, or the diagnostic method employed. These differences necessitate thorough evaluation, judicious case selection, and individualized treatment plans. Global discourse among endodontists indicates that the management of deep caries is not merely a technical concern but a complex process rooted in biological decision-making. The creation of shared guidelines suggests a shift towards consensus; nonetheless, the remaining variability indicates that clinical judgment remains fundamental to patient-centered treatment.

AI chatbots have revolutionized digital communication by markedly improving the quality of human connection. Using deep learning techniques, these chatbots are trained on vast datasets and continually improve the precision and relevance of their answers by means of networks that simulate those of the human brain. Recent years have seen a growing dependence on AI chatbots as an information source for patients and for clinicians in the medical domain. This has been studied in endodontics [12, 13] and dental trauma [14]. However, investigations have demonstrated the prevalence of misinformation in some domains of medical sciences. Although all chatbots operate interactively with human inputs and strive to deliver optimal feedback, concerns have emerged over the validity and reliability of their responses, especially in medical domains.
AI-based chatbots have gained popularity among students and clinicians as decision-making aids when the diagnosis, the appropriate treatment plan, or the treatment facilities are uncertain. The management of deep caries, as a developing concept, has consistently made it difficult to select the appropriate diagnosis. This study seeks to assess the reliability and validity of AI chatbots in the comparative diagnosis of cases based on various scenarios, and in the corresponding treatment plans according to the position statements on deep caries management issued by different global organisations, from the perspectives of postgraduate students and endodontists.

Methodology

Data collection

The primary application programming interface (API) of each chatbot was used to simulate real-world interactions more accurately. For ChatGPT 5, the API was accessed through the URL https://chat.openai.com/ using the Pro version, with all questions and answers obtained on the same date. The Perplexity API was accessed via https://www.perplexity.ai/, with all questions asked on the same day. The Google Gemini API was accessed through https://gemini.google.com/ using the Google Chrome browser, and all questions were asked on the same date. The Claude AI API was accessed via https://claude.ai/chats/, with all questions asked on the same day. The Pro version of every chatbot was used to obtain quality answers and to avoid the usage limits of the free versions. Ten short case histories were fabricated, covering scenarios from minimally invasive restorative treatment to deep caries management, with radiographic findings. The case histories are available in Supplemental Table 1.
All the position statements on deep caries management, including those of the AAE [3], ESE [5], IES [4], AES [8, 9] and German [10] endodontic societies, together with the Wolters method [11], which provides additional evidence for the categorisation of pulpal diagnosis, were uploaded to each chatbot before any prompts were given. Each of the four AI chatbots was then asked to provide a comparative diagnosis and management plan for the 10 case histories based on the position statements previously uploaded. To assess validity and reliability, each question was asked three times. A new chat session was created for each question on each chatbot, and the “More balanced” conversation style was selected for the responses. The questions were asked one after another on the same chatbot, each after the previous response had been produced. The three responses for each case history were recorded and used for scoring validity and reliability.

Before the chatbot responses were evaluated, journal clubs and group discussions were held on each position statement to avoid investigator bias among the clinicians and the postgraduate students. All responses were assessed by two endodontic postgraduate students and three senior clinicians, including two professors (A.J.J., P.R., T.K.S, A.G., I.G.), on a 5-point Likert scale. A revised iteration of the Global Quality Score (GQS) [15] was used to assign ratings based on the responses' "content" and "context."

• Score 5 (Strongly Agree): The answer is correct, and the content is comprehensive.
• Score 4 (Agree): The answer is correct and most of the content is correct, but it lacks information or contains incorrect information.
• Score 3 (Neutral): The answer is somewhat correct, but details are primarily incorrect, missing, or irrelevant.
• Score 2 (Disagree): The answer is incorrect, but the content includes some correct elements.
• Score 1 (Strongly Disagree): The answer and the entire content are incorrect or irrelevant.
Once scoring was complete, the scores were used for the validity analysis. The reliability evaluation conducted by the examiners was coded on a binary scale (0 and 1) and analysed using Cronbach's alpha [16].

Analysis of Validity

The responses were scored and then classified as "valid" or "invalid," reflecting how well each chatbot's response matched the intended response. Two validity tests were employed: a low-threshold test and a high-threshold test. In the low-threshold test, the threshold score was 4: a case was considered valid only if all three of the chatbot's responses scored at least 4, and any score under 4 rendered it invalid. In the high-threshold test, the threshold score was 5: a case was deemed valid only if all three responses obtained a perfect score of 5, and any score below 5 rendered it invalid. The validity rates of the different chatbots were compared using the Fisher exact test.

Analysis of reliability

Reliability denotes the consistency with which the chatbot produces similar responses under the same conditions. Cronbach's alpha was computed for each trio of replies across the 10 case histories to examine response consistency. Cronbach's alpha is reported on a scale ranging from 0 to 1, where 1 signifies perfect reliability and 0 denotes complete unreliability. A high alpha coefficient indicates that the chatbot frequently delivered comparable responses, signifying substantial reliability. A low alpha coefficient signifies reduced consistency, indicating lower reliability.

Results

Each of the four chatbots responded to the ten short case histories, each posed three times, giving a total of 120 responses for differential diagnosis based on the position statements. All the responses were recorded in Supplemental Table 2.
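To illustrate the reliability statistic named in the Methodology, the standard Cronbach's alpha formula, alpha = k/(k-1) × (1 − sum of item variances / variance of the totals), can be sketched in Python. The data layout (one row per case history, one column per repeated response), the function name, and the scores below are assumptions chosen only for illustration; they are not the study's data, and the study's own computation may have differed.

```python
from statistics import variance

def cronbach_alpha(rows):
    """Cronbach's alpha for a list of rows, each holding k item scores.

    Assumed (hypothetical) layout: one row per case history, one column
    per repeated response of the same chatbot.
    """
    k = len(rows[0])
    if k < 2 or any(len(r) != k for r in rows):
        raise ValueError("every row needs the same number (>= 2) of items")
    # Sample variance of each item (column) across the rows
    item_vars = [variance([r[i] for r in rows]) for i in range(k)]
    # Sample variance of the row totals
    total_var = variance([sum(r) for r in rows])
    if total_var == 0:
        raise ValueError("row totals have zero variance; alpha is undefined")
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

# Hypothetical scores: identical repetitions in every case
identical = [[5, 5, 5], [4, 4, 4], [3, 3, 3], [5, 5, 5]]
# Hypothetical scores with some disagreement between repetitions
mixed = [[5, 4, 5], [4, 4, 3], [2, 3, 3], [5, 5, 4]]

alpha_identical = cronbach_alpha(identical)
alpha_mixed = cronbach_alpha(mixed)
print(round(alpha_identical, 3), round(alpha_mixed, 3))
```

In this sketch, repetitions that agree perfectly give an alpha of 1, and disagreement between repetitions lowers the coefficient.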
During the repeated prompts, the chatbots' descriptive explanations became shorter after the first response, which made validity questionable for some of the more confusing case histories, notably the 6th.

The mean scores for ChatGPT, Perplexity, Claude, and Gemini were compared across the ten evaluation questions. Gemini and Perplexity achieved higher mean scores for most questions, frequently scoring above 4, while ChatGPT and Claude generally had slightly lower scores (Figure 1). Gemini performed strongly, particularly on questions 4, 5, and 7, where its mean scores reached 4.33 to 4.67, indicating consistently favourable ratings. Perplexity also performed well, with its highest mean score of 5 on questions 4, 5, and 8. ChatGPT’s mean scores ranged from 2.33 to 4.67, with lower averages on questions 2, 4, and 6, while Claude’s means varied from 2.67 to 5, peaking at question 7 but dipping on some others. Overall, Perplexity and Gemini frequently outperformed the others on these measures, with ChatGPT and Claude showing slightly more variability.

Low-Threshold Validity Test

In the low-threshold validity test, Perplexity, Claude, and Gemini each produced 100% valid responses, with no invalid responses across the evaluation criteria. These models therefore met the validity requirement for every test item included in the assessment. In contrast, ChatGPT produced valid responses in 70% of cases and invalid responses in 30%, a noticeably lower validity rate than its peers under the same low-threshold conditions (Figure 2).
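The low- and high-threshold rules can be sketched as follows. This is a minimal illustration, assuming each case history yields three scores (one per repetition) for a given chatbot; the function names and example scores are hypothetical and are not taken from the study's data.

```python
def is_valid(scores, threshold):
    """A case is valid only if every repeated response meets the threshold."""
    return all(score >= threshold for score in scores)

def validity_rate(cases, threshold):
    """Fraction of cases classified as valid at the given threshold."""
    return sum(is_valid(scores, threshold) for scores in cases) / len(cases)

# Hypothetical scores for four case histories, three repetitions each
example_cases = [
    [5, 5, 5],  # valid at both thresholds
    [4, 5, 4],  # valid at the low threshold only
    [3, 4, 5],  # invalid at both thresholds (one score below 4)
    [5, 4, 5],  # valid at the low threshold only
]

low_rate = validity_rate(example_cases, threshold=4)   # low-threshold test
high_rate = validity_rate(example_cases, threshold=5)  # high-threshold test
print(low_rate, high_rate)
```

Because a single low score invalidates the whole trio, the high-threshold rate can never exceed the low-threshold rate for the same set of cases.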
The findings highlight a clear performance distinction, with Perplexity, Claude, and Gemini exhibiting perfect validity rates while ChatGPT fell short, suggesting that the latter may require further improvement or review when assessed for low-threshold response validity.

The intergroup comparison of low-threshold validity among the four chatbots gave further insight into their relative performance in generating valid responses. A 100% validity rate was observed for Perplexity, Claude, and Gemini, indicating that these chatbots consistently produced responses scoring 4 or higher across all three answer attempts per question. Conversely, ChatGPT showed a notably lower low-threshold validity rate of 70%, with three questions falling below the validity threshold. Pairwise comparisons showed a significant difference between Perplexity and ChatGPT (P=0.049), suggesting that Perplexity’s responses were significantly more valid under this criterion. Similarly, ChatGPT’s performance was significantly poorer than that of Claude and Gemini, with P-values of 0.049 in both comparisons. However, comparisons among Perplexity, Claude, and Gemini did not reveal any statistically significant differences, reflecting comparable validity profiles at the low-threshold level for these three chatbots (Table 1). These findings underscore variability in chatbot response accuracy at a less stringent threshold, with ChatGPT less consistent in meeting the validity cut-off. The high validity rates for Perplexity, Claude, and Gemini demonstrate robust performance in generating acceptable answers across question repetitions, a desirable trait for medical or scientific chatbot applications.
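For readers who want to see how a pairwise comparison of validity counts works, the sketch below implements a two-sided Fisher exact test on a 2x2 table using only the Python standard library. The function name is hypothetical. The table uses the case-level counts reported above (10 valid and 0 invalid versus 7 valid and 3 invalid) purely as example input; the text does not state the unit of analysis or the sidedness of the test that was used, so the value computed here is not expected to match the reported P-values.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell,
        # with all row and column totals held fixed
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    p_obs = prob(a)
    # Sum the probabilities of every table no more likely than the observed one
    # (the small tolerance guards against floating-point ties)
    return sum(p for p in (prob(x) for x in range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-9))

# Rows: two chatbots; columns: valid cases, invalid cases
p_value = fisher_exact_two_sided(10, 0, 7, 3)
print(round(p_value, 4))
```

With only ten cases per chatbot, such a test has few possible outcomes, so the choice of counting unit (cases or individual responses) strongly affects the p-value obtained.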
Table 1: Intergroup comparison of low-threshold validity (* significant; x non-significant)

              Perplexity     ChatGPT        Claude         Gemini
Perplexity    -              P=0.049 (*)    P=1.000 (x)    P=1.000 (x)
ChatGPT       P=0.049 (*)    -              P=0.049 (*)    P=0.049 (*)
Claude        P=1.000 (x)    P=0.049 (*)    -              P=1.000 (x)
Gemini        P=1.000 (x)    P=0.049 (*)    P=1.000 (x)    -

High-Threshold Validity

In the high-threshold validity test, all chatbots had considerable difficulty producing valid responses. Perplexity performed best with 30% valid responses, although the majority (70%) were invalid. Claude achieved only 10% validity, with 90% invalid responses, indicating limited accuracy under the more stringent criterion. ChatGPT and Gemini struggled the most, with no valid responses and all their answers classified as invalid. These results suggest that as the validity threshold increases, the ability of these chatbots to provide accurate responses diminishes markedly, with Perplexity performing best and ChatGPT and Gemini facing the greatest difficulty in meeting high-threshold validity standards.

The intergroup comparison of high-threshold validity among the four chatbots revealed a general decline in validity rates compared with the low-threshold criterion, with more variable performance across the groups.
Perplexity achieved the highest validity rate of 30%, followed by Claude with 10%, while both ChatGPT and Gemini had validity rates of 0%. Statistical analysis showed a significant difference between Perplexity and ChatGPT (P=0.049) and between Perplexity and Gemini (P=0.049), indicating that Perplexity's responses were significantly more valid at this stricter criterion. In contrast, comparisons between Perplexity and Claude (P=0.263), ChatGPT and Claude (P=0.304), ChatGPT and Gemini (P=1.000), and Claude and Gemini (P=0.304) were not statistically significant, reflecting no meaningful difference in performance among these pairs (Table 2). These results suggest that while Perplexity maintained relatively better high-threshold validity, the other chatbots struggled to produce perfect responses across all three repetitions, indicating challenges in achieving the highest level of accuracy.

Table 2: Intergroup comparison of high-threshold validity (* significant; x non-significant)

              Perplexity     ChatGPT        Claude         Gemini
Perplexity    -              P=0.049 (*)    P=0.263 (x)    P=0.049 (*)
ChatGPT       P=0.049 (*)    -              P=0.304 (x)    P=1.000 (x)
Claude        P=0.263 (x)    P=0.304 (x)    -              P=0.304 (x)
Gemini        P=0.049 (*)    P=1.000 (x)    P=0.304 (x)    -

Reliability

Perplexity demonstrated the highest mean reliability score of 0.90 with a standard deviation of 0.31, indicating strong and consistent reliability across measurements. Claude followed with a mean reliability of 0.80 and an SD of 0.48, suggesting some inconsistency in its performance. Gemini had a moderate mean reliability of 0.70 with an SD of 0.52, reflecting more fluctuation than Perplexity and Claude. ChatGPT exhibited the lowest mean reliability at 0.50, with an SD of 0.42, indicating the least consistent responses among the chatbots assessed (Figure 4).
Overall, Perplexity appears to be the most reliable chatbot with consistent performance, while ChatGPT demonstrated the least reliability in this comparison.

The intergroup comparison of reliability among the four chatbots revealed differences in the consistency of their responses, one of which was statistically significant. Perplexity exhibited the highest mean reliability score of 0.90 with a standard deviation of 0.31, indicating a stable and consistent response pattern. ChatGPT showed a notably lower mean reliability of 0.50 (SD = 0.42), suggesting less consistent responses across repeated questions. Claude and Gemini had intermediate mean reliability scores of 0.80 and 0.70, respectively, with relatively higher standard deviations (0.48 and 0.52), indicating moderate consistency. The difference in reliability between Perplexity and ChatGPT was significant (P = 0.026), highlighting Perplexity’s superior consistency. However, comparisons between Perplexity and Claude (P = 0.586) and between Perplexity and Gemini (P = 0.310) were not statistically significant, reflecting comparable reliability among these groups. Similarly, reliability differences between ChatGPT and Claude (P = 0.154), ChatGPT and Gemini (P = 0.354), and Claude and Gemini (P = 0.660) were not significant, indicating no clear superiority among these pairs (Table 3). These findings suggest that Perplexity provides the most reliable and consistent responses among the chatbots evaluated. The variability in reliability among the other chatbots, particularly ChatGPT, may affect their dependability in applications where consistent response quality is critical.
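The reported means and standard deviations can be related to the binary (0/1) coding described in the Methodology. As a sketch, and assuming (this is not stated explicitly in the text) that each of the ten case histories received a single 0/1 reliability code per chatbot, a mean of 0.90 corresponds to nine cases coded 1 and one case coded 0:

```python
from statistics import mean, stdev

# Assumed (hypothetical) coding: 1 = consistent across repetitions, 0 = not
codes = [1] * 9 + [0]

m = mean(codes)
sd = stdev(codes)  # sample standard deviation (n - 1 in the denominator)
print(round(m, 2), round(sd, 3))
```

Under this assumption the sample standard deviation is about 0.32, close to the 0.31 reported for Perplexity.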
Table 3: Intergroup reliability comparison (* significant; x non-significant)

              Perplexity     ChatGPT        Claude         Gemini
Perplexity    -              P=0.026 (*)    P=0.586 (x)    P=0.310 (x)
ChatGPT       P=0.026 (*)    -              P=0.154 (x)    P=0.354 (x)
Claude        P=0.586 (x)    P=0.154 (x)    -              P=0.660 (x)
Gemini        P=0.310 (x)    P=0.354 (x)    P=0.660 (x)    -

Discussion

AI chatbots have emerged as a prominent source of information in recent times [13]. Their integration into the medical sector has enhanced resource efficiency and reduced the need for substantial labour, increasing the accessibility of medical information to the public, students and clinicians [17, 18]. The present study set out to evaluate the validity and reliability of AI chatbot responses used for the comparative diagnosis and definitive management of deep caries, guided by position statements from leading global endodontic organizations. Our findings indicate marked variations among the four evaluated chatbots (ChatGPT, Perplexity, Claude, and Gemini) in both low-threshold and high-threshold validity tests as well as in reliability measures, which have important implications for integrating such AI tools into clinical practice for deep caries management [12, 19]. Reaching a definitive diagnosis in deep caries management is difficult, particularly when vital pulp therapy is indicated. In the low-threshold validity test, where a case was deemed valid if all three repeated responses scored at least 4 on a 5-point scale, Perplexity, Claude, and Gemini each produced 100% valid responses, whereas ChatGPT achieved only a 70% validity rate.
This discrepancy suggests that, although the majority of the chatbots are capable of retrieving and synthesizing guideline-based information, there remains a substantial risk that certain platforms, most notably ChatGPT, deliver responses that fail to meet even minimal clinical standards [20].

Under conditions that demanded complete perfection (a high-threshold validity test in which only cases scoring a perfect 5 on all three attempts were considered valid), the performance of the evaluated chatbots deteriorated sharply. Perplexity achieved a 30% validity rate under these stringent standards, while both ChatGPT and Gemini recorded 0% valid responses, and Claude fell short with merely 10% valid responses. These findings highlight the difficulties AI systems encounter when adhering to rigorous criteria of clinical accuracy and comprehensiveness, especially in intricate diagnostic situations like deep caries management [21, 22]. Case history number 6 challenged the validity of all the chatbots on every response and illustrates the limits of their diagnostic ability.

With respect to reliability, measured using Cronbach’s alpha for repeated responses to the same prompts (case histories), Perplexity demonstrated the highest consistency, with a mean reliability score of 0.90 and a modest standard deviation, indicating that its responses remained highly stable across trials. In contrast, ChatGPT exhibited a mean reliability of merely 0.50 with a higher standard deviation, signaling significant inconsistency in its performance. Claude and Gemini fell between these extremes, with mean reliability scores of 0.80 and 0.70 respectively, albeit with greater variance than Perplexity. This inter-platform variability in reliability is likely attributable to differences in model architecture, training data, and response generation algorithms, all of which influence the capacity of an AI system to align consistently with established clinical guidelines [20, 23].
Position statements and guidelines provided by authoritative bodies like the American Association of Endodontists (AAE), the Indian Endodontic Society (IES), and the European Society of Endodontology (ESE) emphasize minimally invasive, biologically oriented approaches to pulp preservation. Our study used these position statements as the benchmark for assessing the chatbot responses, yet even when provided with extensive guideline input, the AI platforms exhibited significant inconsistencies, with particularly poor performance under high-threshold conditions [19, 24]. The poor high-threshold performance of ChatGPT and Gemini may be due to the inherent complexity of translating sophisticated clinical guidelines into precise, contextually relevant diagnostic recommendations. In many cases, an AI model may retrieve pertinent information yet fail to integrate it in a manner that reflects the full breadth of clinical complexity, potentially because of limitations in the training data or an inability to contextualize radiographic and clinical findings in a way that mirrors the judgment of experienced clinicians [25, 26].

Another critical dimension of our evaluation is the methodological design, which involved posing ten fabricated case histories covering a range of scenarios, from minimally invasive restorative treatments to complex deep caries management involving distinct radiographic findings, and repeating every query thrice to assess intra-model consistency. This design enabled us to identify not only the average performance of each chatbot but also the variability inherent in its responses. The Global Quality Score (GQS) employed for evaluating responses offered a detailed assessment of content accuracy and contextual appropriateness.
Even minor deviations from anticipated responses can render a case invalid under stringent criteria, underscoring the sensitivity of AI systems to input variations and the necessity for near-perfect consistency in domains where patient outcomes rely on precise diagnostic and treatment protocols [27].

Beyond the quantitative metrics, the variability in chatbot performance carries important implications for clinical decision-making and patient management. In modern dental practice, particularly in endodontics, the misdiagnosis or mismanagement of deep caries can lead either to overtreatment, which may unnecessarily compromise the vitality of the pulp, or to undertreatment, with the risk of further pulp degeneration and infection. The present study’s findings suggest that the inconsistent performance of some AI chatbots may, at present, compromise their utility as stand-alone diagnostic tools, and support instead their role as adjunctive aids that complement, rather than replace, expert clinical judgment. Ethical considerations regarding the information disseminated by AI chatbots must be meticulously handled [28]. In practical terms, AI chatbots in daily dental practice and in postgraduate training should be viewed as a supplementary resource rather than a definitive diagnostic instrument. Routine use in areas such as second opinion generation, patient education, and preliminary triaging of deep caries cases can provide valuable support to clinicians, particularly in resource-limited settings or in cases where immediate expert consultation is not available [29]. Nevertheless, the current study’s findings caution against over-reliance on any single AI platform until improvements in both validity and reliability are consistently demonstrated across a wider range of clinical scenarios. It is important to remember that the choice of case histories might affect the replies and outcomes.
The study's limitations encompass possible discrepancies in assessments by individual evaluators. Future research could be enhanced by including a greater number of case histories and other clinical scenarios, as well as by engaging additional experts in the evaluation panel.

Conclusion

This study demonstrated that Perplexity exhibited the highest reliability and validity in deep caries diagnosis and management compared with ChatGPT, Claude, and Gemini. The observed variability and decline in descriptive depth across all models underscore their current limitations for direct clinical application. While these chatbots show potential as educational aids in endodontics, they cannot substitute for professional judgment. Future investigations should examine underlying performance mechanisms, evaluate broader endodontic contexts, and assess prompt design and multimodal integration. Establishing specialized, evidence-based AI models supported by regulatory oversight is essential to ensure accuracy, reliability, and transparency in clinical implementation.

Declarations

COMPETING INTEREST
The authors declare no competing interests.

ETHICS DECLARATION
Not applicable.

FUNDING
The authors received no specific funding for this work.

CONSENT TO PUBLISH DECLARATION
Not applicable.

CONSENT TO PARTICIPATE DECLARATION
Not applicable.

AVAILABILITY OF DATA AND MATERIALS
The datasets generated and/or analyzed during the present study are not publicly available due to patient privacy restrictions, but can be obtained from the corresponding author upon reasonable request.

ACKNOWLEDGEMENTS
This work was carried out at the All India Institute of Medical Sciences (AIIMS), Bathinda.

AUTHOR CONTRIBUTIONS
All the authors have made relevant contributions to the manuscript.
All five authors contributed to conceptualization, data curation, formal analysis, investigation, methodology, project administration, resources, software, supervision, validation, visualization, writing of the original draft, and review and editing. The first and second authors were responsible for project administration, resources, software, supervision, validation, visualization, writing of the original draft, formal analysis, and review and editing.

References

1. Fraser J, Webster S. What do we really know about vital pulp therapy? Evid Based Dent 2024;25:102–3. https://doi.org/10.1038/s41432-024-01008-4.
2. Colloc TNE, Tomson PL. Vital pulp therapies in permanent teeth: what, when, where, who, why and how? Br Dent J 2025;238:458–68. https://doi.org/10.1038/s41415-025-8560-3.
3. AAE Position Statement on Vital Pulp Therapy. Journal of Endodontics 2021;47:1340–4. https://doi.org/10.1016/j.joen.2021.07.015.
4. Nawal RR, Logani A, Sangwan P, Ballal NV, Gopikrishna V. Indian Endodontic Society: Position statement for deep caries management and vital pulp therapy procedures. Endodontology 2023;35:167–94. https://doi.org/10.4103/endo.endo_155_23.
5. European Society of Endodontology (ESE), Duncan HF, Galler KM, et al. European Society of Endodontology position statement: Management of deep caries and the exposed pulp. Int Endodontic J 2019;52:923–34. https://doi.org/10.1111/iej.13080.
6. Neuhaus KW, Kühnisch J, Banerjee A, et al. Organization for Caries Research-European Federation of Conservative Dentistry Consensus Report on Clinical Recommendations for Caries Diagnosis Paper II: Caries Lesion Activity and Progression Assessment. Caries Res 2024;58:511–20. https://doi.org/10.1159/000538619.
7. Huysmans M-C, Fontana M, Lussi A, et al. European Organisation for Caries Research and the European Federation of Conservative Dentistry Consensus Report on Clinical Recommendations for Caries Diagnosis: Paper III – Caries Diagnosis at the Individual Level. Caries Res 2024;58:521–32. https://doi.org/10.1159/000539427.
8. Kahler B, Taha N, Lu J, Saoud T. Vital pulp therapy for permanent teeth with diagnosis of irreversible pulpitis: biological basis and outcome. Australian Dental Journal 2023;68. https://doi.org/10.1111/adj.12997.
9. Yong D, Cathro P. Conservative pulp therapy in the management of reversible and irreversible pulpitis. Australian Dental Journal 2021;66. https://doi.org/10.1111/adj.12841.
10. Dammaschke T, Galler K, Krastl G. Current recommendations for vital pulp treatment n.d.
11. Wolters WJ, Duncan HF, Tomson PL, et al. Minimally invasive endodontics: a new diagnostic system for assessing pulpitis and subsequent treatment needs. Int Endodontic J 2017;50:825–9. https://doi.org/10.1111/iej.12793.
12. Mohammad‐Rahimi H, Ourang SA, Pourhoseingholi MA, Dianat O, Dummer PMH, Nosrat A. Validity and reliability of artificial intelligence chatbots as public sources of information on endodontics. Int Endodontic J 2024;57:305–14. https://doi.org/10.1111/iej.14014.
13. Johnson AJ, Singh TK, Gupta A. Artificial Intelligence in Conservative Dentistry and Endodontics. In: Gupta A, Singh TK, Singla E, eds. Application of Robotics in Dentistry. Singapore: Springer Nature Singapore; 2026. p. 233–56.
14. Johnson AJ, Singh TK, Gupta A, et al. Evaluation of validity and reliability of AI Chatbots as public sources of information on dental trauma. Dental Traumatology 2024:edt.13000. https://doi.org/10.1111/edt.13000.
15. Bernard A, Langille M, Hughes S, Rose C, Leddin D, Veldhuyzen Van Zanten S. A Systematic Review of Patient Inflammatory Bowel Disease Information Resources on the World Wide Web. Am J Gastroenterology 2007;102:2070–7. https://doi.org/10.1111/j.1572-0241.2007.01325.x.
16. Bland JM, Altman DG. Statistics notes: Cronbach’s alpha. BMJ 1997;314:572. https://doi.org/10.1136/bmj.314.7080.572.
17. Ayers JW, Poliak A, Dredze M, et al. Comparing Physician and Artificial Intelligence Chatbot Responses to Patient Questions Posted to a Public Social Media Forum. JAMA Intern Med 2023;183:589. https://doi.org/10.1001/jamainternmed.2023.1838.
18. Grassini E, Buzzi M, Leporini B, Vozna A. A systematic review of chatbots in inclusive healthcare: insights from the last 5 years. Univ Access Inf Soc 2025;24:195–203. https://doi.org/10.1007/s10209-024-01118-x.
19. Büyüközer Özkan H, Doğan Çankaya T, Kölüş T. The Impact of Language Variability on Artificial Intelligence Performance in Regenerative Endodontics. Healthcare 2025;13:1190. https://doi.org/10.3390/healthcare13101190.
20. Dufey-Portilla N, Frisman AB, Robles MG, et al. Assessing the validity of ChatGPT-4o and Google Gemini Advanced when responding to frequently asked questions in endodontics. J Appl Oral Sci 2025;33:e20250321. https://doi.org/10.1590/1678-7757-2025-0321.
21. Molena KF, Macedo AP, Ijaz A, et al. Assessing the Accuracy, Completeness, and Reliability of Artificial Intelligence-Generated Responses in Dentistry: A Pilot Study Evaluating the ChatGPT Model. Cureus 2024. https://doi.org/10.7759/cureus.65658.
22. Ekmekci E, Durmazpinar PM. Evaluation of different artificial intelligence applications in responding to regenerative endodontic procedures. BMC Oral Health 2025;25:53. https://doi.org/10.1186/s12903-025-05424-5.
23. Othman AA, Sharqawi AJ, MohammedAziz AA, Ali WA, Alatiyyah AA, Mirah MA. Assessing the Accuracy and Completeness of AI-Generated Dental Responses: An Evaluation of the Chat-GPT Model. Healthcare 2025;13:2144. https://doi.org/10.3390/healthcare13172144.
24. Ozdemir ZM, Yapici E. Evaluating the Accuracy, Reliability, Consistency, and Readability of Different Large Language Models in Restorative Dentistry. J Esthet Restor Dent 2025;37:1740–52. https://doi.org/10.1111/jerd.13447.
25. Arpaci A, Ozturk AU, Okur I, Sadry S. Evaluation of the accuracy of ChatGPT-4 and Gemini’s responses to the World Dental Federation’s frequently asked questions on oral health. BMC Oral Health 2025;25:1293. https://doi.org/10.1186/s12903-025-06624-9.
26. Lanzafame LRM, Gulli C, Mazziotti S, et al. Chatbots in Radiology: Current Applications, Limitations and Future Directions of ChatGPT in Medical Imaging. Diagnostics 2025;15:1635. https://doi.org/10.3390/diagnostics15131635.
27. Özbay Y, Erdoğan D, Dinçer GA. Evaluation of the performance of large language models in clinical decision-making in endodontics. BMC Oral Health 2025;25:648. https://doi.org/10.1186/s12903-025-06050-x.
28. Char DS, Shah NH, Magnus D. Implementing Machine Learning in Health Care — Addressing Ethical Challenges. N Engl J Med 2018;378:981–3. https://doi.org/10.1056/NEJMp1714229.
29. Farhadi Nia M, Ahmadi M, Irankhah E. Transforming dental diagnostics with artificial intelligence: advanced integration of ChatGPT and large language models for patient care. Front Dent Med 2025;5:1456208. https://doi.org/10.3389/fdmed.2024.1456208.

Additional Declarations
No competing interests reported.
Supplementary Files SupplementalTable12.docx SupplementalTable22.docx
Status: Under Review Version 1 posted Reviews received at journal 03 May, 2026 Reviews received at journal 29 Apr, 2026 Reviews received at journal 22 Apr, 2026 Reviews received at journal 06 Apr, 2026 Reviewers agreed at journal 06 Apr, 2026 Reviewers agreed at journal 05 Apr, 2026 Reviews received at journal 03 Apr, 2026 Reviewers agreed at journal 02 Apr, 2026 Reviewers agreed at journal 02 Apr, 2026 Reviewers agreed at journal 31 Mar, 2026 Reviewers agreed at journal 26 Mar, 2026 Reviews received at journal 03 Mar, 2026 Reviewers agreed at journal 08 Feb, 2026 Reviewers agreed at journal 05 Feb, 2026 Reviewers agreed at journal 28 Jan, 2026 Reviewers invited by journal 27 Jan, 2026 Editor invited by journal 09 Jan, 2026 Editor assigned by journal 16 Dec, 2025 Submission checks completed at journal 16 Dec, 2025 First submitted to journal 09 Dec, 2025
© Research Square 2026 | ISSN 2693-5015 (online) {"props":{"pageProps":{"initialData":{"identity":"rs-8320702","acceptedTermsAndConditions":true,"allowDirectSubmit":false,"archivedVersions":[],"articleType":"Research Article","associatedPublications":[],"authors":[{"id":582347727,"identity":"5e695d03-a283-4059-9eab-b60c327754ce","order_by":0,"name":"Ashish J Johnson","email":"","orcid":"","institution":"All India Institute of Medical Sciences","correspondingAuthor":false,"prefix":"","firstName":"Ashish","middleName":"J","lastName":"Johnson","suffix":""},{"id":582347730,"identity":"aefcbb29-9a8f-415b-8950-837b9ada2b0c","order_by":1,"name":"Tarun Kumar Singh","email":"","orcid":"","institution":"All India Institute of Medical Sciences","correspondingAuthor":true,"prefix":"","firstName":"Tarun","middleName":"Kumar","lastName":"Singh","suffix":""},{"id":582347732,"identity":"e17f5f43-e68a-4053-b9a7-46b543811247","order_by":2,"name":"R Periyasamy","email":"","orcid":"","institution":"All India Institute of Medical
Sciences","correspondingAuthor":false,"prefix":"","firstName":"R","middleName":"","lastName":"Periyasamy","suffix":""},{"id":582347734,"identity":"a650967a-c9ce-4adb-a897-2f6ec659a8f8","order_by":3,"name":"Aakash Gupta","email":"","orcid":"","institution":"All India Institute of Medical Sciences","correspondingAuthor":false,"prefix":"","firstName":"Aakash","middleName":"","lastName":"Gupta","suffix":""},{"id":582347736,"identity":"0448fb7e-b442-46f4-a318-9f62e28ee6da","order_by":4,"name":"Ikroop Gill","email":"","orcid":"","institution":"All India Institute of Medical Sciences","correspondingAuthor":false,"prefix":"","firstName":"Ikroop","middleName":"","lastName":"Gill","suffix":""}],"badges":[],"createdAt":"2025-12-09 18:53:13","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-8320702/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-8320702/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":101498278,"identity":"40a87caf-6ff4-480f-9fd7-924d01ceb585","added_by":"auto","created_at":"2026-01-30 13:05:35","extension":"png","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":74404,"visible":true,"origin":"","legend":"\u003cp\u003eMean Scores of responses from all the chatbots\u003c/p\u003e","description":"","filename":"1.png","url":"https://assets-eu.researchsquare.com/files/rs-8320702/v1/360faa57e825d452eeee6a6f.png"},{"id":101498282,"identity":"0740f651-b4f6-49e4-88f5-b8f1253ffc47","added_by":"auto","created_at":"2026-01-30 13:05:35","extension":"png","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":27486,"visible":true,"origin":"","legend":"\u003cp\u003eValid and Invalid responses on the low-threshold validity 
test.\u003c/p\u003e","description":"","filename":"2.png","url":"https://assets-eu.researchsquare.com/files/rs-8320702/v1/1d957f309308c8e6bcba6dc9.png"},{"id":101498283,"identity":"1b286c5c-ecf6-402f-8bd2-1c6da161a495","added_by":"auto","created_at":"2026-01-30 13:05:35","extension":"png","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":30569,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eValid and invalid responses on high-threshold validity\u003c/strong\u003e\u003c/p\u003e","description":"","filename":"3.png","url":"https://assets-eu.researchsquare.com/files/rs-8320702/v1/b196b0ae549d22df85371504.png"},{"id":101498281,"identity":"7f6bd7b3-dc2d-4cda-8f80-5fa40cc09542","added_by":"auto","created_at":"2026-01-30 13:05:35","extension":"png","order_by":4,"title":"Figure 4","display":"","copyAsset":false,"role":"figure","size":18285,"visible":true,"origin":"","legend":"\u003cp\u003eMean reliability score of chatbots\u003c/p\u003e","description":"","filename":"4.png","url":"https://assets-eu.researchsquare.com/files/rs-8320702/v1/0c1c2bf5940c410d4fb8bdab.png"},{"id":101755227,"identity":"7579895c-2e47-4aeb-9f1d-6415c28aac5e","added_by":"auto","created_at":"2026-02-03 10:50:02","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":673130,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-8320702/v1/ec22f55a-fc06-4996-a396-08f2f200e6ba.pdf"},{"id":101752130,"identity":"bc548edd-77ff-46a6-adc3-1faed9a9faac","added_by":"auto","created_at":"2026-02-03 
10:25:35","extension":"docx","order_by":0,"title":"","display":"","copyAsset":false,"role":"supplement","size":28861,"visible":true,"origin":"","legend":"","description":"","filename":"SupplementalTable12.docx","url":"https://assets-eu.researchsquare.com/files/rs-8320702/v1/9db6b84497e415ee5eb20263.docx"},{"id":101498280,"identity":"46173e40-d36d-4f84-8e48-4052da1e88e4","added_by":"auto","created_at":"2026-01-30 13:05:35","extension":"docx","order_by":1,"title":"","display":"","copyAsset":false,"role":"supplement","size":640963,"visible":true,"origin":"","legend":"","description":"","filename":"SupplementalTable22.docx","url":"https://assets-eu.researchsquare.com/files/rs-8320702/v1/4ade559205df090d44be1830.docx"}],"financialInterests":"No competing interests reported.","formattedTitle":"Validity and reliability of AI chatbots on the comparative diagnosis and definitive management of deep caries based on position statements evaluated from post-graduate students and clinicians' perspectives","fulltext":[{"header":"Introduction","content":"\u003cp\u003eDeep caries management represents one of the most rapidly evolving in the field of contemporary restorative and endodontic practice. Two primary factors have contributed to the enhancement of this field: the development of enhanced biomaterials and a deeper understanding of pulpal biology. They have substituted conventional invasive treatments with biologically designed, minimally invasive therapies that retain the pulp vitality to a great extent possible\u003csup\u003e\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e\u003c/sup\u003e. Vital pulp therapy (VPT), previously seen as having limited applications in adult endodontics, is now a validated and evidence-based treatment procedure in suitably selected cases. Global restorative and endodontic organisations have established clinical practice guidelines and position statements to guide practitioners in this domain. 
All differ in minor perspectives, however, all articulate the same ultimate objective: to safeguard the pulp and maintain the vitality of the tooth\u003csup\u003e\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e\u003c/sup\u003e.\u003c/p\u003e \u003cp\u003eThe American Association of Endodontists (AAE) has developed a standardised diagnostic language and clinical guidelines that serve as the benchmark for care\u003csup\u003e\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e\u003c/sup\u003e. The Indian Endodontic Society (IES) advocates for biologically oriented techniques and endorses treatment methods aimed at pulp preservation whenever possible\u003csup\u003e\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e\u003c/sup\u003e. The European Society of Endodontology (ESE) has established consensus standards that underscore pulp preservation techniques, including the incremental treatment of extensive caries\u003csup\u003e\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e\u003c/sup\u003e and also on pulpal diagnosis\u003csup\u003e\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e,\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e\u003c/sup\u003e. Australia\u003csup\u003e\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e,\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e\u003c/sup\u003e and German\u003csup\u003e\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e\u003c/sup\u003e position statements predominantly conform to these principles, establishing only minimal parameters for the definitions of the diagnostic threshold, the timing of intervention, and the measurement of therapeutic outcomes. Wolters' method introduces an additional level of evidence to the categorisation of pulpal and periapical disorders. 
It not only standardises diagnoses but also facilitates the uniformity of clinical and research settings globally\u003csup\u003e\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e\u003c/sup\u003e.\u003c/p\u003e \u003cp\u003eDespite these numerous guidelines, a significant problem persists: individuals base their diagnosis on clinical experience rather than on the evidence. Different clinicians and students may approach identical situations differently based on the significance they attribute to radiographic findings, case history, or the diagnostic method employed. These differences necessitate thorough evaluations, judicious case selection, and the formulation of individualized treatment plans. Global discourse among endodontists indicates that the management of deep caries is not merely a technical concern, but a complex process rooted in biological decision-making. The creation of shared guidelines suggests a shift towards consensus; nonetheless, the remaining variability indicates that clinical judgment remains fundamental to patient-centered treatment.\u003c/p\u003e \u003cp\u003eAI chatbots have revolutionized digital communication by markedly improving the quality of human connection. Using deep learning techniques, these chatbots are trained on vast datasets to improve the precision and relevance of their answers, using artificial neural networks loosely modelled on the networks of the human brain. In recent years, patients and clinicians have increasingly relied on AI chatbots as a source of medical information. This has been studied in endodontics\u003csup\u003e\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e,\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e\u003c/sup\u003e and dental trauma\u003csup\u003e\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e\u003c/sup\u003e in recent years.
However, investigations have demonstrated the prevalence of misinformation in some domains of medical sciences. Although all chatbots operate interactively with human inputs and strive to deliver optimal feedback, concerns have emerged over the validity and reliability of their responses, especially in medical domains.\u003c/p\u003e \u003cp\u003eAI-based chatbots have gained popularity among students and clinicians as aids to decision-making when there is uncertainty about the diagnosis, the appropriate treatment plan, or treatment facilities. The management of deep caries, as a developing concept, has consistently made selecting the appropriate diagnosis difficult. This study seeks to assess the reliability and validity of AI chatbots in the comparative diagnosis of cases based on various scenarios, and in their corresponding treatment plans according to the position statements on deep caries management issued by different global organisations, from the perspectives of postgraduate students and endodontists.\u003c/p\u003e"},{"header":"Methodology","content":"\u003ch2\u003eData collection\u003c/h2\u003e\n\u003cp\u003eThe primary application programming interface (API) of each chatbot was used to simulate real-world interactions more accurately. For ChatGPT 5, the API was accessed through the URL https://chat.openai.com/ using the Pro version, with all questions and answers obtained on the same date. The Perplexity API was accessed via https://www.perplexity.ai/, with all questions asked on the same day. The Google Gemini API was accessed through https://gemini.google.com/ using the Google Chrome browser, and all questions were asked on the same date. The Claude AI API was accessed via https://claude.ai/chats/, with all questions asked on the same day.
The Pro version of each chatbot was used to obtain high-quality answers and to avoid the usage limits of the free versions.\u003c/p\u003e\n\u003cp\u003eTen short case histories were constructed, covering scenarios ranging from minimally invasive restorative treatment to deep caries management, with radiographic findings. The case histories are available in Supplemental Table 1. The position statements on deep caries management of the AAE\u003csup\u003e3\u003c/sup\u003e, ESE\u003csup\u003e5\u003c/sup\u003e, IES\u003csup\u003e4\u003c/sup\u003e, AES\u003csup\u003e8,9\u003c/sup\u003e and German\u003csup\u003e10\u003c/sup\u003e endodontic societies, together with the Wolters method\u003csup\u003e11\u003c/sup\u003e, which provides additional evidence for the categorisation of pulpal diagnosis, were uploaded to each chatbot before any prompts were given. Each of the four AI chatbots was then asked to provide a comparative diagnosis and management of the 10 case histories based on the position statements previously uploaded. To assess validity and reliability, each question was asked three times. A new chat session was created for each question on each chatbot, and the \u0026ldquo;More balanced\u0026rdquo; conversation style was selected for the responses. The questions were asked one after another on the same chatbot, each after the previous response had been produced. The three responses for each case history were recorded and used for scoring the validity and reliability.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eBefore the chatbot responses were evaluated, journal clubs and group discussions were held on each position statement to reduce investigator bias among the clinicians and postgraduate students. All responses were assessed by two endodontic postgraduate students and three senior clinicians, including two professors (A.J.J., P.R., T.K.S., A.G., I.G.), on a 5-point Likert scale.
A revised iteration of the Global Quality Score (GQS)\u003csup\u003e15\u003c/sup\u003e was utilised to assign ratings based on the responses\u0026apos; \u0026quot;content\u0026quot; and \u0026quot;context.\u0026quot;\u003c/p\u003e\n\u003cp\u003e\u0026bull; Score 5 (Strongly Agree): The answer is correct, and the content is comprehensive.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u0026bull; Score 4 (Agree): The answer is correct and most of the content is correct, but it lacks information or contains incorrect information.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u0026bull; Score 3 (Neutral): The answer is somewhat correct, but details are primarily incorrect, missing, or irrelevant.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u0026bull; Score 2 (Disagree): The answer is incorrect, but the content includes some correct elements.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u0026bull; Score 1 (Strongly Disagree): The answer and the entire content are incorrect or irrelevant.\u003c/p\u003e\n\u003cp\u003eOnce scoring was complete, the scores were used for the validity analysis. The reliability evaluation conducted by the examiners was quantified using a binary scale (0 and 1) and analysed by Cronbach\u0026apos;s alpha methodology\u003csup\u003e16\u003c/sup\u003e.\u003c/p\u003e\n\u003ch2\u003eAnalysis of Validity\u003c/h2\u003e\n\u003cp\u003eThe responses were scored and then classified as \u0026quot;valid\u0026quot; or \u0026quot;invalid,\u0026quot; reflecting how well each chatbot\u0026apos;s response matched the intended response. Two validity tests were employed: a low-threshold test and a high-threshold test. In the low-threshold test, the threshold score was 4: a response was considered valid if the chatbot scored at least 4 on all three responses. Scores under 4 were deemed invalid. In the high-threshold test, the threshold score was 5: a response was deemed valid only if all three responses obtained a perfect score of 5.
A score below 5 rendered the response invalid. The accuracy of responses from different chatbots was compared using Fisher\u0026apos;s exact test.\u003c/p\u003e\n\u003ch2\u003eAnalysis of reliability\u003c/h2\u003e\n\u003cp\u003eReliability denotes the consistency with which a chatbot produces similar responses under the same conditions. Cronbach\u0026apos;s alpha was computed for each trio of replies throughout the 20 questions to examine response consistency. Cronbach\u0026apos;s alpha is typically reported on a scale ranging from 0 to 1, where 1 signifies perfect reliability and 0 denotes complete unreliability. A high alpha coefficient indicates that the chatbot frequently delivered comparable responses, signifying substantial reliability. A diminished alpha coefficient signifies reduced consistency, indicating lower reliability.\u003c/p\u003e"},{"header":"Results","content":"\u003cp\u003eEach of the four chatbots responded to the ten short case histories three times, giving a total of 120 responses for differential diagnosis based on the position statements. All the responses were recorded in Supplemental Table 2. With repeated prompts, the chatbots\u0026apos; descriptive explanations became shorter after the first response, making validity questionable for the more confusing case histories (the 6\u003csup\u003eth\u003c/sup\u003e case history).\u003c/p\u003e\n\u003cp\u003eThe mean scores for ChatGPT, Perplexity, Claude, and Gemini were compared across ten evaluation questions. Results showed that Gemini and Perplexity consistently achieved higher mean scores for most questions, frequently scoring above 4, while ChatGPT and Claude generally had slightly lower scores (Figure 1). Gemini demonstrated strong performance, particularly on questions 4, 5, and 7, where its scores approached or reached 4.67 and 4.33, indicating robust consistency and user preference. Perplexity also performed well, with its highest mean scores of 5 on questions 4, 5, and 8.
ChatGPT\u0026rsquo;s mean scores ranged from 2.33 to 4.67, with lower averages evident on questions 2, 4, and 6, while Claude\u0026rsquo;s means varied from 2.67 to 5, peaking at question 7, but dipping on some others. Overall, the comparative findings highlight that Perplexity and Gemini frequently outperformed the others on these evaluation metrics, with ChatGPT and Claude showing slightly more variability.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eLow-Threshold Validity Test\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eIn the low-threshold validity test, Perplexity, Claude, and Gemini demonstrated complete accuracy, each producing 100% valid responses with no invalid responses detected across all evaluation criteria. This indicates these models consistently met the validity requirements for every test item included in the assessment. In contrast, ChatGPT presented valid responses in 70% of cases, but 30% were deemed invalid, showing noticeably reduced reliability compared to its peers under the same low-threshold conditions (Figure 2). The findings highlight a clear performance distinction, with Perplexity, Claude, and Gemini exhibiting perfect validity rates while ChatGPT fell short, suggesting that the latter may require further improvement or review when assessed for low-threshold response validity.\u003c/p\u003e\n\u003cp\u003eThe intergroup comparison of low-threshold validity among the four chatbots revealed meaningful insights into their relative performance in generating valid responses. A 100% validity rate was observed for Perplexity, Claude, and Gemini, indicating that these chatbots consistently produced responses scoring 4 or higher across all three answer attempts per question. Conversely, ChatGPT demonstrated a notably lower low-threshold validity rate of 70%, with three questions falling below the validity threshold. 
Statistical testing using pairwise comparisons showed a significant difference between Perplexity and ChatGPT (P=0.049), suggesting that Perplexity\u0026rsquo;s responses were significantly more valid under this criterion. Similarly, ChatGPT\u0026rsquo;s performance was significantly poorer than Claude and Gemini, with P-values of 0.049 in both comparisons, indicating a consistent trend of lower validity for ChatGPT compared to these models. However, comparisons among Perplexity, Claude, and Gemini did not reveal any statistically significant differences, reflecting comparable validity profiles at the low-threshold level for these three chatbots (Table1).\u003c/p\u003e\n\u003cp\u003eThese findings underscore variability in chatbot response accuracy at a less stringent threshold, highlighting ChatGPT as less reliable in consistently meeting the validity cut-off. The high validity rates for Perplexity, Claude, and Gemini demonstrate robust performance in generating acceptable answers across question repetitions, a desirable trait for medical or scientific chatbot applications.\u003c/p\u003e\n\u003cp\u003eTable 1 :Intergroup comparison of low threshold validity (*-significant)(x-Non significant)\u003c/p\u003e\n\u003ctable border=\"1\" cellspacing=\"0\" cellpadding=\"0\" align=\"\" width=\"655\"\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 131px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 131px;\"\u003e\n \u003cp\u003ePerplexity\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 131px;\"\u003e\n \u003cp\u003eChat GPT\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 131px;\"\u003e\n \u003cp\u003eClaude\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 131px;\"\u003e\n \u003cp\u003eGemini\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 131px;\"\u003e\n 
\u003cp\u003ePerplexity\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 131px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 131px;\"\u003e\n \u003cp\u003eP=0.049 (*)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 131px;\"\u003e\n \u003cp\u003eP=1.000(x)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 131px;\"\u003e\n \u003cp\u003eP=1.000(x)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 131px;\"\u003e\n \u003cp\u003eChat GPT\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 131px;\"\u003e\n \u003cp\u003eP=0.049 (*)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 131px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 131px;\"\u003e\n \u003cp\u003eP=0.049 (*)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 131px;\"\u003e\n \u003cp\u003eP=0.049 (*)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 131px;\"\u003e\n \u003cp\u003eClaude\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 131px;\"\u003e\n \u003cp\u003eP=1.000(x)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 131px;\"\u003e\n \u003cp\u003eP=0.049 (*)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 131px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 131px;\"\u003e\n \u003cp\u003eP=1.000(x)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 131px;\"\u003e\n \u003cp\u003eGemini\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 131px;\"\u003e\n \u003cp\u003eP=1.000(x)\u003c/p\u003e\n 
\u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 131px;\"\u003e\n \u003cp\u003eP=0.049 (*)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 131px;\"\u003e\n \u003cp\u003eP=1.000(x)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 131px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n\u003c/table\u003e\n\u003cp\u003e\u003cstrong\u003eHigh-Threshold Validity\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eIn the high-threshold validity test, all chatbots faced considerable difficulty in producing valid responses. Perplexity performed best, with 30% valid responses, although the majority (70%) remained invalid. Claude produced only 10% valid and 90% invalid responses, indicating limited accuracy under the more stringent criterion. ChatGPT and Gemini struggled the most, producing no valid responses, with all their answers classified as invalid.
These results suggest that as the validity threshold increases, the ability of these chatbots to provide accurate responses diminishes markedly, with Perplexity performing best and ChatGPT and Gemini having the greatest difficulty meeting high-threshold validity standards.\u003c/p\u003e\n\u003cp\u003eThe intergroup comparison of high-threshold validity among the four chatbots revealed a general decline in validity rates compared with the low-threshold criteria, with more variable performance across the groups. Perplexity achieved the highest validity rate (30%), followed by Claude (10%), while both ChatGPT and Gemini had validity rates of 0%. Statistical analysis showed a significant difference between Perplexity and ChatGPT (P=0.049) and between Perplexity and Gemini (P=0.049), indicating that Perplexity\u0026apos;s responses were significantly more valid under this stricter criterion. In contrast, comparisons between Perplexity and Claude (P=0.263), ChatGPT and Claude (P=0.304), ChatGPT and Gemini (P=1.000), and Claude and Gemini (P=0.304) were not statistically significant, indicating no detectable difference in performance among these pairs (Table 2).\u003c/p\u003e\n\u003cp\u003eThese results suggest that while Perplexity maintained relatively better high-threshold validity, the other chatbots struggled to consistently produce perfect responses across all three repetitions, indicating challenges in achieving the highest level of accuracy.\u003c/p\u003e\n\u003cp\u003eTable 2: Intergroup comparison of high-threshold validity (* significant; x non-significant)\u003c/p\u003e\n\u003ctable border=\"1\" cellspacing=\"0\" cellpadding=\"0\" width=\"605\"\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 102px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 139px;\"\u003e\n \u003cp\u003ePerplexity\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 
118px;\"\u003e\n \u003cp\u003eChat GPT\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 113px;\"\u003e\n \u003cp\u003eClaude\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 132px;\"\u003e\n \u003cp\u003eGemini\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 102px;\"\u003e\n \u003cp\u003ePerplexity\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 139px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 118px;\"\u003e\n \u003cp\u003eP=0.049 (*)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 113px;\"\u003e\n \u003cp\u003eP=0.263 (x)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 132px;\"\u003e\n \u003cp\u003eP=0.049 (*)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 102px;\"\u003e\n \u003cp\u003eChat GPT\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 139px;\"\u003e\n \u003cp\u003eP=0.049 (*)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 118px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 113px;\"\u003e\n \u003cp\u003eP=0.304 (x)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 132px;\"\u003e\n \u003cp\u003eP=1.000 (x)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 102px;\"\u003e\n \u003cp\u003eClaude\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 139px;\"\u003e\n \u003cp\u003eP=0.263 (x)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 118px;\"\u003e\n \u003cp\u003eP=0.304 (x)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 113px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 132px;\"\u003e\n \u003cp\u003eP=0.304 (x)\u003c/p\u003e\n \u003c/td\u003e\n 
\u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 102px;\"\u003e\n \u003cp\u003eGemini\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 139px;\"\u003e\n \u003cp\u003eP=0.049 (*)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 118px;\"\u003e\n \u003cp\u003eP=1.000 (x)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 113px;\"\u003e\n \u003cp\u003eP=0.304 (x)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 132px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n\u003c/table\u003e\n\u003cp\u003e\u003cstrong\u003eReliability\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003ePerplexity demonstrated the highest mean reliability score of 0.90 with a standard deviation (SD) of 0.31, indicating a stable and consistent response pattern. Claude followed with a mean reliability of 0.80 (SD = 0.48) and Gemini with 0.70 (SD = 0.52), indicating moderate consistency with more fluctuation than Perplexity. ChatGPT exhibited the lowest mean reliability at 0.50 (SD = 0.42), suggesting less consistent responses across repeated questions (Figure 4).\u003c/p\u003e\n\u003cp\u003eThe intergroup comparison of reliability showed that the difference between Perplexity and ChatGPT was significant (P = 0.026), highlighting Perplexity\u0026rsquo;s superior consistency. However, comparisons between Perplexity and Claude (P = 0.586) and Perplexity and Gemini (P = 0.310) were not statistically significant. Similarly, reliability differences between ChatGPT and Claude (P = 0.154), ChatGPT and Gemini (P = 0.354), and Claude and Gemini (P = 0.660) were not significant, indicating no clear superiority among these pairs (Table 3).\u003c/p\u003e\n\u003cp\u003eThese findings suggest that Perplexity provides the most reliable and consistent responses among the chatbots evaluated. 
The moderate variability in reliability among the other chatbots, particularly ChatGPT, may impact their dependability in applications where consistent response quality is critical.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eTable 3: Intergroup reliability comparison.(*-significant)(x-Non significant)\u003c/p\u003e\n\u003ctable border=\"1\" cellspacing=\"0\" cellpadding=\"0\" width=\"676\"\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 121px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 139px;\"\u003e\n \u003cp\u003ePerplexity\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 139px;\"\u003e\n \u003cp\u003eChat GPT\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 139px;\"\u003e\n \u003cp\u003eClaude\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 139px;\"\u003e\n \u003cp\u003eGemini\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 121px;\"\u003e\n \u003cp\u003ePerplexity\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 139px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 139px;\"\u003e\n \u003cp\u003eP=0.026(*)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 139px;\"\u003e\n \u003cp\u003eP=0.586(x)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 139px;\"\u003e\n \u003cp\u003eP=0.310(x)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 121px;\"\u003e\n \u003cp\u003eChat GPT\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 139px;\"\u003e\n \u003cp\u003eP=0.026(*)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 139px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 139px;\"\u003e\n \u003cp\u003eP=0.154 (x)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 139px;\"\u003e\n 
\u003cp\u003eP=0.354 (x)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 121px;\"\u003e\n \u003cp\u003eClaude\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 139px;\"\u003e\n \u003cp\u003eP=0.586(x)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 139px;\"\u003e\n \u003cp\u003eP=0.154 (x)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 139px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 139px;\"\u003e\n \u003cp\u003eP=0.660 (x)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 121px;\"\u003e\n \u003cp\u003eGemini\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 139px;\"\u003e\n \u003cp\u003eP=0.310(x)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 139px;\"\u003e\n \u003cp\u003eP=0.354 (x)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 139px;\"\u003e\n \u003cp\u003eP=0.660 (x)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd style=\"width: 139px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n\u003c/table\u003e"},{"header":"Discussion","content":"\u003cp\u003eAI chatbots have emerged as a prominent source of information in recent times\u003csup\u003e\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e\u003c/sup\u003e. Their integration into the medical sector has enhanced resource efficiency and diminished the necessity for substantial labour, hence increasing the accessibility of medical information to the public, students and clinicians\u003csup\u003e\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e,\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e\u003c/sup\u003e. 
The present study set out to evaluate the validity and reliability of AI chatbot responses used for the comparative diagnosis and definitive management of deep caries, guided by position statements from leading global endodontic organizations. Our findings indicate marked variations among the four evaluated chatbots\u0026mdash;ChatGPT, Perplexity, Claude, and Gemini\u0026mdash;in both low-threshold and high-threshold validity tests as well as in reliability measures, which have important implications for integrating such AI tools into clinical practice for deep caries management\u003csup\u003e\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e,\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e\u003c/sup\u003e. Reaching a definitive diagnosis in deep caries management remains difficult, particularly when vital pulp therapy is indicated.\u003c/p\u003e \u003cp\u003eIn the low-threshold validity test, where a response was deemed acceptable if it achieved a score of at least 4 on a 5-point scale in all three repeated iterations, Perplexity, Claude, and Gemini each produced 100% valid responses, whereas ChatGPT achieved only a 70% validity rate. This discrepancy suggests that, although the majority of the chatbots are capable of retrieving and synthesizing guideline-based information, there remains a substantial risk that certain platforms, most notably ChatGPT, deliver responses that fail to meet even minimal clinical standards\u003csup\u003e\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e\u003c/sup\u003e. Under the high-threshold validity test, where only responses scoring a perfect 5 on all three attempts were considered valid, the performance of the evaluated chatbots deteriorated sharply. 
Perplexity achieved a 30% validity rate under these stringent standards, while both ChatGPT and Gemini recorded 0% valid responses, and Claude fell short with merely 10% valid responses. These findings highlight the difficulties AI systems encounter when adhering to rigorous criteria of clinical accuracy and comprehensiveness, especially in intricate diagnostic situations like deep caries management\u003csup\u003e\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e,\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e\u003c/sup\u003e. Case history number 6 challenged the validity of all the chatbots on every response and is a clear example of the limits of their diagnostic ability.\u003c/p\u003e \u003cp\u003eWith respect to reliability, measured using Cronbach\u0026rsquo;s alpha for repeated responses to the same prompts (case histories), Perplexity demonstrated the highest consistency with a mean reliability score of 0.90 and a modest standard deviation, indicating that its responses remained highly stable across trials. In contrast, ChatGPT exhibited a mean reliability of merely 0.50 with a higher standard deviation, signaling substantial inconsistency in its performance. Claude and Gemini fell between these extremes, with mean reliability scores of 0.80 and 0.70, respectively, albeit with greater variance than Perplexity. 
This inter-platform variability in reliability may be attributable to differences in model architecture, training data, and response generation algorithms, all of which influence the capacity of an AI system to consistently align with established clinical guidelines\u003csup\u003e\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e,\u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e23\u003c/span\u003e\u003c/sup\u003e.\u003c/p\u003e \u003cp\u003ePosition statements and guidelines provided by authoritative bodies such as the American Association of Endodontists (AAE), the Indian Endodontic Society (IES), and the European Society of Endodontology (ESE) emphasize minimally invasive, biologically oriented approaches to pulp preservation. Our study used these position statements as the benchmark for assessing the chatbot responses, yet even when provided with extensive guideline input, the AI platforms exhibited marked inconsistencies, particularly demonstrating poor performance under high-threshold conditions\u003csup\u003e\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e,\u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e24\u003c/span\u003e\u003c/sup\u003e.\u003c/p\u003e \u003cp\u003eThe poor high-threshold performance of ChatGPT and Gemini may be due to the inherent complexity in translating sophisticated clinical guidelines into precise, contextually relevant diagnostic recommendations. 
In many cases, an AI model may retrieve pertinent information yet fail to integrate it in a manner that reflects the full breadth of clinical complexity\u0026mdash;potentially due to limitations in the training data or the inability to contextualize radiographic and clinical findings in a manner that mirrors the judgment of experienced clinicians\u003csup\u003e\u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e25\u003c/span\u003e,\u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e26\u003c/span\u003e\u003c/sup\u003e. Another critical dimension of our evaluation is the methodological design that involved posing ten fabricated case histories based on a range of scenarios\u0026mdash;from minimally invasive restorative treatments to complex deep caries management involving distinct radiographic findings\u0026mdash;and repeating every query thrice to assess intra-model consistency. This rigorous experimental design enabled us to identify not only the average performance metrics of each chatbot but also the variability inherent in their responses. The Global Quality Score (GQS) employed for evaluating responses offered a detailed assessment of content accuracy and contextual appropriateness. Even minor deviations from anticipated responses can render a case invalid under stringent criteria, underscoring the sensitivity of AI systems to input variations and the necessity for near-perfect consistency in domains where patient outcomes rely on precise diagnostic and treatment protocols\u003csup\u003e\u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e27\u003c/span\u003e\u003c/sup\u003e. Beyond the quantitative metrics, the variability in chatbot performance carries important implications for clinical decision-making and patient management. 
In modern dental practice\u0026mdash;particularly in endodontics\u0026mdash;the misdiagnosis or mismanagement of deep caries can lead to either overtreatment, which may unnecessarily compromise the vitality of the pulp, or undertreatment, with the risk of further pulp degeneration and infection. The present study\u0026rsquo;s findings suggest that the inconsistent performance of some AI chatbots may, at present, compromise their utility as stand-alone diagnostic tools, and support instead their use as adjunctive aids that complement, rather than replace, expert clinical judgment. Ethical considerations regarding the information disseminated by AI chatbots must also be handled carefully\u003csup\u003e\u003cspan citationid=\"CR28\" class=\"CitationRef\"\u003e28\u003c/span\u003e\u003c/sup\u003e.\u003c/p\u003e \u003cp\u003eIn practical terms, AI chatbots adopted in daily dental practice and in postgraduate training should be viewed as a supplementary resource rather than a definitive diagnostic instrument. Routine use in areas such as second opinion generation, patient education, and preliminary triaging of deep caries cases can provide valuable support to clinicians, particularly in resource-limited settings or in cases where immediate expert consultation is not available\u003csup\u003e\u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e29\u003c/span\u003e\u003c/sup\u003e. Nevertheless, the current study\u0026rsquo;s findings caution against over-reliance on any single AI platform until improvements in both validity and reliability are consistently demonstrated across a wider range of clinical scenarios. The choice of case histories may have influenced the responses and outcomes. The study's limitations also include possible discrepancies in assessments between individual evaluators. 
Future research could be enhanced by including a greater quantity of case histories and other clinical scenarios, as well as by engaging additional experts in the evaluation panel.\u003c/p\u003e"},{"header":"Conclusion","content":"\u003cp\u003eThis study demonstrated that Perplexity exhibited the highest reliability and validity in deep caries diagnosis and management compared with ChatGPT, Claude, and Gemini. The observed variability and decline in descriptive depth across all models underscore their current limitations for direct clinical application. While these chatbots show potential as educational aids in endodontics, they cannot substitute for professional judgment. Future investigations should examine underlying performance mechanisms, evaluate broader endodontic contexts, and assess prompt design and multimodal integration. Establishing specialized, evidence-based AI models supported by regulatory oversight is essential to ensure accuracy, reliability, and transparency in clinical implementation.\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eCOMPETING INTEREST\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe authors declare no competing interests.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eETHICS DECLARATION\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eNot applicable.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eFUNDING\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe authors received no specific funding for this work\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCONSENT TO PUBLISH DECLARATION\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eNot applicable\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCONSENT TO PARTICIPATE DECLARATION\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eNot applicable\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAVAILABILITY OF DATA AND MATERIALS\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe datasets generated and/or analyzed during the present study are not publicly 
available due to patient privacy restrictions, but can be obtained from the corresponding author upon reasonable request.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eACKNOWLEDGEMENTS\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThis work was carried out at the All India Institute of Medical Sciences (AIIMS), Bathinda.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAUTHOR CONTRIBUTIONS\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eAll authors made relevant contributions to the manuscript. Conceptualization, data curation, formal analysis, investigation, methodology, project administration, resources, software, supervision, validation, visualization, writing \u0026ndash; original draft, and writing \u0026ndash; review \u0026amp; editing were carried out by all five authors. Project administration, resources, software, supervision, validation, visualization, formal analysis, writing \u0026ndash; original draft, and writing \u0026ndash; review \u0026amp; editing were carried out by the first and second authors.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\n\u003cli\u003eFraser J, Webster S. What do we really know about vital pulp therapy? Evid Based Dent 2024;25:102\u0026ndash;3. https://doi.org/10.1038/s41432-024-01008-4.\u003c/li\u003e\n\u003cli\u003eColloc TNE, Tomson PL. Vital pulp therapies in permanent teeth: what, when, where, who, why and how? Br Dent J 2025;238:458\u0026ndash;68. https://doi.org/10.1038/s41415-025-8560-3.\u003c/li\u003e\n\u003cli\u003eAAE Position Statement on Vital Pulp Therapy. Journal of Endodontics 2021;47:1340\u0026ndash;4. https://doi.org/10.1016/j.joen.2021.07.015.\u003c/li\u003e\n\u003cli\u003eNawal RR, Logani A, Sangwan P, Ballal NV, Gopikrishna V. Indian Endodontic Society: Position statement for deep caries management and vital pulp therapy procedures. Endodontology 2023;35:167\u0026ndash;94. 
https://doi.org/10.4103/endo.endo_155_23.\u003c/li\u003e\n\u003cli\u003eEuropean Society of Endodontology (ESE), Duncan HF, Galler KM, et al. European Society of Endodontology position statement: Management of deep caries and the exposed pulp. Int Endodontic J 2019;52:923\u0026ndash;34. https://doi.org/10.1111/iej.13080.\u003c/li\u003e\n\u003cli\u003eNeuhaus KW, K\u0026uuml;hnisch J, Banerjee A, et al. Organization for Caries Research-European Federation of Conservative Dentistry Consensus Report on Clinical Recommendations for Caries Diagnosis Paper II: Caries Lesion Activity and Progression Assessment. Caries Res 2024;58:511\u0026ndash;20. https://doi.org/10.1159/000538619.\u003c/li\u003e\n\u003cli\u003eHuysmans M-C, Fontana M, Lussi A, et al. European Organisation for Caries Research and the European Federation of Conservative Dentistry Consensus Report on Clinical Recommendations for Caries Diagnosis: Paper III \u0026ndash; Caries Diagnosis at the Individual Level. Caries Res 2024;58:521\u0026ndash;32. https://doi.org/10.1159/000539427.\u003c/li\u003e\n\u003cli\u003eKahler B, Taha N, Lu J, Saoud T. Vital pulp therapy for permanent teeth with diagnosis of irreversible pulpitis: biological basis and outcome. Australian Dental Journal 2023;68. https://doi.org/10.1111/adj.12997.\u003c/li\u003e\n\u003cli\u003eYong D, Cathro P. Conservative pulp therapy in the management of reversible and irreversible pulpitis. Australian Dental Journal 2021;66. https://doi.org/10.1111/adj.12841.\u003c/li\u003e\n\u003cli\u003eDammaschke T, Galler K, Krastl G. Current recommendations for vital pulp treatment. n.d.\u003c/li\u003e\n\u003cli\u003eWolters WJ, Duncan HF, Tomson PL, et al. Minimally invasive endodontics: a new diagnostic system for assessing pulpitis and subsequent treatment needs. Int Endodontic J 2017;50:825\u0026ndash;9. 
https://doi.org/10.1111/iej.12793.\u003c/li\u003e\n\u003cli\u003eMohammad‐Rahimi H, Ourang SA, Pourhoseingholi MA, Dianat O, Dummer PMH, Nosrat A. Validity and reliability of artificial intelligence chatbots as public sources of information on endodontics. Int Endodontic J 2024;57:305\u0026ndash;14. https://doi.org/10.1111/iej.14014.\u003c/li\u003e\n\u003cli\u003eJohnson AJ, Singh TK, Gupta A. Artificial Intelligence in Conservative Dentistry and Endodontics. In: Gupta A, Singh TK, Singla E, eds. \u003cem\u003eApplication of Robotics in Dentistry\u003c/em\u003e. Singapore: Springer Nature Singapore; 2026. p. 233\u0026ndash;56.\u003c/li\u003e\n\u003cli\u003eJohnson AJ, Singh TK, Gupta A, et al. Evaluation of validity and reliability of AI Chatbots as public sources of information on dental trauma. Dental Traumatology 2024:edt.13000. https://doi.org/10.1111/edt.13000.\u003c/li\u003e\n\u003cli\u003eBernard A, Langille M, Hughes S, Rose C, Leddin D, Veldhuyzen Van Zanten S. A Systematic Review of Patient Inflammatory Bowel Disease Information Resources on the World Wide Web. Am J Gastroenterology 2007;102:2070\u0026ndash;7. https://doi.org/10.1111/j.1572-0241.2007.01325.x.\u003c/li\u003e\n\u003cli\u003eBland JM, Altman DG. Statistics notes: Cronbach\u0026rsquo;s alpha. BMJ 1997;314:572\u0026ndash;572. https://doi.org/10.1136/bmj.314.7080.572.\u003c/li\u003e\n\u003cli\u003eAyers JW, Poliak A, Dredze M, et al. Comparing Physician and Artificial Intelligence Chatbot Responses to Patient Questions Posted to a Public Social Media Forum. JAMA Intern Med 2023;183:589. https://doi.org/10.1001/jamainternmed.2023.1838.\u003c/li\u003e\n\u003cli\u003eGrassini E, Buzzi M, Leporini B, Vozna A. A systematic review of chatbots in inclusive healthcare: insights from the last 5 years. Univ Access Inf Soc 2025;24:195\u0026ndash;203. 
https://doi.org/10.1007/s10209-024-01118-x.\u003c/li\u003e\n\u003cli\u003eB\u0026uuml;y\u0026uuml;k\u0026ouml;zer \u0026Ouml;zkan H, Doğan \u0026Ccedil;ankaya T, K\u0026ouml;l\u0026uuml;ş T. The Impact of Language Variability on Artificial Intelligence Performance in Regenerative Endodontics. Healthcare 2025;13:1190. https://doi.org/10.3390/healthcare13101190.\u003c/li\u003e\n\u003cli\u003eDufey-Portilla N, Frisman AB, Robles MG, et al. Assessing the validity of ChatGPT-4o and Google Gemini Advanced when responding to frequently asked questions in endodontics. J Appl Oral Sci 2025;33:e20250321. https://doi.org/10.1590/1678-7757-2025-0321.\u003c/li\u003e\n\u003cli\u003eMolena KF, Macedo AP, Ijaz A, et al. Assessing the Accuracy, Completeness, and Reliability of Artificial Intelligence-Generated Responses in Dentistry: A Pilot Study Evaluating the ChatGPT Model. Cureus 2024. https://doi.org/10.7759/cureus.65658.\u003c/li\u003e\n\u003cli\u003eEkmekci E, Durmazpinar PM. Evaluation of different artificial intelligence applications in responding to regenerative endodontic procedures. BMC Oral Health 2025;25:53. https://doi.org/10.1186/s12903-025-05424-5.\u003c/li\u003e\n\u003cli\u003eOthman AA, Sharqawi AJ, MohammedAziz AA, Ali WA, Alatiyyah AA, Mirah MA. Assessing the Accuracy and Completeness of AI-Generated Dental Responses: An Evaluation of the Chat-GPT Model. Healthcare 2025;13:2144. https://doi.org/10.3390/healthcare13172144.\u003c/li\u003e\n\u003cli\u003eOzdemir ZM, Yapici E. Evaluating the Accuracy, Reliability, Consistency, and Readability of Different Large Language Models in Restorative Dentistry. J Esthet Restor Dent 2025;37:1740\u0026ndash;52. https://doi.org/10.1111/jerd.13447.\u003c/li\u003e\n\u003cli\u003eArpaci A, Ozturk AU, Okur I, Sadry S. Evaluation of the accuracy of ChatGPT-4 and Gemini\u0026rsquo;s responses to the World Dental Federation\u0026rsquo;s frequently asked questions on oral health. BMC Oral Health 2025;25:1293. 
https://doi.org/10.1186/s12903-025-06624-9.\u003c/li\u003e\n\u003cli\u003eLanzafame LRM, Gulli C, Mazziotti S, et al. Chatbots in Radiology: Current Applications, Limitations and Future Directions of ChatGPT in Medical Imaging. Diagnostics 2025;15:1635. https://doi.org/10.3390/diagnostics15131635.\u003c/li\u003e\n\u003cli\u003e\u0026Ouml;zbay Y, Erdoğan D, Din\u0026ccedil;er GA. Evaluation of the performance of large language models in clinical decision-making in endodontics. BMC Oral Health 2025;25:648. https://doi.org/10.1186/s12903-025-06050-x.\u003c/li\u003e\n\u003cli\u003eChar DS, Shah NH, Magnus D. Implementing Machine Learning in Health Care \u0026mdash; Addressing Ethical Challenges. N Engl J Med 2018;378:981\u0026ndash;3. https://doi.org/10.1056/NEJMp1714229.\u003c/li\u003e\n\u003cli\u003eFarhadi Nia M, Ahmadi M, Irankhah E. Transforming dental diagnostics with artificial intelligence: advanced integration of ChatGPT and large language models for patient care. Front Dent Med 2025;5:1456208. 
https://doi.org/10.3389/fdmed.2024.1456208.

Abstract

Aim:
This study aimed to evaluate the validity and reliability of prominent AI chatbots (ChatGPT, Perplexity, Claude, and Gemini) in the comparative diagnosis and definitive management of deep caries, guided by global position statements from endodontic organizations, as assessed by post-graduate students and clinicians.

Methods:
Four AI chatbots (ChatGPT, Perplexity, Claude, and Gemini) were accessed through their respective APIs using pro versions. Ten short case histories representing a spectrum of deep caries scenarios, along with corresponding position statements from the European Society of Endodontology, American Association of Endodontists, Indian Endodontic Society and others, were provided to each chatbot. Chatbots were prompted to generate diagnostic and management responses, which were repeated thrice per case per chatbot. Responses were evaluated by two postgraduate students and three senior clinicians using a 5-point Likert scale and an adapted Global Quality Score (GQS) for validity, and Cronbach's alpha for reliability. Statistical analysis included low- and high-threshold validity tests and intergroup reliability comparisons.

Conclusion:
Perplexity exhibited the highest reliability and validity in deep caries diagnosis and management compared to ChatGPT, Claude, and Gemini. While Perplexity, Claude, and Gemini demonstrated perfect or near-perfect validity at low-threshold criteria, only Perplexity maintained moderate validity at high-stringency levels. Overall variability and reduced descriptive depth across all chatbot outputs highlight current limitations for clinical implementation. AI chatbots may serve as useful educational or adjunctive tools but cannot substitute professional judgment in endodontic diagnosis and treatment. Future development should focus on enhancing performance mechanisms and regulatory oversight to support clinical accuracy and reliability.

Journal: BMC Oral Health (submitted 2025-12-09; status: under review)
Preprint: version 1, posted January 30th, 2026, https://doi.org/10.21203/rs.3.rs-8320702/v1
License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
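The Methods name Cronbach's alpha as the reliability measure but do not state how the score matrix was arranged. As a minimal sketch only, the following assumes each row is one rated case and each column is one of the repeated responses (the "items"); that layout, the function name `cronbach_alpha`, and the example ratings are illustrative assumptions, not data or code from the study. It uses the standard formula alpha = k/(k-1) * (1 - sum of item variances / variance of row totals).

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a score matrix.

    scores: list of rows, one row per rated case, one column per item.
    The row/column layout is an assumption made for this illustration.
    """
    k = len(scores[0])
    if k < 2 or len(scores) < 2:
        raise ValueError("need at least two items and two rows")
    if any(len(row) != k for row in scores):
        raise ValueError("all rows must have the same number of items")

    # Sample variance of each item (column) across cases.
    item_vars = [variance(col) for col in zip(*scores)]
    # Sample variance of the per-case total scores.
    total_var = variance([sum(row) for row in scores])
    if total_var == 0:
        raise ValueError("total score variance is zero; alpha is undefined")

    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical 5-point Likert ratings: 5 cases x 3 repeated responses.
example_scores = [
    [4, 4, 5],
    [3, 3, 3],
    [5, 5, 4],
    [2, 3, 2],
    [4, 5, 4],
]
alpha = cronbach_alpha(example_scores)
print(f"Cronbach's alpha: {alpha:.3f}")
```

Values closer to 1 indicate that the repeated responses were scored more consistently across cases; the thresholds the authors applied are not given in the abstract.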
