Ai Chatbots for Pediatric Fluoride Education: An Effectiveness Study | Research Square

Article

Nevra Karamüftüoğlu, Ezgi Aydın Varol, Cenkhan Bal

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-6993451/v1
This work is licensed under a CC BY 4.0 License.
Status: Published. Journal publication published 29 Nov, 2025. Read the published version in Scientific Reports.
Version 1 posted 16

Abstract

Background: Fluoride is a cornerstone of preventive pediatric dentistry, yet public concerns and online misinformation continue to undermine its acceptance. With the rise of large language model-based (LLM) chatbots, artificial intelligence (AI) tools have emerged as potential resources for delivering accessible, evidence-based health information.
Objective: This study aims to evaluate the performance of three advanced AI chatbots (ChatGPT-4.o, Google Gemini Pro, and DeepSeek V3) in providing fluoride-related information to parents and caregivers, with a specific focus on pediatric dental health.

Methods: Twenty frequently asked fluoride-related questions were presented to each chatbot in standardized sessions. Responses were assessed by three blinded evaluators using validated tools: EQIP, DISCERN, Global Quality Scale (GQS), Flesch Reading Ease Score (FRES), Flesch-Kincaid Reading Grade Level (FKRGL), and iThenticate similarity index. Inter-rater reliability was ensured via intraclass correlation coefficients (ICCs). Statistical analysis was performed using ANOVA or Kruskal–Wallis tests, with appropriate post-hoc methods.

Results: ChatGPT-4.o outperformed the other models in EQIP and DISCERN scores (p < 0.001), indicating higher reliability and informational quality. While FRES and Similarity Index showed no significant differences, ChatGPT produced more readable and original content. All three models showed moderate variability in FKRGL and GQS outcomes.

Conclusion: Among the evaluated AI chatbots, ChatGPT-4.o demonstrated superior performance in conveying fluoride-related information in a clear, reliable, and evidence-based manner. While promising as educational tools in pediatric oral health, these models should be complemented with professional oversight to ensure accuracy and appropriateness in clinical use.

Health sciences/Health care, Health sciences/Medical research, Artificial Intelligence, Chatbots, Fluoride, Pediatric Dentistry

1. INTRODUCTION

Fluoride remains a cornerstone of preventive dentistry due to its well-established effects on inhibiting demineralization, enhancing remineralization, and reducing cariogenic bacterial activity (ten Cate, 2013).
Community-level strategies such as water fluoridation and fluoride toothpaste use have significantly reduced caries prevalence and are considered both effective and cost-efficient (CDC, 2001; Griffin et al., 2007). In pediatric populations, professionally applied fluoride varnishes and gels are routinely recommended, particularly for children at increased risk of caries (Weyant et al., 2013 ). Despite its proven efficacy, fluoride continues to be subject to public misperceptions and misinformation, particularly in digital environments. Parental concerns, safety hesitancy, and conflicting online narratives often hinder the adoption of fluoride-based preventive measures (Samaranayake et al., 2025 ). In this context, there is a growing need for trustworthy, accessible, and comprehensible digital resources that can effectively deliver fluoride-related information, especially to caregivers (Spittle, 2024 ). Recent developments in artificial intelligence (AI), particularly large language models (LLMs) like ChatGPT and Google Gemini, have introduced new avenues for digital health communication. These AI chatbots are capable of providing on-demand, semantically rich, and interactive responses that can enhance user understanding of health topics, including oral care (Jin et al., 2024 ; Zhou et al., 2022 ; Pupong et al., 2025 ). Studies have shown that AI tools designed with reliable and user-centered principles can facilitate health behavior change and support decision-making in pediatric care (Li et al., 2023 ; Han & Lee, 2023 ; Bhattacharya & Pissurlenkar, 2023 ). As online health-seeking behavior becomes more prevalent—with over 80% of individuals relying on internet searches for medical guidance—there is increasing demand for credible and readable content that aligns with evidence-based guidelines (Buldur & Sezer, 2024 ). LLMs have rapidly gained popularity due to their scalability and natural language generation capabilities. 
Models such as ChatGPT-4.o, Google Gemini Pro, and the newly released DeepSeek V3 are now widely used in health information delivery. These systems differ in architecture, source access, and training data, which may influence the consistency, quality, and originality of their responses (DeepSeek Research Team, 2025). While AI chatbots offer promising solutions to enhance oral health literacy, concerns remain about the factual accuracy, ethical transparency, and clinical applicability of their outputs. Therefore, critical evaluation of their performance in conveying fluoride information is essential. This study aims to assess and compare the responses of three advanced AI chatbots, namely ChatGPT-4.o (OpenAI, San Francisco, CA, USA), Google Gemini Pro (Google DeepMind), and DeepSeek V3 (Hangzhou DeepSeek AI Co., Ltd.), across dimensions of accuracy, reliability, readability, and originality, specifically in the context of pediatric dentistry and caregiver education.

2. MATERIALS AND METHODS

2.1. Formulation of Questions on Fluoride

This study followed a multi-step methodology to formulate, distribute, and analyze fluoride-related queries presented to the selected chatbot models (Fig. 1). Questions regarding fluoride were developed based on the most recent guideline on fluoride published by the American Academy of Pediatric Dentistry (15). Before formulating the questions, a preliminary search was conducted using the term "fluoride" via the Google search engine (1998, USA) to gain background information and to identify the most common topics of interest among parents. Subsequently, the guideline was reviewed with a focus on the questions most frequently asked by parents, and an initial question set was developed by two pediatric dentists with twenty-five years of clinical experience. This initial set was then reviewed and revised for clarity and relevance by a third pediatric dentist.
These questions were directed to ChatGPT-4.o, Google Gemini Pro, and DeepSeek V3 without using any guiding prompts. A separate conversation window was opened in each chatbot interface for each question. The final set of questions was classified under two main categories: (1) general information (basic concepts and evidence-based knowledge) and (2) clinical evaluations/clinical applications (practical aspects and procedural details). All questions were presented to each AI model in independent conversation sessions (Fig. 2). This approach was implemented to ensure standardization during the evaluation process.

2.2. Identification of Questions Related to Fluoride

The methodological framework of the study is summarized in Fig. 1. As no human or animal material was used in this study, ethical approval was not required. To gather information related to fluoride, the following AI models were utilized: ChatGPT-4.o, Google Gemini Pro, and DeepSeek V3. The aim of the study was to evaluate the accuracy, completeness, and clarity of the information provided by these AI models to patients and parents regarding fluoride. To achieve this aim, the AI models' ability to respond to targeted questions and engage in simulated patient-provider dialogues was assessed. The evaluation was conducted using predefined questions that reflect commonly asked patient inquiries related to fluoride. The responses generated by the models were systematically analyzed to determine their reliability, comprehensiveness, and potential applicability in patient education and clinical decision-making processes.

2.3. Evaluation Criteria

Various tools were used to analyze the responses generated by the artificial intelligence models: EQIP (Ensuring Quality Information for Patients): A tool used to assess the validity and reliability of health-related information (16,17). DISCERN: A standardized instrument designed to evaluate the credibility and accuracy of medical content (Fig. 3) (18).
Global Quality Scale (GQS) : Originally developed to assess the quality of educational video content, this tool was adapted to evaluate AI-generated text responses (19). Flesch Reading Ease Score (FRES) and Flesch-Kincaid Reading Grade Level (FKRGL) : These tools were used to measure the readability and comprehensibility of the responses (20,21). Similarity Index (Plagiarism Detection) : The originality of the responses was assessed using the iThenticate software. The responses generated by the AI system were independently evaluated by three researchers, all of whom are co-authors of this manuscript. As this study did not involve human participants or personal data, ethical approval was not required. A calibration process was conducted to improve inter-rater reliability, and consistency was ensured by calculating the Intraclass Correlation Coefficient (ICC). In this study, the Global Quality Scale (GQS) was used as one of the evaluation criteria. Originally developed to assess the educational value of video content, the GQS can also be applied to evaluate the quality of content in various formats. During the evaluation process, factors such as informational level, content quality, consistency, and benefit to patients were taken into account. The scoring system ranges from 1 to 5, with 5 representing the highest quality. In this study, the scale was adapted to the text format, and AI-generated responses were analyzed accordingly. A score of 1 indicates low-quality content with insufficient information and little to no value for individuals, while a score of 5 reflects highly consistent, high-quality content that is valuable and beneficial to individuals (Fig. 4 ) (18). FRES and FKRGL tests are evaluation methods used to measure the readability and comprehensibility of texts (19,20). These tests determine the reading difficulty of an English text based on criteria such as sentence length and the number of syllables in words. 
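As an illustration of how these two readability scores are derived from sentence length and syllable counts, the following is a minimal Python sketch of the standard Flesch formulas. It is not the tool used in this study; the helper names (`count_syllables`, `text_counts`) and the vowel-group syllable heuristic are assumptions made for the example, and dedicated readability software will usually count syllables more accurately.

```python
import re


def count_syllables(word: str) -> int:
    """Estimate syllables by counting vowel groups (a rough heuristic, assumed here)."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # A trailing silent "e" (but not "-le") usually does not add a syllable.
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def text_counts(text: str) -> tuple[int, int, int]:
    """Return (total words, total sentences, total syllables) for an English text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return len(words), len(sentences), syllables


def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """FRES: higher values mean easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """FKRGL: approximate U.S. school grade needed to read the text."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


if __name__ == "__main__":
    sample = "Fluoride helps protect teeth. Brush twice daily with fluoride toothpaste."
    w, s, syl = text_counts(sample)
    print(f"words={w}, sentences={s}, syllables={syl}")
    print(f"FRES={flesch_reading_ease(w, s, syl):.1f}")
    print(f"FKRGL={flesch_kincaid_grade(w, s, syl):.1f}")
```

Because the syllable count is only estimated, scores from this sketch may differ slightly from those reported by commercial readability checkers.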
Flesch Reading Ease Formula:

$$\text{FRES} = 206.835 - 1.015 \times \frac{\text{total words}}{\text{total sentences}} - 84.6 \times \frac{\text{total syllables}}{\text{total words}}$$

Flesch-Kincaid Grade Level Formula:

$$\text{FKRGL} = 0.39 \times \frac{\text{total words}}{\text{total sentences}} + 11.8 \times \frac{\text{total syllables}}{\text{total words}} - 15.59$$

The FRES test scores readability on a scale from 0 to 100. Higher scores indicate easier-to-understand content, while lower scores reflect more complex texts. For example: 90–100: Very easy (suitable for 5th-grade students); 60–70: Standard (8th–9th grade); 30–50: College level; 0–10: Requires professional-level understanding. The FKRGL test indicates the educational grade level needed to comprehend the text within the U.S. school system: 0–6: Basic reading level; 6–12: Intermediate; 12–18: Advanced reading level. Both tests assume that longer sentences and multi-syllabic words reduce readability, while short and simple sentences enhance comprehension. Microsoft recommends an FRES score of 60–70 and an FKRGL score of 7.0–8.0 for standard documents (22).

The Similarity Index was used to determine the extent to which AI-generated responses overlapped with existing textual content from various databases. The primary goal was to detect possible instances of plagiarism and assess the originality of the responses. All responses were uploaded to iThenticate (http://www.ithenticate.com) software, and similarity ratios were calculated as percentages. Similarity levels were categorized as follows: 0–10%: Highly original; 10–20%: Acceptable similarity; 20–40%: High similarity; 40–100%: Very high similarity.

The responses to the 20 fluoride-related questions generated by the AI models were analyzed according to predetermined criteria by three researchers. A calibration process was implemented to enhance inter-rater reliability.
During this process, all criteria were thoroughly explained to the evaluators, and a shared interpretation method was developed. After training, responses to 10 questions (excluded from the study) were evaluated, and 14 days later, the same responses were re-evaluated and compared. During the calibration process, consistency between the scores assigned by the two evaluators was assessed using the Intraclass Correlation Coefficient (ICC). Additionally, similarity in repeated measurements across different times and raters was analyzed using the test–retest method and evaluated again with the ICC. The results showed that both intra- and inter-observer ICC values were above 0.700, indicating that all three evaluators were competent to conduct the study.

2.4. Statistical Analysis

All statistical analyses were performed using IBM SPSS Statistics version 25.0 (IBM Corp., Armonk, NY, USA). Descriptive statistics, including mean, standard deviation (SD), median, and interquartile range (IQR), were computed for each chatbot model (ChatGPT, Gemini, and DeepSeek) across six evaluation domains: EQIP, DISCERN, GQS, FRES, FKRGL, and Similarity Index. Normality of data distribution was assessed using the Shapiro-Wilk test, while Levene's test was used to evaluate the homogeneity of variances. For datasets that satisfied parametric assumptions, one-way analysis of variance (ANOVA) was employed, followed by Tukey's Honestly Significant Difference (HSD) post-hoc test for multiple comparisons. In cases where parametric assumptions were violated, the Kruskal-Wallis test was applied, with Bonferroni-adjusted Mann–Whitney U tests conducted post hoc to determine pairwise differences. A p-value below 0.05 was considered statistically significant for all analyses. The selection of appropriate tests for each metric was based on distributional characteristics. Accordingly: ANOVA was used for FRES and EQIP scores.
Kruskal-Wallis was applied for FKRGL, DISCERN, and Similarity Index. Tukey HSD was used post hoc for EQIP scores. Mann–Whitney U with Bonferroni correction was used post hoc for FKRGL and DISCERN. These procedures enabled a rigorous comparison of chatbot performance across various aspects of quality, reliability, readability, and originality. 3. RESULTS 3.1. Descriptive Analysis of General and Clinical Responses Descriptive statistics for the chatbot responses were summarized under two main domains: general knowledge questions and clinical application questions. These included mean, standard deviation (SD), median, and interquartile ranges (Q1–Q3). General Questions : The highest mean score for general questions was observed in the ChatGPT-4.o model (M = 4.32, SD = 0.43), followed by Gemini (M = 3.74, SD = 0.66) and DeepSeek V3 (M = 3.48, SD = 0.57). The median values followed a similar trend. Clinical Questions : ChatGPT-4.o also scored highest on clinical questions (M = 4.20, SD = 0.48), with Gemini and DeepSeek trailing behind. 3.2. Comparative Evaluation Across Quality Metrics To assess the overall performance of the chatbot models across multiple dimensions, statistical analyses were conducted for the following metrics: EQIP, DISCERN, GQS, FRES, FKRGL, and Similarity Index. FRES and Similarity Index : No statistically significant differences were observed among the three models (ANOVA for FRES, p = 0.12; Kruskal-Wallis for Similarity Index, p = 0.54). FKRGL : A significant difference was found (Kruskal-Wallis, p = 0.041), though post-hoc comparisons did not reach statistical significance after Bonferroni correction. EQIP : ChatGPT-4.o outperformed both DeepSeek and Gemini ( p < 0.001), confirmed by Tukey HSD post-hoc analysis. DISCERN : Significant differences were found (Kruskal-Wallis, p < 0.001), with ChatGPT-4.o again outperforming the others in pairwise comparisons. 3.3. 
Summary of Pairwise Differences Post-hoc analyses confirmed that: ChatGPT-4.o was superior to both Gemini and DeepSeek in terms of EQIP and DISCERN scores ( p < 0.05). There were no significant differences between Gemini and DeepSeek in these metrics. FKRGL differences were marginal and not statistically significant after correction. 3.4. Graphical Representation Figure 5 and Figure 6 illustrate the distribution of chatbot performance across general and clinical domains. Figure 7 illustrates the distribution of EQIP (Ensuring Quality Information for Patients) scores across the three chatbot models—ChatGPT, DeepSeek, and Gemini. The box plot displays the median values, interquartile ranges, and potential outliers for each model, allowing a comparative evaluation of the quality of patient-directed health information provided by these AI systems. Figure 8 demonstrates the comparative distribution of DISCERN scores among ChatGPT, DeepSeek, and Gemini. The results indicate that ChatGPT achieved substantially higher median and overall scores, suggesting superior reliability and quality of health information related to treatment choices. In contrast, DeepSeek and Gemini performed similarly, but with lower scores and narrower ranges. Figure 9 presents the Flesch-Kincaid Reading Grade Level (FKRGL) scores of the chatbot-generated responses. Among the models, DeepSeek and Gemini exhibited higher median FKRGL scores, indicating that their outputs were written at a more advanced reading level. ChatGPT's responses, on the other hand, were more accessible, requiring a lower grade-level comprehension. While DeepSeek showed greater score variability, Gemini maintained a narrower and more consistent range. Figure 10 illustrates the Flesch Reading Ease Score (FRES) distributions across the chatbot models. ChatGPT achieved the highest median FRES values, indicating that its responses were the most readable. 
DeepSeek showed moderate readability scores with a wider range, while Gemini's lower FRES values suggest that its responses were more difficult to comprehend. These findings imply that ChatGPT may be more suitable for delivering easily understandable health information.

4. DISCUSSION

This study presents a comprehensive evaluation of three AI-based chatbot models (ChatGPT-4.o, Google Gemini Pro, and DeepSeek V3) regarding their performance in communicating fluoride-related information to pediatric patients and their caregivers. The findings reveal that ChatGPT-4.o consistently outperformed other models in both general knowledge and clinical application domains, particularly in EQIP and DISCERN metrics, which assess informational quality and reliability. Although no statistically significant differences were observed in FRES and Similarity Index scores, a notable trend emerged indicating ChatGPT's superiority in delivering readable and original content. These findings align with previous literature suggesting that large language models (LLMs) based on GPT architecture are capable of generating coherent, informative, and medically relevant responses when appropriately prompted (Lyu et al., 2024; Haque et al., 2023). The high EQIP and DISCERN scores obtained by ChatGPT-4.o reflect its strength in presenting balanced, evidence-based, and clearly structured content, qualities essential for patient and caregiver education. In contrast, Gemini Pro and DeepSeek V3 showed moderate performance, often falling short in completeness and clinical applicability. The variation in performance across AI models underscores the importance of model selection in digital health applications. Chatbots may serve as accessible, low-cost tools for augmenting oral health literacy, especially in underserved communities. Pupong et al.
( 2025 ) demonstrated that a chatbot developed for young children’s oral care was well accepted by parents and improved health behavior outcomes when evaluated using a mixed-methods approach. Their findings support the feasibility of chatbot use in pediatric oral health education when systems are tailored to user needs and include behaviorally oriented language. Similarly, Han and Lee ( 2023 ) reported that AI-driven conversational agents significantly improved adherence to health-promoting behaviors such as routine hygiene, suggesting a possible translation to fluoride application and brushing habits in children. These results are consistent with Li et al. ( 2023 ), who showed that chatbots could improve mental well-being and engagement in self-care routines. From a trust perspective, Jin et al. ( 2024 ) highlighted that gendered cues and personalization in chatbot design increase acceptance, especially among female caregivers—a finding with implications for pediatric oral health campaigns. Moreover, efforts to align chatbot content with plain language and health literacy standards can help prevent miscommunication, particularly in vulnerable populations (Zhou et al., 2022 ). Finally, Bhattacharya and Pissurlenkar ( 2023 ) underscore that chatbots in healthcare must be rigorously validated to avoid spreading misinformation, especially in areas like fluoride use where public perception is polarized. This calls for integrating clinical oversight and transparency into AI systems. Buldur and Sezer ( 2024 ) found that ChatGPT-4 provided content that was comparable in quality to official FDA guidance when responding to frequently asked questions about dental amalgam, indicating a promising degree of informational alignment. However, they also noted that ChatGPT's phrasing and perspectives occasionally differed from regulatory language, highlighting the importance of content review. Similarly, Gugnani et al. 
( 2024 ) evaluated ChatGPT’s answers from a parental perspective on pediatric oral health and found that while responses were generally clear and logical, gaps in clinical nuance and missing preventive advice were common. This reinforces our findings that ChatGPT-4 may be suitable as a supplementary educational tool but should not replace expert guidance. Despite the potential of AI chatbots in health education, their use must be approached with caution. One concern is the risk of misinformation, especially when responses lack source attribution or contain outdated guidance. Additionally, the similarity index revealed that while most responses were original, certain overlaps with existing online content still exist. Both studies support the notion that although ChatGPT-4 demonstrates consistency and surface-level accuracy, its role in healthcare communication requires further validation through expert and interdisciplinary scrutiny. Thus, relying solely on chatbot-generated content for health decision-making may not be advisable without professional oversight. The present study’s strengths include its systematic evaluation across multiple quality metrics, the use of standardized questions, and rigorous inter-rater reliability procedures. However, limitations must also be acknowledged. First, the study focused exclusively on English-language responses, which may not reflect performance in other languages. Second, the analysis was limited to fluoride-related content, and broader health topics may yield different results. Lastly, the simulated nature of the chatbot interactions may not fully capture real-world variability in user queries and follow-up questions. Future research should expand to include multilingual evaluations, assess responses to more complex patient scenarios, and explore the integration of chatbot tools within clinical workflows. 
Furthermore, collaborations between AI developers and health professionals could lead to the development of specialized medical models optimized for patient education and safety. In conclusion, while AI-based chatbots—especially ChatGPT-4.o —demonstrate promising capabilities in conveying fluoride-related information, their deployment in clinical or educational settings should be supported by validation, regulation, and ethical oversight to ensure safety and effectiveness. 5. CONCLUSION This study offers a novel and methodologically robust evaluation of how artificial intelligence chatbots convey fluoride-related information to pediatric patients and caregivers. Among the models assessed, ChatGPT-4.o consistently demonstrated superior performance across multiple metrics of quality, reliability, and readability. Its higher EQIP and DISCERN scores suggest that it is more capable of delivering accurate, evidence-based, and user-friendly health information than Gemini Pro or DeepSeek V3. However, while ChatGPT-4.o shows potential as a digital health education tool, our findings also highlight the limitations and risks associated with relying on AI-generated content in clinical contexts. Variability in response quality, lack of clinical nuance, and moderate reading complexity underscore the need for cautious implementation and professional validation. Ultimately, AI chatbots may serve as valuable adjuncts to professional guidance in promoting oral health literacy, especially in settings where access to care is limited. To fully realize their potential, future efforts should prioritize model refinement, transparency of sources, and user-centered design tailored to diverse literacy levels. Declarations Ethics Approval and Consent to Participate This study did not involve human participants, personal data, or biological materials. It was based solely on the evaluation of publicly available AI-generated content. 
Therefore, ethics committee approval or Institutional Review Board (IRB) review was not required.

Funding Declaration
This research received no external funding.

Human Ethics and Consent to Participate Declarations
Not applicable.

Author Contributions
All authors contributed significantly to the development of this manuscript. Nevra Karamüftüoğlu: Conceptualization, methodology, data analysis, writing (original draft preparation). Ezgi Aydın Varol: Data collection, software implementation, review and editing. Cenkhan Bal: Evaluation framework design, statistical validation, visualization. All authors read and approved the final version of the manuscript.

Competing Interests
The authors declare that they have no financial or non-financial competing interests related to the content of this manuscript.

Data availability
The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.

References

Bhattacharya, B. S. & Pissurlenkar, V. S. Assistive chatbots for healthcare: A succinct review. arXiv preprint arXiv:2308.04178 (2023). https://arxiv.org/abs/2308.04178
Buldur, M. & Sezer, B. Evaluating the accuracy of Chat Generative Pre-trained Transformer version 4 (ChatGPT-4) responses to United States Food and Drug Administration (FDA) frequently asked questions about dental amalgam. BMC Oral Health. 24, 605. https://doi.org/10.1186/s12903-024-04358-8 (2024).
Centers for Disease Control and Prevention (CDC). Recommendations for using fluoride to prevent and control dental caries in the United States. Morb. Mortal. Wkly Rep. 50 (RR14), 1–42 (2001).
DeepSeek Research Team. DeepSeek-LLM: A family of open LLMs and math reasoning models [arXiv preprint]. arXiv (2025). https://arxiv.org/pdf/2501.12948
Gugnani, N., Pandit, I. K., Gupta, M., Gugnani, S. & Kathuria, S. Parental concerns about oral health of children: Is ChatGPT helpful in finding appropriate answers?
J. Indian Soc. Pedod. Prev. Dentistry. 42 (2), 104–111. https://doi.org/10.4103/jisppd.jisppd_110_24 (2024).
Han, M. & Lee, J. Systematic review and meta-analysis of the effectiveness of chatbots for promoting physical activity, healthy eating, and sleep. npj Digit. Med. 6 (1), 236. https://doi.org/10.1038/s41746-023-00979-5 (2023).
Jin, E., Ryoo, Y., Kim, W. J. & Song, Y. G. Bridging the health literacy gap through AI chatbot design: The impact of gender and doctor cues on chatbot trust and acceptance. Internet Res. 34 (1), 123–145. https://doi.org/10.1108/intr-08-2023-0702 (2024).
Li, H., Zhang, R., Lee, Y. C., Kraut, R. E. & Mohr, D. C. Systematic review and meta-analysis of AI-based conversational agents for promoting mental health and well-being. npj Digit. Med. 6 (1), 236. https://doi.org/10.1038/s41746-023-00979-5 (2023).
Pupong, K., Hunsrisakhun, J., Pithpornchaiyakul, S. & Naorungroj, S. Development of chatbot-based oral health care for young children and evaluation of its effectiveness, usability, and acceptability: Mixed methods study. JMIR Pediatr. Parent. 5 (1), e62738. https://doi.org/10.2196/62738 (2025).
Samaranayake, L., Porntaveetus, T., Tsoi, J. & Tuygunov, N. Facts and fallacies of the fluoride controversy: A contemporary perspective. Int. Dent. J. 75 (4), 100833. https://doi.org/10.1016/j.identj.2024.100833 (2025).
Spittle, B. Using artificial intelligence to obtain sufficient and reliable answers to questions about fluoride. Fluoride 57 (2), 197–204 (2024). https://www.fluorideresearch.online/epub/files/233.pdf
ten Cate, J. M. Contemporary perspective on the use of fluoride products in caries prevention. Br. Dent. J. 214 (4), 161–167. https://doi.org/10.1038/sj.bdj.2013.127 (2013).
Weyant, R. J. et al. Topical fluoride for caries prevention: Executive summary of the updated clinical recommendations and supporting systematic review. J. Am. Dent. Assoc. 144 (11), 1279–1291. https://doi.org/10.14219/jada.archive.2013.0057 (2013).
Zhou, Y., Oniani, D., Sreekumar, S., DeAlmeida, R. & Wang, Y. Toward improving health literacy in patient education materials with neural machine translation models. arXiv preprint arXiv:2209.06723. https://arxiv.org/abs/2209.06723 (2022).

Additional Declarations
No competing interests reported.

Editorial decision: Revision requested 04 Sep, 2025
Reviews received at journal 14 Aug, 2025
Reviews received at journal 09 Aug, 2025
Reviewers agreed at journal 09 Aug, 2025
Reviews received at journal 05 Aug, 2025
Reviews received at journal 04 Aug, 2025
Reviewers agreed at journal 04 Aug, 2025
Reviewers agreed at journal 03 Aug, 2025
Reviewers agreed at journal 02 Aug, 2025
Reviewers agreed at journal 02 Aug, 2025
Reviewers agreed at journal 02 Aug, 2025
Reviewers invited by journal 02 Aug, 2025
Editor assigned by journal 02 Aug, 2025
Editor invited by journal 11 Jul, 2025
Submission checks completed at journal 01 Jul, 2025
First submitted to journal 01 Jul, 2025
© Research Square 2026 | ISSN 2693-5015 (online)
{"props":{"pageProps":{"initialData":{"identity":"rs-6993451","acceptedTermsAndConditions":true,"allowDirectSubmit":false,"archivedVersions":[],"articleType":"Article","associatedPublications":[],"authors":[{"id":497324133,"identity":"e759939a-b1e8-4b0c-b590-efbbe458622d","order_by":0,"name":"Nevra Karamüftüoğlu","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAAA4klEQVRIiWNgGAWjYPCCA0DMfAzMZGMnWgsbWxoDQwKQZiZeC48ZWAsDIS3mErmPP3zcc0dOfn7Ptwcff2yT52NmYPzwMQe3FssZ6WaSM549MzY4xrvdcEbCbcM2ZgZmyZnbcGsxOHOMjZnnwOHEDWy826R5Em4zArWwMfPi18L8+c+Bw/Xz23iegbTYE9ZyvI1BmuHA4QSGYzxsIC2JxGhhk+w58Mxww7E0oKfSbie3MTM24/fLYTbmDz8O3JGXbz78TOKDzW3b+e3NBz98xKMFG2BsIE39KBgFo2AUjAIMAABSDU/O9EDXlwAAAABJRU5ErkJggg==","orcid":"","institution":"University of Health Sciences","correspondingAuthor":true,"prefix":"","firstName":"Nevra","middleName":"","lastName":"Karamüftüoğlu","suffix":""},{"id":497324135,"identity":"a54428f5-220a-47fa-a44b-1506616b8a56","order_by":1,"name":"Ezgi Aydın Varol","email":"","orcid":"","institution":"","correspondingAuthor":false,"prefix":"","firstName":"Ezgi","middleName":"Aydın","lastName":"Varol","suffix":""},{"id":497324136,"identity":"819fa00d-e11a-48e0-8227-36eee7d3f3af","order_by":2,"name":"Cenkhan Bal","email":"","orcid":"","institution":"University of Health Sciences","correspondingAuthor":false,"prefix":"","firstName":"Cenkhan","middleName":"","lastName":"Bal","suffix":""}],"badges":[],"createdAt":"2025-06-27 
16:53:12","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-6993451/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-6993451/v1","draftVersion":[],"editorialEvents":[{"content":"https://doi.org/10.1038/s41598-025-28857-y","type":"published","date":"2025-11-29T15:57:21+00:00"}],"editorialNote":"","failedWorkflow":false,"files":[{"id":88754963,"identity":"1703f519-74a4-43b8-bab8-20b83468c718","added_by":"auto","created_at":"2025-08-11 07:09:37","extension":"jpg","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":145162,"visible":true,"origin":"","legend":"\u003cp\u003eFlowchart of the study\u003c/p\u003e","description":"","filename":"Picture1.jpg","url":"https://assets-eu.researchsquare.com/files/rs-6993451/v1/536fe795a4b191232a88f076.jpg"},{"id":88757438,"identity":"a6d8b585-9c3b-4bee-acce-378cace7cf9c","added_by":"auto","created_at":"2025-08-11 07:25:37","extension":"jpg","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":112187,"visible":true,"origin":"","legend":"\u003cp\u003eGeneral and clinical questions about fluoride\u003c/p\u003e","description":"","filename":"Picture2.jpg","url":"https://assets-eu.researchsquare.com/files/rs-6993451/v1/2eb91e67ee308322cfeee9c3.jpg"},{"id":88754967,"identity":"b02af80b-4d51-4fca-973f-1ea34107d7b0","added_by":"auto","created_at":"2025-08-11 07:09:37","extension":"jpg","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":50755,"visible":true,"origin":"","legend":"\u003cp\u003eReliability Score (Adapted from DISCERN) Description (17)\u003c/p\u003e","description":"","filename":"Picture3.jpg","url":"https://assets-eu.researchsquare.com/files/rs-6993451/v1/6cc3b6959d94d012013ac398.jpg"},{"id":88756393,"identity":"217d0ef6-5246-4c42-95ba-276a0c13be4c","added_by":"auto","created_at":"2025-08-11 07:17:37","extension":"jpg","order_by":4,"title":"Figure 
4","display":"","copyAsset":false,"role":"figure","size":93901,"visible":true,"origin":"","legend":"\u003cp\u003eGlobal Quality Score (GQS) Description (18)\u003c/p\u003e","description":"","filename":"Picture4.jpg","url":"https://assets-eu.researchsquare.com/files/rs-6993451/v1/f51293377d8eb7809434a3a1.jpg"},{"id":88756399,"identity":"c6385679-c126-438e-bafa-47a542460514","added_by":"auto","created_at":"2025-08-11 07:17:37","extension":"jpg","order_by":5,"title":"Figure 5","display":"","copyAsset":false,"role":"figure","size":42870,"visible":true,"origin":"","legend":"\u003cp\u003eDistribution of general knowledge scores across ChatGPT-4.o, Gemini, and DeepSeek V3\u003c/p\u003e","description":"","filename":"Picture5.jpg","url":"https://assets-eu.researchsquare.com/files/rs-6993451/v1/5af53938a24f69d8ce4ce027.jpg"},{"id":88754969,"identity":"ac526105-607a-4481-a6ab-fc548fe0c92e","added_by":"auto","created_at":"2025-08-11 07:09:37","extension":"jpg","order_by":6,"title":"Figure 6","display":"","copyAsset":false,"role":"figure","size":37371,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eDistribution of clinical application scores across \u003c/strong\u003eChatGPT-4.o\u003cstrong\u003e, Gemini, and DeepSeek V3\u003c/strong\u003e\u003c/p\u003e","description":"","filename":"Picture6.jpg","url":"https://assets-eu.researchsquare.com/files/rs-6993451/v1/598e235cd5a7c12de3f49cf4.jpg"},{"id":88756402,"identity":"7e1185ee-e693-4d62-8401-f59389365e8a","added_by":"auto","created_at":"2025-08-11 07:17:38","extension":"jpg","order_by":7,"title":"Figure 7","display":"","copyAsset":false,"role":"figure","size":28698,"visible":true,"origin":"","legend":"\u003cp\u003eComparison of EQIP scores among the chatbot 
models\u003c/p\u003e","description":"","filename":"Picture7.jpg","url":"https://assets-eu.researchsquare.com/files/rs-6993451/v1/176a90f617fd0a9e980ae434.jpg"},{"id":88754973,"identity":"326214d6-fa4c-4aa8-b89d-fb50c97f67f7","added_by":"auto","created_at":"2025-08-11 07:09:37","extension":"jpg","order_by":8,"title":"Figure 8","display":"","copyAsset":false,"role":"figure","size":10380,"visible":true,"origin":"","legend":"\u003cp\u003eComparison of DISCERN scores across chatbot models\u003c/p\u003e","description":"","filename":"Picture8.jpg","url":"https://assets-eu.researchsquare.com/files/rs-6993451/v1/16809c0169c76eecdbe111be.jpg"},{"id":88754972,"identity":"10891bc4-39e8-4857-ac65-334faf4257d9","added_by":"auto","created_at":"2025-08-11 07:09:37","extension":"jpg","order_by":9,"title":"Figure 9","display":"","copyAsset":false,"role":"figure","size":37131,"visible":true,"origin":"","legend":"\u003cp\u003eComparison of FKRGL (Flesch-Kincaid Reading Grade Level) scores across chatbot responses.\u003c/p\u003e","description":"","filename":"Picture9.jpg","url":"https://assets-eu.researchsquare.com/files/rs-6993451/v1/4f09bed7e41e5f23d3b44422.jpg"},{"id":88757439,"identity":"dea91ed2-2c07-47e0-9d2f-fce7121c2538","added_by":"auto","created_at":"2025-08-11 07:25:38","extension":"jpg","order_by":10,"title":"Figure 10","display":"","copyAsset":false,"role":"figure","size":36934,"visible":true,"origin":"","legend":"\u003cp\u003eComparison of FRES (Flesch Reading Ease Score) among chatbot responses.\u003c/p\u003e","description":"","filename":"Picture10.jpg","url":"https://assets-eu.researchsquare.com/files/rs-6993451/v1/fa65939a2997f4831983a0b8.jpg"},{"id":97178302,"identity":"0b845218-9b26-42c0-8c4e-3e549b876972","added_by":"auto","created_at":"2025-12-01 
16:07:34","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":1609098,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-6993451/v1/6614833e-702a-453a-b5f5-678deafb3de3.pdf"}],"financialInterests":"No competing interests reported.","formattedTitle":"\u003cp\u003eAi Chatbots for Pediatric Fluoride Education: An Effectiveness Study\u003c/p\u003e","fulltext":[{"header":"1. INTRODUCTION","content":"\u003cp\u003eFluoride remains a cornerstone of preventive dentistry due to its well-established effects on inhibiting demineralization, enhancing remineralization, and reducing cariogenic bacterial activity (ten Cate, \u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e2013\u003c/span\u003e). Community-level strategies such as water fluoridation and fluoride toothpaste use have significantly reduced caries prevalence and are considered both effective and cost-efficient (CDC, 2001; Griffin et al., 2007). In pediatric populations, professionally applied fluoride varnishes and gels are routinely recommended, particularly for children at increased risk of caries (Weyant et al., \u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e2013\u003c/span\u003e).\u003c/p\u003e\u003cp\u003eDespite its proven efficacy, fluoride continues to be subject to public misperceptions and misinformation, particularly in digital environments. Parental concerns, safety hesitancy, and conflicting online narratives often hinder the adoption of fluoride-based preventive measures (Samaranayake et al., \u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e2025\u003c/span\u003e). 
In this context, there is a growing need for trustworthy, accessible, and comprehensible digital resources that can effectively deliver fluoride-related information, especially to caregivers (Spittle, \u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e2024\u003c/span\u003e).\u003c/p\u003e\u003cp\u003eRecent developments in artificial intelligence (AI), particularly large language models (LLMs) like ChatGPT and Google Gemini, have introduced new avenues for digital health communication. These AI chatbots are capable of providing on-demand, semantically rich, and interactive responses that can enhance user understanding of health topics, including oral care (Jin et al., \u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e2024\u003c/span\u003e; Zhou et al., \u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e2022\u003c/span\u003e; Pupong et al., \u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e2025\u003c/span\u003e). Studies have shown that AI tools designed with reliable and user-centered principles can facilitate health behavior change and support decision-making in pediatric care (Li et al., \u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e2023\u003c/span\u003e; Han \u0026amp; Lee, \u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e2023\u003c/span\u003e; Bhattacharya \u0026amp; Pissurlenkar, \u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e2023\u003c/span\u003e).\u003c/p\u003e\u003cp\u003eAs online health-seeking behavior becomes more prevalent\u0026mdash;with over 80% of individuals relying on internet searches for medical guidance\u0026mdash;there is increasing demand for credible and readable content that aligns with evidence-based guidelines (Buldur \u0026amp; Sezer, \u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2024\u003c/span\u003e). LLMs have rapidly gained popularity due to their scalability and natural language generation capabilities. 
Models such as ChatGPT-4.o, Google Gemini Pro, and the newly released DeepSeek V3 are now widely used in health information delivery. These systems differ in architecture, source access, and training data, which may influence the consistency, quality, and originality of their responses (DeepSeek Research Team, \u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e2025\u003c/span\u003e).\u003c/p\u003e\u003cp\u003e While AI chatbots offer promising solutions to enhance oral health literacy, concerns remain about the factual accuracy, ethical transparency, and clinical applicability of their outputs. Therefore, critical evaluation of their performance in conveying fluoride information is essential. This study aims to assess and compare the responses of three advanced AI chatbots\u0026mdash; ChatGPT-4.o (OpenAI, San Francisco, CA, USA), Google Gemini Pro (Google DeepMind), and DeepSeek V3 (Hangzhou DeepSeek AI Co., Ltd.)\u0026mdash;across dimensions of accuracy, reliability, readability, and originality, specifically in the context of pediatric dentistry and caregiver education.\u003c/p\u003e"},{"header":"2. MATERIALS AND METHODS","content":"\u003cdiv id=\"Sec3\" class=\"Section2\"\u003e\u003ch2\u003e2.1. Formulation of Questions on Fluoride\u003c/h2\u003e\u003cp\u003eThis study followed a multi-step methodology to formulate, distribute, and analyze fluoride-related queries presented to the selected chatbot models (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e). Questions regarding fluoride were developed based on the most recent guideline on fluoride published by the American Academy of Pediatric Dentistry (15). 
Before formulating the questions, a preliminary search was conducted using the term \"fluoride\" via the Google search engine (1998, USA) to gain background information and to identify the most common topics of interest among parents.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003e Subsequently, the guideline was reviewed with a focus on the most frequently asked questions by parents, and an initial question set was developed by two pediatric dentists with twenty-five years of clinical experience. This initial set was then reviewed and revised for clarity and relevance by a third pediatric dentist.\u003c/p\u003e\u003cp\u003eThese questions were directed to ChatGPT-4.o, Google Gemini Pro, and DeepSeek V3 without using any guiding prompts. A separate conversation window was opened in each chatbot interface for each question.\u003c/p\u003e\u003cp\u003eThe final set of questions was classified under two main categories:\u003c/p\u003e\u003cp\u003e\u003col\u003e\u003cspan\u003e\u003cli\u003e\u003cp\u003eGeneral information (basic concepts and evidence-based knowledge)\u003c/p\u003e\u003c/li\u003e\u003c/span\u003e\u003cspan\u003e\u003cli\u003e\u003cp\u003eClinical evaluations/clinical applications (practical aspects and procedural details)\u003c/p\u003e\u003c/li\u003e\u003c/span\u003e\u003c/ol\u003e\u003c/p\u003e\u003cp\u003eAll questions were presented to each AI model in independent conversation sessions (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e). This approach was implemented to ensure standardization during the evaluation process.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec4\" class=\"Section2\"\u003e\u003ch2\u003e2.2. Identification of Questions Related to Fluoride\u003c/h2\u003e\u003cp\u003eThe methodological framework of the study is summarized in Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e. 
As no human or animal material was used in this study, ethical approval was not required.\u003c/p\u003e\u003cp\u003eTo gather information related to fluoride, the following AI models were utilized: ChatGPT-4.o, Google Gemini Pro, and DeepSeek V3.\u003c/p\u003e\u003cp\u003eThe aim of the study was to evaluate the accuracy, completeness, and clarity of the information provided by these AI models to patients and parents regarding fluoride.\u003c/p\u003e\u003cp\u003eTo achieve this aim, the AI models\u0026rsquo; ability to respond to targeted questions and engage in simulated patient-provider dialogues was assessed. The evaluation was conducted using predefined questions that reflect commonly asked patient inquiries related to fluoride. The responses generated by the models were systematically analyzed to determine their reliability, comprehensiveness, and potential applicability in patient education and clinical decision-making processes.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec5\" class=\"Section2\"\u003e\u003ch2\u003e2.3. 
Evaluation Criteria\u003c/h2\u003e\u003cp\u003eVarious tools were used to analyze the responses generated by the artificial intelligence models:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eEQIP (Ensuring Quality Information for Patients)\u003c/b\u003e: A tool used to assess the validity and reliability of health-related information (16,17).\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eDISCERN\u003c/b\u003e: A standardized instrument designed to evaluate the credibility and accuracy of medical content (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003e)(18).\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eGlobal Quality Scale (GQS)\u003c/b\u003e: Originally developed to assess the quality of educational video content, this tool was adapted to evaluate AI-generated text responses (19).\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eFlesch Reading Ease Score (FRES) and Flesch-Kincaid Reading Grade Level (FKRGL)\u003c/b\u003e: These tools were used to measure the readability and comprehensibility of the responses (20,21).\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eSimilarity Index (Plagiarism Detection)\u003c/b\u003e: The originality of the responses was assessed using the iThenticate software.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003eThe responses generated by the AI system were independently evaluated by three researchers, all of whom are co-authors of this manuscript. As this study did not involve human participants or personal data, ethical approval was not required. 
A calibration process was conducted to improve inter-rater reliability, and consistency was ensured by calculating the Intraclass Correlation Coefficient (ICC).\u003c/p\u003e\u003cp\u003eIn this study, the Global Quality Scale (GQS) was used as one of the evaluation criteria. Originally developed to assess the educational value of video content, the GQS can also be applied to evaluate the quality of content in various formats. During the evaluation process, factors such as informational level, content quality, consistency, and benefit to patients were taken into account.\u003c/p\u003e\u003cp\u003eThe scoring system ranges from 1 to 5, with 5 representing the highest quality. In this study, the scale was adapted to the text format, and AI-generated responses were analyzed accordingly. A score of 1 indicates low-quality content with insufficient information and little to no value for individuals, while a score of 5 reflects highly consistent, high-quality content that is valuable and beneficial to individuals (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003e) (18).\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003eFRES and FKRGL tests are evaluation methods used to measure the readability and comprehensibility of texts (19,20). 
These tests determine the reading difficulty of an English text based on criteria such as sentence length and the number of syllables in words.\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eFlesch Reading Ease Formula\u003c/b\u003e:\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003cdiv class=\"BlockQuote\"\u003e\u003cp\u003eFRES\u0026thinsp;=\u0026thinsp;206.835\u0026thinsp;\u0026minus;\u0026thinsp;1.015\u0026times;(total words/total sentences)\u0026thinsp;\u0026minus;\u0026thinsp;84.6\u0026times;(total syllables/total words)\u003c/p\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eFlesch-Kincaid Grade Level Formula\u003c/b\u003e:\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003cdiv class=\"BlockQuote\"\u003e\u003cp\u003eFKRGL\u0026thinsp;=\u0026thinsp;0.39\u0026times;(total words/total sentences)\u0026thinsp;+\u0026thinsp;11.8\u0026times;(total syllables/total words)\u0026thinsp;\u0026minus;\u0026thinsp;15.59\u003c/p\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003eThe FRES test scores readability on a scale from 0 to 100. Higher scores indicate easier-to-understand content, while lower scores reflect more complex texts. 
For example:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003e90\u0026ndash;100: Very easy (suitable for 5th-grade students),\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e60\u0026ndash;70: Standard (8th\u0026ndash;9th grade),\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e30\u0026ndash;50: College level,\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e0\u0026ndash;10: Requires professional-level understanding.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003eThe FKRGL test indicates the educational grade level needed to comprehend the text within the U.S. school system:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003e0\u0026ndash;6: Basic reading level,\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e6\u0026ndash;12: Intermediate,\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e12\u0026ndash;18: Advanced reading level.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003eBoth tests assume that longer sentences and multi-syllabic words reduce readability, while short and simple sentences enhance comprehension. Microsoft recommends a FRES score of 60\u0026ndash;70 and a FKRGL score of 7.0\u0026ndash;8.0 for standard documents (22).\u003c/p\u003e\u003cp\u003eThe Similarity Index was used to determine the extent to which AI-generated responses overlapped with existing textual content from various databases. The primary goal was to detect possible instances of plagiarism and assess the originality of the responses. 
All responses were uploaded to iThenticate (\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttp://www.ithenticate.com\u003c/span\u003e\u003cspan address=\"http://www.ithenticate.com\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e) software, and similarity ratios were calculated as percentages.\u003c/p\u003e\u003cp\u003eSimilarity levels were categorized as follows:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003e0\u0026ndash;10%: Highly original\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e10\u0026ndash;20%: Acceptable similarity\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e20\u0026ndash;40%: High similarity\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e40\u0026ndash;100%: Very high similarity\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003eThe responses to the 20 fluoride-related questions generated by the AI models were analyzed according to predetermined criteria by three researchers. A calibration process was implemented to enhance inter-rater reliability. During this process, all criteria were thoroughly explained to the evaluators, and a shared interpretation method was developed. After training, responses to 10 questions (excluded from the study) were evaluated, and 14 days later, the same responses were re-evaluated and compared.\u003c/p\u003e\u003cp\u003eDuring the calibration process, consistency between the scores assigned by the two evaluators was assessed using the Intraclass Correlation Coefficient (ICC). Additionally, similarity in repeated measurements across different times and raters was analyzed using the test\u0026ndash;retest method and evaluated again with the ICC. The results showed that both intra- and inter-observer ICC values were above 0.700, indicating that all three evaluators were competent to conduct the study.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec6\" class=\"Section2\"\u003e\u003ch2\u003e2.4. 
Statistical Analysis\u003c/h2\u003e\u003cp\u003eAll statistical analyses were performed using IBM SPSS Statistics version 25.0 (IBM Corp., Armonk, NY, USA). Descriptive statistics, including mean, standard deviation (SD), median, and interquartile range (IQR), were computed for each chatbot model (ChatGPT, Gemini, and DeepSeek) across six evaluation domains: EQIP, DISCERN, GQS, FRES, FKRGL, and Similarity Index.\u003c/p\u003e\u003cp\u003eNormality of data distribution was assessed using the Shapiro-Wilk test, while Levene\u0026rsquo;s test was used to evaluate the homogeneity of variances. For datasets that satisfied parametric assumptions, one-way analysis of variance (ANOVA) was employed, followed by Tukey\u0026rsquo;s Honestly Significant Difference (HSD) post-hoc test for multiple comparisons. In cases where parametric assumptions were violated, the Kruskal-Wallis test was applied, with Bonferroni-adjusted Mann\u0026ndash;Whitney U tests conducted post hoc to determine pairwise differences.\u003c/p\u003e\u003cp\u003eResults with \u003cem\u003ep\u003c/em\u003e\u0026thinsp;\u0026lt;\u0026thinsp;0.05 were considered statistically significant for all analyses. The selection of appropriate tests for each metric was based on distributional characteristics. 
Accordingly:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003eANOVA was used for FRES and EQIP scores.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eKruskal-Wallis was applied for FKRGL, DISCERN, and Similarity Index.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eTukey HSD was used post hoc for EQIP scores.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eMann\u0026ndash;Whitney U with Bonferroni correction was used post hoc for FKRGL and DISCERN.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003cp\u003eThese procedures enabled a rigorous comparison of chatbot performance across various aspects of quality, reliability, readability, and originality.\u003c/p\u003e\u003c/div\u003e"},{"header":"3. RESULTS","content":"\u003cdiv id=\"Sec8\" class=\"Section2\"\u003e\u003ch2\u003e3.1. Descriptive Analysis of General and Clinical Responses\u003c/h2\u003e\u003cp\u003eDescriptive statistics for the chatbot responses were summarized under two main domains: general knowledge questions and clinical application questions. These included mean, standard deviation (SD), median, and interquartile ranges (Q1\u0026ndash;Q3).\u003c/p\u003e\u003cp\u003e\u003cb\u003eGeneral Questions\u003c/b\u003e:\u003c/p\u003e\u003cp\u003eThe highest mean score for general questions was observed in the ChatGPT-4.o model (M\u0026thinsp;=\u0026thinsp;4.32, SD\u0026thinsp;=\u0026thinsp;0.43), followed by Gemini (M\u0026thinsp;=\u0026thinsp;3.74, SD\u0026thinsp;=\u0026thinsp;0.66) and DeepSeek V3 (M\u0026thinsp;=\u0026thinsp;3.48, SD\u0026thinsp;=\u0026thinsp;0.57). 
The median values followed a similar trend.\u003c/p\u003e\u003cp\u003e\u003cb\u003eClinical Questions\u003c/b\u003e:\u003c/p\u003e\u003cp\u003eChatGPT-4.o also scored highest on clinical questions (M\u0026thinsp;=\u0026thinsp;4.20, SD\u0026thinsp;=\u0026thinsp;0.48), with Gemini and DeepSeek trailing behind.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec9\" class=\"Section2\"\u003e\u003ch2\u003e3.2. Comparative Evaluation Across Quality Metrics\u003c/h2\u003e\u003cp\u003eTo assess the overall performance of the chatbot models across multiple dimensions, statistical analyses were conducted for the following metrics: EQIP, DISCERN, GQS, FRES, FKRGL, and Similarity Index.\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eFRES and Similarity Index\u003c/b\u003e: No statistically significant differences were observed among the three models (ANOVA for FRES, \u003cem\u003ep\u003c/em\u003e\u0026thinsp;=\u0026thinsp;0.12; Kruskal-Wallis for Similarity Index, \u003cem\u003ep\u003c/em\u003e\u0026thinsp;=\u0026thinsp;0.54).\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eFKRGL\u003c/b\u003e: A significant difference was found (Kruskal-Wallis, \u003cem\u003ep\u003c/em\u003e\u0026thinsp;=\u0026thinsp;0.041), though post-hoc comparisons did not reach statistical significance after Bonferroni correction.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eEQIP\u003c/b\u003e: ChatGPT-4.o outperformed both DeepSeek and Gemini (\u003cem\u003ep\u003c/em\u003e\u0026thinsp;\u0026lt;\u0026thinsp;0.001), confirmed by Tukey HSD post-hoc analysis.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003e\u003cb\u003eDISCERN\u003c/b\u003e: Significant differences were found (Kruskal-Wallis, \u003cem\u003ep\u003c/em\u003e\u0026thinsp;\u0026lt;\u0026thinsp;0.001), with ChatGPT-4.o again outperforming the others in pairwise comparisons.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec10\" 
class=\"Section2\"\u003e\u003ch2\u003e3.3. Summary of Pairwise Differences\u003c/h2\u003e\u003cp\u003ePost-hoc analyses confirmed that:\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003eChatGPT-4.o was superior to both Gemini and DeepSeek in terms of EQIP and DISCERN scores (\u003cem\u003ep\u003c/em\u003e\u0026thinsp;\u0026lt;\u0026thinsp;0.05).\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eThere were no significant differences between Gemini and DeepSeek in these metrics.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eFKRGL differences were marginal and not statistically significant after correction.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec11\" class=\"Section2\"\u003e\u003ch2\u003e3.4. Graphical Representation\u003c/h2\u003e\u003cp\u003eFigure 5 and Figure 6 illustrate the distribution of chatbot performance across general and clinical domains.\u0026nbsp;\u003cstrong\u003eFigure 7 illustrates the distribution of EQIP (Ensuring Quality Information for Patients) scores across the three chatbot models: ChatGPT, DeepSeek, and Gemini.\u003c/strong\u003e The box plot displays the median values, interquartile ranges, and potential outliers for each model, allowing a comparative evaluation of the quality of patient-directed health information provided by these AI systems.\u0026nbsp;\u003cstrong\u003eFigure 8 demonstrates the comparative distribution of DISCERN scores among ChatGPT, DeepSeek, and Gemini.\u003c/strong\u003e The results indicate that ChatGPT achieved substantially higher median and overall scores, suggesting superior reliability and quality of health information related to treatment choices. In contrast, DeepSeek and Gemini performed similarly, but with lower scores and narrower ranges. 
\u003cstrong\u003eFigure 9 presents the Flesch-Kincaid Reading Grade Level (FKRGL) scores of the chatbot-generated responses.\u0026nbsp;\u003c/strong\u003eAmong the models, DeepSeek and Gemini exhibited higher median FKRGL scores, indicating that their outputs were written at a more advanced reading level. ChatGPT's responses, on the other hand, were more accessible, requiring a lower grade level of comprehension. While DeepSeek showed greater score variability, Gemini maintained a narrower and more consistent range. \u003cstrong\u003eFigure 10 illustrates the Flesch Reading Ease Score (FRES) distributions across the chatbot models.\u003c/strong\u003e ChatGPT achieved the highest median FRES values, indicating that its responses were the most readable. DeepSeek showed moderate readability scores with a wider range, while Gemini’s lower FRES values suggest that its responses were more difficult to comprehend. These findings imply that ChatGPT may be more suitable for delivering easily understandable health information.\u003c/p\u003e\n\u003c/div\u003e"},{"header":"4. DISCUSSION","content":"\u003cp\u003eThis study presents a comprehensive evaluation of three AI-based chatbot models\u0026mdash; ChatGPT-4.o, Google Gemini Pro, and DeepSeek V3\u0026mdash;regarding their performance in communicating fluoride-related information to pediatric patients and their caregivers. The findings reveal that ChatGPT-4.o consistently outperformed other models in both general knowledge and clinical application domains, particularly in EQIP and DISCERN metrics, which assess informational quality and reliability. 
Although no statistically significant differences were observed in FRES and Similarity Index scores, a notable trend emerged indicating ChatGPT's superiority in delivering readable and original content.\u003c/p\u003e\u003cp\u003eThese findings align with previous literature suggesting that large language models (LLMs) based on GPT architecture are capable of generating coherent, informative, and medically relevant responses when appropriately prompted (Lyu et al., 2024; Haque et al., 2023). The high EQIP and DISCERN scores obtained by ChatGPT-4.o reflect its strength in presenting balanced, evidence-based, and clearly structured content\u0026mdash;qualities essential for patient and caregiver education. In contrast, Gemini Pro and DeepSeek V3 showed moderate performance, often falling short in completeness and clinical applicability.\u003c/p\u003e\u003cp\u003eThe variation in performance across AI models underscores the importance of model selection in digital health applications. Chatbots may serve as accessible, low-cost tools for augmenting oral health literacy, especially in underserved communities. Pupong et al. (\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e2025\u003c/span\u003e) demonstrated that a chatbot developed for young children\u0026rsquo;s oral care was well accepted by parents and improved health behavior outcomes when evaluated using a mixed-methods approach. Their findings support the feasibility of chatbot use in pediatric oral health education when systems are tailored to user needs and include behaviorally oriented language.\u003c/p\u003e\u003cp\u003eSimilarly, Han and Lee (\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e2023\u003c/span\u003e) reported that AI-driven conversational agents significantly improved adherence to health-promoting behaviors such as routine hygiene, suggesting a possible translation to fluoride application and brushing habits in children. These results are consistent with Li et al. 
(\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e2023\u003c/span\u003e), who showed that chatbots could improve mental well-being and engagement in self-care routines.\u003c/p\u003e\u003cp\u003eFrom a trust perspective, Jin et al. (\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e2024\u003c/span\u003e) highlighted that gendered cues and personalization in chatbot design increase acceptance, especially among female caregivers\u0026mdash;a finding with implications for pediatric oral health campaigns. Moreover, efforts to align chatbot content with plain language and health literacy standards can help prevent miscommunication, particularly in vulnerable populations (Zhou et al., \u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e2022\u003c/span\u003e).\u003c/p\u003e\u003cp\u003eFinally, Bhattacharya and Pissurlenkar (\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e2023\u003c/span\u003e) underscore that chatbots in healthcare must be rigorously validated to avoid spreading misinformation, especially in areas like fluoride use where public perception is polarized. This calls for integrating clinical oversight and transparency into AI systems.\u003c/p\u003e\u003cp\u003eBuldur and Sezer (\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2024\u003c/span\u003e) found that ChatGPT-4 provided content that was comparable in quality to official FDA guidance when responding to frequently asked questions about dental amalgam, indicating a promising degree of informational alignment. However, they also noted that ChatGPT's phrasing and perspectives occasionally differed from regulatory language, highlighting the importance of content review.\u003c/p\u003e\u003cp\u003eSimilarly, Gugnani et al. 
(\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e2024\u003c/span\u003e) evaluated ChatGPT\u0026rsquo;s answers from a parental perspective on pediatric oral health and found that while responses were generally clear and logical, gaps in clinical nuance and missing preventive advice were common. This reinforces our findings that ChatGPT-4 may be suitable as a supplementary educational tool but should not replace expert guidance.\u003c/p\u003e\u003cp\u003eDespite the potential of AI chatbots in health education, their use must be approached with caution. One concern is the risk of misinformation, especially when responses lack source attribution or contain outdated guidance. Additionally, the similarity index revealed that while most responses were original, certain overlaps with existing online content still exist. Both studies support the notion that although ChatGPT-4 demonstrates consistency and surface-level accuracy, its role in healthcare communication requires further validation through expert and interdisciplinary scrutiny. Thus, relying solely on chatbot-generated content for health decision-making may not be advisable without professional oversight.\u003c/p\u003e\u003cp\u003eThe present study\u0026rsquo;s strengths include its systematic evaluation across multiple quality metrics, the use of standardized questions, and rigorous inter-rater reliability procedures. However, limitations must also be acknowledged. First, the study focused exclusively on English-language responses, which may not reflect performance in other languages. Second, the analysis was limited to fluoride-related content, and broader health topics may yield different results. 
Lastly, the simulated nature of the chatbot interactions may not fully capture real-world variability in user queries and follow-up questions.\u003c/p\u003e\u003cp\u003eFuture research should expand to include multilingual evaluations, assess responses to more complex patient scenarios, and explore the integration of chatbot tools within clinical workflows. Furthermore, collaborations between AI developers and health professionals could lead to the development of specialized medical models optimized for patient education and safety.\u003c/p\u003e\u003cp\u003eIn conclusion, while AI-based chatbots\u0026mdash;especially ChatGPT-4.o \u0026mdash;demonstrate promising capabilities in conveying fluoride-related information, their deployment in clinical or educational settings should be supported by validation, regulation, and ethical oversight to ensure safety and effectiveness.\u003c/p\u003e"},{"header":"5. CONCLUSION","content":"\u003cp\u003eThis study offers a novel and methodologically robust evaluation of how artificial intelligence chatbots convey fluoride-related information to pediatric patients and caregivers. Among the models assessed, ChatGPT-4.o consistently demonstrated superior performance across multiple metrics of quality, reliability, and readability. Its higher EQIP and DISCERN scores suggest that it is more capable of delivering accurate, evidence-based, and user-friendly health information than Gemini Pro or DeepSeek V3.\u003c/p\u003e\u003cp\u003eHowever, while ChatGPT-4.o shows potential as a digital health education tool, our findings also highlight the limitations and risks associated with relying on AI-generated content in clinical contexts. 
Variability in response quality, lack of clinical nuance, and moderate reading complexity underscore the need for cautious implementation and professional validation.\u003c/p\u003e\u003cp\u003e Ultimately, AI chatbots may serve as valuable adjuncts to professional guidance in promoting oral health literacy, especially in settings where access to care is limited. To fully realize their potential, future efforts should prioritize model refinement, transparency of sources, and user-centered design tailored to diverse literacy levels.\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eEthics Approval and Consent to Participate\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThis study did not involve human participants, personal data, or biological materials. It was based solely on the evaluation of publicly available AI-generated content. Therefore, ethics committee approval or Institutional Review Board (IRB) review was not required.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eFunding Declaration\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThis research received no external funding.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eHuman Ethics and Consent to Participate Declarations\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eHuman Ethics and Consent to Participate declarations: not applicable\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAuthor Contributions\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eAll authors contributed significantly to the development of this manuscript.\u003c/p\u003e\n\u003cul type=\"disc\"\u003e\n \u003cli\u003e\u003cstrong\u003eNevra Karamüftüoğlu\u003c/strong\u003e: Conceptualization, methodology, data analysis, writing—original draft preparation.\u003c/li\u003e\n \u003cli\u003e\u003cstrong\u003eEzgi Aydın Varol\u003c/strong\u003e: Data collection, software implementation, review and editing.\u003c/li\u003e\n \u003cli\u003e\u003cstrong\u003eCenkhan Bal\u003c/strong\u003e: Evaluation framework design, 
statistical validation, visualization.\u003cbr\u003e\u0026nbsp;All authors read and approved the final version of the manuscript.\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003e\u003cstrong\u003eCompeting Interests\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe authors declare that they have no financial or non-financial competing interests related to the content of this manuscript.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eData Availability\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eBhattacharya, B. S. \u0026amp; Pissurlenkar, V. S. Assistive chatbots for healthcare: A succinct review. \u003cem\u003earXiv preprint arXiv:2308.04178\u003c/em\u003e. (2023). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://arxiv.org/abs/2308.04178\u003c/span\u003e\u003cspan address=\"https://arxiv.org/abs/2308.04178\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eBuldur, M. \u0026amp; Sezer, B. Evaluating the accuracy of Chat Generative Pre-trained Transformer version 4 (ChatGPT-4) responses to United States Food and Drug Administration (FDA) frequently asked questions about dental amalgam. \u003cem\u003eBMC Oral Health\u003c/em\u003e. \u003cb\u003e24\u003c/b\u003e, 605. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1186/s12903-024-04358-8\u003c/span\u003e\u003cspan address=\"10.1186/s12903-024-04358-8\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2024).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eCenters for Disease Control and Prevention (CDC). Recommendations for using fluoride to prevent and control dental caries in the United States. 
\u003cem\u003eMorb. Mortal. Wkly Rep.\u003c/em\u003e \u003cb\u003e50\u003c/b\u003e (RR14), 1\u0026ndash;42 (2001).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eDeepSeek Research Team. \u003cem\u003eDeepSeek-LLM: A family of open LLMs and math reasoning models\u003c/em\u003e [arXiv preprint]. arXiv. (2025). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://arxiv.org/pdf/2501.12948\u003c/span\u003e\u003cspan address=\"https://arxiv.org/pdf/2501.12948\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eGugnani, N., Pandit, I. K., Gupta, M., Gugnani, S. \u0026amp; Kathuria, S. Parental concerns about oral health of children: Is ChatGPT helpful in finding appropriate answers? \u003cem\u003eJ. Indian Soc. Pedod. Prev. Dentistry\u003c/em\u003e. \u003cb\u003e42\u003c/b\u003e (2), 104\u0026ndash;111. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.4103/jisppd.jisppd_110_24\u003c/span\u003e\u003cspan address=\"10.4103/jisppd.jisppd_110_24\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2024).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eHan, M. \u0026amp; Lee, J. Systematic review and meta-analysis of the effectiveness of chatbots for promoting physical activity, healthy eating, and sleep. \u003cem\u003enpj Digit. Med.\u003c/em\u003e \u003cb\u003e6\u003c/b\u003e (1), 236. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1038/s41746-023-00979-5\u003c/span\u003e\u003cspan address=\"10.1038/s41746-023-00979-5\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2023).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eJin, E., Ryoo, Y., Kim, W. J. \u0026amp; Song, Y. G. 
Bridging the health literacy gap through AI chatbot design: The impact of gender and doctor cues on chatbot trust and acceptance. \u003cem\u003eInternet Res.\u003c/em\u003e \u003cb\u003e34\u003c/b\u003e (1), 123\u0026ndash;145. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1108/intr-08-2023-0702\u003c/span\u003e\u003cspan address=\"10.1108/intr-08-2023-0702\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2024).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eLi, H., Zhang, R., Lee, Y. C., Kraut, R. E. \u0026amp; Mohr, D. C. Systematic review and meta-analysis of AI-based conversational agents for promoting mental health and well-being. \u003cem\u003enpj Digit. Med.\u003c/em\u003e \u003cb\u003e6\u003c/b\u003e (1), 236. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1038/s41746-023-00979-5\u003c/span\u003e\u003cspan address=\"10.1038/s41746-023-00979-5\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2023).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003ePupong, K., Hunsrisakhun, J., Pithpornchaiyakul, S. \u0026amp; Naorungroj, S. Development of chatbot-based oral health care for young children and evaluation of its effectiveness, usability, and acceptability: Mixed methods study. \u003cem\u003eJMIR Pediatr. Parent.\u003c/em\u003e \u003cb\u003e5\u003c/b\u003e (1), e62738. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.2196/62738\u003c/span\u003e\u003cspan address=\"10.2196/62738\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2025).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eSamaranayake, L., Porntaveetus, T., Tsoi, J. \u0026amp; Tuygunov, N. Facts and fallacies of the fluoride controversy: A contemporary perspective. \u003cem\u003eInt. Dent. 
J.\u003c/em\u003e \u003cb\u003e75\u003c/b\u003e (4), 100833. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/j.identj.2024.100833\u003c/span\u003e\u003cspan address=\"10.1016/j.identj.2024.100833\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2025).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eSpittle, B. Using artificial intelligence to obtain sufficient and reliable answers to questions about fluoride. \u003cem\u003eFluoride\u003c/em\u003e \u003cb\u003e57\u003c/b\u003e (2), 197\u0026ndash;204 (2024). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://www.fluorideresearch.online/epub/files/233.pdf\u003c/span\u003e\u003cspan address=\"https://www.fluorideresearch.online/epub/files/233.pdf\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eten Cate, J. M. Contemporary perspective on the use of fluoride products in caries prevention. \u003cem\u003eBr. Dent. J.\u003c/em\u003e \u003cb\u003e214\u003c/b\u003e (4), 161\u0026ndash;167. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1038/sj.bdj.2013.127\u003c/span\u003e\u003cspan address=\"10.1038/sj.bdj.2013.127\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2013).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eWeyant, R. J. et al. Topical fluoride for caries prevention: Executive summary of the updated clinical recommendations and supporting systematic review. \u003cem\u003eJ. Am. Dent. Assoc.\u003c/em\u003e \u003cb\u003e144\u003c/b\u003e (11), 1279\u0026ndash;1291. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.14219/jada.archive.2013.0057\u003c/span\u003e\u003cspan address=\"10.14219/jada.archive.2013.0057\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e (2013).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eZhou, Y., Oniani, D., Sreekumar, S., DeAlmeida, R. \u0026amp; Wang, Y. Toward improving health literacy in patient education materials with neural machine translation models. \u003cem\u003earXiv preprint arXiv:2209.06723\u003c/em\u003e. https://arxiv.org/abs/2209.06723 (2022).\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":false,"highlight":"","institution":"","isAcceptedByJournal":true,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"
[email protected]","identity":"scientific-reports","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"scirep","sideBox":"Learn more about [Scientific Reports](http://www.nature.com/srep/)","snPcode":"","submissionUrl":"","title":"Scientific Reports","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"stoa","reportingPortfolio":"Scientific Reports","inReviewEnabled":true,"inReviewRevisionsEnabled":true},"keywords":"Artificial Intelligence, Chatbots, Fluoride, Pediatric Dentistry","lastPublishedDoi":"10.21203/rs.3.rs-6993451/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-6993451/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003ch2\u003eBackground:\u003c/h2\u003e\u003cp\u003eFluoride is a cornerstone of preventive pediatric dentistry, yet public concerns and online misinformation continue to undermine its acceptance. With the rise of large language model-based (LLM) chatbots, artificial intelligence (AI) tools have emerged as potential resources for delivering accessible, evidence-based health information.\u003c/p\u003e\u003ch2\u003eObjective:\u003c/h2\u003e\u003cp\u003eThis study aims to evaluate the performance of three advanced AI chatbots\u0026mdash;ChatGPT-4.o, Google Gemini Pro, and DeepSeek V3\u0026mdash;in providing fluoride-related information to parents and caregivers, with a specific focus on pediatric dental health.\u003c/p\u003e\u003ch2\u003eMethods:\u003c/h2\u003e\u003cp\u003eTwenty frequently asked fluoride-related questions were presented to each chatbot in standardized sessions. Responses were assessed by three blinded evaluators using validated tools: EQIP, DISCERN, Global Quality Scale (GQS), Flesch Reading Ease Score (FRES), Flesch-Kincaid Reading Grade Level (FKRGL), and iThenticate similarity index. Inter-rater reliability was ensured via intraclass correlation coefficients (ICCs). 
Statistical analysis was performed using ANOVA or Kruskal\u0026ndash;Wallis tests, with appropriate post-hoc methods.\u003c/p\u003e\u003ch2\u003eResults:\u003c/h2\u003e\u003cp\u003eChatGPT-4.o outperformed the other models in EQIP and DISCERN scores (p\u0026thinsp;\u0026lt;\u0026thinsp;0.001), indicating higher reliability and informational quality. While FRES and Similarity Index showed no significant differences, ChatGPT produced more readable and original content. All three models showed moderate variability in FKRGL and GQS outcomes.\u003c/p\u003e\u003ch2\u003eConclusion:\u003c/h2\u003e\u003cp\u003eAmong the evaluated AI chatbots, ChatGPT-4.o demonstrated superior performance in conveying fluoride-related information in a clear, reliable, and evidence-based manner. While promising as educational tools in pediatric oral health, these models should be complemented with professional oversight to ensure accuracy and appropriateness in clinical use.\u003c/p\u003e","manuscriptTitle":"Ai Chatbots for Pediatric Fluoride Education: An Effectiveness Study","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2025-08-11 07:09:32","doi":"10.21203/rs.3.rs-6993451/v1","editorialEvents":[{"type":"communityComments","content":0},{"type":"decision","content":"Revision 
requested","date":"2025-09-04T06:02:46+00:00","index":"","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2025-08-14T06:02:29+00:00","index":"hide","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2025-08-09T16:06:47+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"215488240589537909226673548905489544381","date":"2025-08-09T14:54:14+00:00","index":"hide","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2025-08-05T15:19:59+00:00","index":"hide","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2025-08-04T14:17:45+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"171477369826590864375758949523919143591","date":"2025-08-04T04:13:28+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"190881917131505969449608585522612039215","date":"2025-08-03T17:54:29+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"203664839002222863652113848474309662390","date":"2025-08-03T01:58:00+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"203955717876583815191224209799213661833","date":"2025-08-02T17:09:58+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"86848224410180683031252827566367295649","date":"2025-08-02T15:21:00+00:00","index":"hide","fulltext":""},{"type":"reviewersInvited","content":"","date":"2025-08-02T15:13:01+00:00","index":"","fulltext":""},{"type":"editorAssigned","content":"","date":"2025-08-02T15:10:10+00:00","index":"","fulltext":""},{"type":"editorInvited","content":"","date":"2025-07-11T08:52:45+00:00","index":"","fulltext":""},{"type":"checksComplete","content":"","date":"2025-07-01T12:01:33+00:00","index":"","fulltext":""},{"type":"submitted","content":"Scientific Reports","date":"2025-07-01T11:58:38+00:00","index":"","fulltext":""}],"status":"published","journal":{"display":true,"email":"
[email protected]","identity":"scientific-reports","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"scirep","sideBox":"Learn more about [Scientific Reports](http://www.nature.com/srep/)","snPcode":"","submissionUrl":"","title":"Scientific Reports","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"stoa","reportingPortfolio":"Scientific Reports","inReviewEnabled":true,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"f08f2430-8346-4780-b8bb-30452941f4a0","owner":[],"postedDate":"August 11th, 2025","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"published-in-journal","subjectAreas":[{"id":52890911,"name":"Health sciences/Health care"},{"id":52890912,"name":"Health sciences/Medical research"}],"tags":[],"updatedAt":"2025-12-01T16:00:49+00:00","versionOfRecord":{"articleIdentity":"rs-6993451","link":"https://doi.org/10.1038/s41598-025-28857-y","journal":{"identity":"scientific-reports","isVorOnly":false,"title":"Scientific Reports"},"publishedOn":"2025-11-29 15:57:21","publishedOnDateReadable":"November 29th, 2025"},"versionCreatedAt":"2025-08-11 07:09:32","video":"","vorDoi":"10.1038/s41598-025-28857-y","vorDoiUrl":"https://doi.org/10.1038/s41598-025-28857-y","workflowStages":[]},"version":"v1","identity":"rs-6993451","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-6993451","identity":"rs-6993451","version":["v1"]},"buildId":"8U1c8b4HqxoKbykW_rLl7","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}