Poetic or Prosaic? Evaluating the Linguistic Quality of AI-Generated Draft Replies to Patient Portal Messages
Research Square
Gavin Hui, Laura Prichard, Taylor Martin, Sitaram Vangala, Joshua Khalili, and 3 more
This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-7909272/v1
This work is licensed under a CC BY 4.0 License
Status: Under Review
Version 1 posted 13
Abstract
Background
The use of generative artificial intelligence (genAI) in healthcare is increasing, including the use of GPT-generated draft replies (GDRs) to patient messages via Epic Systems’ electronic health record (EHR). We evaluated GDR use, quality, and impact in a large academic health system.
Methods
Thirty primary care physicians received GDRs from September 2023 to August 2024 during a staged rollout. Messages were grouped into baseline (GDRs not shown) and intervention (GDRs used).
We evaluated messages using BLEU, ROUGE, cosine similarity, BERTScore, token counts, and Flesch Reading Ease. We compared these metrics between baseline and intervention groups and across prompt refinement phases (Phases 2–4 vs. Phase 1). Blinded evaluations of message quality were conducted via surveys, and BERTScores were correlated with physician evaluations on effectiveness, misunderstanding, and harm.
Results
Of 66,200 GDRs generated, 21,073 were presented, and 2,264 (11%) were used. Used GDRs showed alignment with final messages [BLEU 0.49 (95% CI: 0.43–0.56), ROUGE-L 0.60 (0.54–0.66)], with high BERTScores (F1 > 0.9). Final messages were longer and more readable. Prompt refinements increased token retention. GDR usage declined over time, yet providers reported time savings and reduced cognitive load. BERTScores correlated strongly with physician feedback on effectiveness and safety in the intervention group.
Conclusions
GPT-generated drafts show strong semantic alignment with physician messages and may support efficient communication. However, usage trends and readability challenges underscore the need for improved prompt design and better workflow integration. Quantitative metrics like BERTScore, when paired with physician feedback, offer a scalable framework for evaluating AI-assisted messaging in healthcare.
Biological sciences/Computational biology and bioinformatics; Health sciences/Health care; Physical sciences/Mathematics and computing; Health sciences/Medical research
Generative AI; Electronic health records; Clinical communication; Natural language processing
Introduction
The integration of artificial intelligence (AI) into healthcare has ushered in a new era of innovation, offering promising solutions to longstanding challenges faced by healthcare providers.[1, 2] As healthcare systems have digitized, electronic health records (EHRs) have become indispensable tools for managing patient information and facilitating communication.[3] However, EHRs have increased administrative tasks, contributing to clinician burnout and patient care challenges.[4–7] Patient portal messages, a primary contributor to increased ‘pajama time’, have risen substantially with the expansion of telehealth during the global pandemic.[8, 9] In recent years, AI technologies have emerged as transformative tools, offering opportunities to streamline workflows and enhance communication efficiency.[10, 11] In this context, the advent of genAI, which uses large language models (LLMs) and natural language processing (NLP), presents a promising opportunity to alleviate the burden of administrative tasks and enhance the overall efficiency of healthcare delivery. Powered by transformer neural network architecture, LLMs have already demonstrated astonishing capabilities in healthcare.
They can now complete documentation, such as prior authorizations, outperform humans on medical licensing exams and interpret electrocardiograms.[ 12 – 14 ] In non-clinical scenarios, ChatGPT responses have demonstrated superior empathy and better quality than physician responses to patient messages posted to a social media forum, and more “balanced, complete, empathetic, and helpful” counseling than widely known professional advice columnists.[ 15 , 16 ] However, early real-world experiences with Epic’s Augmented Response Technology have shown mixed results – no significant time savings, but paradoxically, a subjective sense of relief from EHR charting burden and an appreciation for the potential value of the tool.[ 17 , 18 ] Given early reports of low usage and limited efficiency gains, but suggestions of qualitative value of GDRs, we sought to illuminate what constitutes an “acceptable” AI-generated message. We aimed to provide a comprehensive perspective on evaluating genAI tools in clinical practice. By integrating traditional NLP metrics with human evaluation, including physician surveys, stratified review, and end-user feedback, we sought to inform real-world implementation strategies. Methods Study Design UCLA Health was an early adopter of an AI-generated draft response (GDR) pilot tool. Starting September 27th, 2023, nine primary care physicians (PCPs) across six outpatient clinics were sequentially added as pilot users, and all had access by November 2023. Initial users were selected based on EHR aptitude (i.e., physician informaticists) and message volume. Since GDRs are generated for all PCPs in a specific clinic (whether made visible or not), the expansion users included the remaining physicians in each pilot clinic. On February 22, 2024, 21 expansion users were added; two later left practice. Education consisted of live and recorded webinars and tip sheets. Once activated, providers began receiving GDRs for online portal questions. 
All providers gave verbal consent. This study was reviewed by the UCLA Institutional Review Board (Office of the Human Research Protection Program) and deemed exempt under institutional policy IRB# 24-001342. All procedures were conducted in accordance with relevant guidelines and regulations, including the Declaration of Helsinki and its later amendments.
GDRs were generated if the message fell into one of four Epic-managed categories: Medication, Paperwork, Results, and General. Patients sending messages had to be 18 or older and could not be a proxy (e.g., a caregiver). UCLA Health had the ability to draft and edit four separate prompts, each tailored to a category. We used Epic’s suggested starter prompts, offered to all participating systems. An example prompt is shown in Fig. 1.
Four prompt edits occurred during the study period:
10/02/23: “Do not include a signoff. Let the provider end the response.”
11/06/23: “Limit the response to a maximum of 100 words.”
01/28/24: Removed a SmartLink (native Epic function) mapped to patient-specific appointment data and deleted text in the prompt referencing the word ‘appointment.’
05/09/24: Added a SmartLink to include the text of the message ‘Subject’ line in the prompt.
We analyzed LLM/NLP metrics, token count, and Flesch reading-ease across four phases defined by prompt changes (Phases 1–4).
Data Source and Characteristics
Data included the patient message, the unedited GDR, and the final physician message sent to the patient. Due to tool design, GDRs were generated for all physicians in a clinic, regardless of individual activation. We extracted a dataset of GDRs not shown to providers, representing the ‘baseline’ group. In this group, GPT and physician responses were independently generated for the same message, allowing for a comparison between unsupervised, unedited AI drafts and physician-written messages.
A separate ‘intervention’ group included GDRs that were shown, edited, and sent by providers.
Data Pre-processing, Cleaning, and Manipulation
Text Pre-Processing
We applied text pre-processing steps from the Python Natural Language Toolkit (nltk library) and custom functions. For bag-of-words and cosine similarity, we split the messages into cleaned tokens using the following steps: removal of common signatures and rich text; expansion of contracted words (e.g., I’ll > I will); conversion to lowercase; and removal of punctuation and stop words. Lemmatization was not used, as word forms were relevant for comparison. We calculated the percentage of GDR tokens used and the percentage of final messages composed of GDR tokens. Cosine similarity was computed by vectorizing the tokens using the term frequency-inverse document frequency (TF-IDF) vectorizer from the sklearn Python library.
Metric Calculation
For calculating BLEU, ROUGE, and BERTScore, we applied basic cleaning (removal of signatures and rich text) and no other pre-processing. We derived the BLEU and ROUGE scores using the Python evaluate library, and BERTScore precision, recall, and F1 metrics using the bert_score library. For Flesch-Kincaid readability, we used basic cleaning and calculated average syllables per word and words per sentence using nltk functions.
Message Comparison Metrics
To assess lexical, syntactic, and semantic overlap between GDRs and final patient messages, we used a combination of traditional NLP and LLM-based metrics: cosine similarity, BLEU, ROUGE, and BERTScore. Cosine similarity over term frequency-inverse document frequency (TF-IDF) vectors weights each term by its importance in a message relative to the full collection of messages in the source data, then measures how closely the two weighted term vectors align.
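The token cleaning, token retention, and TF-IDF cosine similarity steps described above can be sketched as follows. This is a minimal, standard-library-only illustration under stated assumptions, not the study's code, which used nltk and the sklearn TF-IDF vectorizer. The contraction map, the stop-word list, the multiset definition of "tokens retained", and the smoothed IDF formula are all assumptions chosen for the example, and the IDF here is fitted on just the two example messages rather than on a full message collection.

```python
import math
import re
import string
from collections import Counter

# Hypothetical, deliberately tiny lookup tables for illustration only.
CONTRACTIONS = {"i'll": "i will", "don't": "do not", "can't": "cannot"}
STOP_WORDS = {"a", "an", "the", "to", "of", "and", "is", "are",
              "i", "you", "your", "will", "do", "not"}
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_tokens(text):
    """Lowercase, expand contractions, strip punctuation, drop stop words."""
    text = text.lower().replace("\u2019", "'")  # normalize curly apostrophes
    for short, full in CONTRACTIONS.items():
        text = re.sub(r"\b" + re.escape(short) + r"\b", full, text)
    return [t for t in text.translate(PUNCT_TABLE).split() if t not in STOP_WORDS]


def percent_retained(draft_tokens, final_tokens):
    """Percent of draft tokens also present in the final message (assumed definition)."""
    if not draft_tokens:
        return 0.0
    overlap = Counter(draft_tokens) & Counter(final_tokens)  # multiset intersection
    return 100.0 * sum(overlap.values()) / len(draft_tokens)


def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (dicts) using a smoothed IDF (assumed formula)."""
    n = len(docs)
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    idf = {tok: math.log((1 + n) / (1 + df)) + 1.0 for tok, df in doc_freq.items()}
    return [{tok: count * idf[tok] for tok, count in Counter(doc).items()}
            for doc in docs]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(tok, 0.0) for tok, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Invented example messages, not study data.
draft = "Your lab results are normal. I'll see you at your next visit."
final = ("Your lab results are normal, so no changes are needed. "
         "I'll see you at your next visit in March.")

draft_tokens, final_tokens = clean_tokens(draft), clean_tokens(final)
draft_vec, final_vec = tfidf_vectors([draft_tokens, final_tokens])
print(f"draft tokens retained: {percent_retained(draft_tokens, final_tokens):.1f}%")
print(f"TF-IDF cosine similarity: {cosine(draft_vec, final_vec):.2f}")
```

In this invented pair the physician keeps the whole draft and appends detail, so retention is complete while cosine similarity falls below 1 because the final message contains terms the draft lacks.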
BLEU (Bilingual Evaluation Understudy) prioritizes precision by evaluating how many n-grams (contiguous word sequences) from the GDR appear in the final response.[19] ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measures the overlap of n-grams and the longest common subsequences between the GDR and the final response, and quantifies how much of the GDR was retained for the final response.[20] BERTScore assesses semantic similarity between the generated and reference texts.[21] This metric offers insight into the alignment of the core meaning between GDRs and physician responses.
Readability
We compared the readability of the GDRs against the physician-sent messages using the Flesch reading-ease test, and compared their length using average token count.[22]
Qualitative Evaluations
We performed two qualitative evaluations for this study. First, we distributed a user survey (Supplement 3) with 5-point Likert scale questions and a Net Promoter Score to gauge subjective impressions of the AI tool.[23] Second, we evaluated how well BERTScores reflect semantic alignment and message quality using physician review. Three board-certified physicians assessed 120 randomized message pairs, 60 with GPT-generated drafts (intervention) and 60 without (baseline), sampled from BERTScore tertiles. Each pair included an unedited GDR and the corresponding final provider message (either a physician-edited GDR or a fully physician-composed response). Reviewers, blinded to group, rated each pair on a 5-point scale across four criteria:
Effectiveness of the GPT draft in answering the patient’s question.
Potential for misunderstanding.
Potential for harm.
Overall preference for the GPT-generated draft versus the physician's final message.
Statistical Analysis
For each metric, we report mean scores by group (baseline vs. intervention) and by intervention phase, with 95% confidence intervals.
Estimates were obtained using linear regression with standard errors clustered at the provider level. For the qualitative evaluation of BERTScores and physician ratings, we computed Spearman correlation coefficients to assess the relationship between semantic similarity and physician Likert ratings. Statistical significance was defined as two-sided p < 0.05. Results Overview of GDR Usage From September 27, 2023, to August 4, 2024, 66,200 GDRs were generated. The baseline group (unseen drafts) included 45,127 messages. The intervention group (GDRs shown to providers) totaled 21,073, of which 2,264 (11%) were used in final responses. Trends in GDR generation and usage over time are shown in Fig. 2 . Intervention: Used GPT-generated Draft Responses (GDRs) 1. LLM Metrics In the intervention group, a substantial proportion of AI-generated tokens were retained in the final physician-sent messages, with an average token retention rate of 62.22% (95% CI, 55.97–68.46). Cosine similarity ratio was 0.74 (0.71–0.77). BLEU and ROUGE scores reflected consistent lexical overlap (Table 1 ). BERTScore values were uniformly high, with F1 score 0.93 (0.91–0.94) suggesting strong semantic similarity between AI-generated drafts and the finalized messages. 2. Token Count The final physician-edited messages were consistently longer compared to the initial GPT-draft responses (GDRs). GDRs averaged 32.56 tokens (30.14–34.98), whereas final sent messages contained 51.96 tokens (21.97–81.95), reflecting a 59.5% increase in length following physician edits. 3. Flesch Reading-Ease Analysis The average readability score of GPT-generated drafts was 41.42 (40.85–42.00). After physician review and editing, final messages had improved readability, with an average Flesch Reading Ease score of 45.51 (43.91–47.10). Despite the improvement, both draft and final messages remained in the college-level readability range. Baseline: Unseen & Unused GPT-generated Draft Responses (GDRs) 1. 
LLM Metrics
In this group, both the GDR and the physician-written response were independently crafted in response to an identical patient message. Thus, as expected, the percentage of AI-generated tokens in the final messages was significantly lower. Cosine similarity ratio was 0.27. BLEU was 0, while ROUGE1 (0.17), ROUGE2 (0.02), and ROUGE-L (0.12) likewise remained low. BERTScore F1 was moderate at 0.83.
2. Token Count
The average GDR token count was 29.24 (28.24–30.24) and the average token count in the final physician-sent messages was 31.86 (26.20–37.52). The GPT-generated responses were similar in length to independently written physician messages.
3. Flesch Reading Ease
The average Flesch reading ease score for GDRs was 41.76 (41.23–42.28). The scores for the final sent messages were higher, with an average score of 46.70 (42.02–51.37).
Intervention vs. Baseline Statistical Comparisons
Several LLM metrics showed statistically significant differences between groups (Table 1). The BLEU score differed by 0.49 (0.42–0.55, p < 0.001), and ROUGE scores also showed notable differences, reflecting lexical and structural alignment in the intervention group compared to baseline. Measures of semantic similarity, including BERTScore precision, recall, and F1, were each higher by approximately 0.10 points (all p < 0.001), indicating greater semantic alignment in the intervention group. The number of tokens in GPT-generated drafts (GDRs) was significantly higher in the intervention group, with a difference of 3.32 tokens (0.65–5.99, p = 0.015), suggesting modestly longer drafts when providers had access to GPT assistance. The final physician-sent messages were also longer in the intervention group by 20.10 tokens, but this difference did not reach statistical significance (−9.86 to 50.07, p = 0.189). In contrast, Flesch Reading Ease scores did not differ significantly between groups.
For final sent messages, the difference was −1.19 (−6.05 to 3.68, p = 0.632), suggesting that while message content improved in structure and semantic fidelity, overall readability remained comparable between intervention and baseline.
Results of Qualitative Evaluations
1. Physician Survey Feedback
In a survey of 16 clinicians, the GDR tool received mixed but generally positive feedback. Clinicians found the tool helpful overall, with a mean rating of 3.76 (95% CI: 3.07–4.46) for being “more of a help than a burden” and offering a “time-saving benefit” (mean 3.53, 95% CI: 2.90–4.16). Respondents also reported a reduction in cognitive load, with an average rating of 3.59 (2.88–4.29). When rating how well the tool responded across message types, General messages were rated the highest (3.29, 2.64–3.94), followed by Results (2.82, 2.19–3.46), Paperwork (2.59, 1.96–3.22), and Medication queries (2.29, 1.70–2.89). Clinicians’ likelihood of recommending the tool to a colleague had a Net Promoter Score (NPS) of +6.3, indicating a slightly positive overall recommendation tendency (NPS calculated as the percentage of promoters minus the percentage of detractors). User-facing survey questions are shown in Fig. 3.
2. Physician Evaluation by BERTScore Tertiles (Table 3)
In the baseline group, GPT drafts were rated neutral in answering questions (mean, 2.97) with low to moderate potential for misunderstanding (2.25) and very low potential for harm (1.43), but there was a strong preference for physician-crafted messages (4.24). In the intervention group, there was still a slight preference for physician-edited messages (3.99), and GPT drafts received high ratings for answering questions and maintaining low risks of misunderstanding or harm. Survey questions are shown in Fig. 4.
Correlation Analysis
Spearman correlation showed strong positive relationships between BERTScores and physician Likert ratings in the intervention group across all metrics (Table 3).
In the baseline group, there was minimal to no correlation.
Evaluation of Prompt Changes
We conducted a stratified analysis of the intervention group by prompt phase to assess changes following iterative prompt modifications. All statistical comparisons used standard errors clustered by provider. BLEU, ROUGE-1, ROUGE-2, and ROUGE-L remained stable across phases, with no statistically significant differences. Cosine similarity showed a modest decline in Phase 4 compared to Phase 1. BERTScore F1 remained consistent across phases, with no significant differences observed. Draft message length declined significantly after Phase 1 (mean: 52.60 tokens), when the prompt was adjusted to produce a shorter length. Compared to Phase 1, the reduction in draft length was statistically significant in all subsequent phases (Table 2). Final message length also declined from Phase 1 (71.06 tokens), with a significant drop in Phase 2 (Δ: −29.76, −57.89 to −1.63, p = 0.038); later phases showed no significant difference. In all phases, final physician-sent messages were consistently longer than the GPT-generated drafts, indicating that physicians tended to expand upon the initial AI-generated content. Readability scores (Flesch Reading Ease) for both drafts and final messages were stable across phases, with no significant changes.
Discussion
This study presents a comprehensive evaluation of GPT-generated draft replies in patient care communication, offering one of the first real-world, multi-angle assessments of their semantic fidelity, readability, and perceived utility. By combining quantitative LLM metrics, human validation, and usage data, our findings highlight both the potential and the challenges of integrating artificial intelligence (AI) into clinical messaging. In the intervention group where GDRs were used, we observed high lexical, syntactic, and semantic alignment between GDRs and final physician-edited messages.
BERTScores consistently exceeded 0.9, suggesting strong semantic overlap and effective capture of core clinical intent. This may indicate that GDRs provided a solid communication foundation, reducing the need for major revisions. However, such alignment may also reflect automation bias, with clinicians potentially deferring to AI-generated content, raising concerns about over-reliance. Analysis by BERTScore tertiles revealed that alignment between BERTScores and physician survey ratings was only meaningful when drafts were used. In the intervention group, higher BERTScores correlated with favorable Likert ratings, including effectiveness, clarity, and low risk of harm. In contrast, the baseline group showed minimal correlation, underscoring that BERTScores primarily reflect similarity, not intrinsic quality. GPT drafts were independently rated to have low to moderate potential for misunderstanding and very low potential for harm across both groups. Still, a clear preference for physician-written or -edited messages underscores the continued importance of maintaining a human touch, even as AI aims to enhance efficiency and reduce burden. Our study also highlighted the complementary role of AI in crafting messages. When providers wrote their own message independently, the average token length was similar to the draft. However, when GDRs were used, final message lengths were consistently longer, suggesting that providers added clinical details or stylistic changes. While we did not survey perceived empathy of the messages, prior research has shown congruency between longer messages and increased perceived empathy.[ 15 ] This finding underscores the importance of viewing GDRs as a collaborative tool. GDRs served as a valuable starting point that providers could refine and personalize. Areas for improvement remain. 
Both GPT drafts and physician-sent messages consistently scored at “college-level” readability, which is well above the 6th-grade reading level suggested for patient education materials.[24] Future iterations should incorporate prompt engineering that prioritizes readability to better serve patients with varying health literacy. In addition to improving readability, prompt engineering can also enhance the quality, tone, and acceptance of messages. For example, Yan et al. demonstrated that iterative prompt adjustments, informed by provider and patient feedback, improved message quality and tone.[25] In our implementation, iterative prompt updates shortened the message length. Targeted prompt modifications may improve physician uptake and integration of AI-generated content into their workflow.
Despite promising findings, our results revealed a steady decline in GDR usage rates throughout the study. While usage declined, provider feedback was generally positive and pointed to perceived benefits. Surveyed providers reported a sense of time savings and reduced cognitive load and found the tool more helpful than burdensome. The Net Promoter Score average of 5.88 supports a somewhat positive reception. Our findings align with those reported in a study by Garcia et al., which noted significant reductions in physician task load and work exhaustion scores with implementation of AI-generated draft replies.[17]
The discrepancy between perceived utility and actual usage highlights the complex nature of technology adoption in healthcare settings. These findings align with an American Medical Association (AMA) survey, where 65% of physicians viewed AI as advantageous to patient care, yet only 21% actively used it.[26] This gap between perceived potential and actual implementation underscores the need for strategies to improve adoption rates and integration into clinical workflows.
Addressing barriers to adoption, such as enhancing workflow integration, increasing AI literacy, improving tone and readability of genAI text, and personalizing AI outputs, will be critical to bridging this gap and ensuring sustained use of these tools. Furthermore, evaluating genAI implementations remains challenging due to reliance on human review, which is both resource- and time-intensive.[ 27 , 28 ] Our study sought to address this by combining quantitative LLM metrics with qualitative provider evaluations to provide a comprehensive and intuitive framework for comparing GDRs and final responses. While BERTScore is a common proxy for semantic similarity, it has limitations in healthcare-specific contexts, where clinical nuance is critical. Emerging frameworks like MedHELM and PDSQI-9 emphasize the need for structured, clinician-aligned evaluations that capture clinical relevance, safety, and patient-centered communication. These are areas where LLM metrics alone fall short.[ 29 , 30 ] Our use of clinician Likert ratings for assessment of semantic fidelity, risk, and helpfulness aligns with this emerging consensus. These frameworks underscore the growing recognition that human-centered, task-specific evaluations are essential for robust assessment of LLMs in clinical workflows. Limitations While we evaluated this AI tool with quantitative LLM metrics and qualitative surveys, limitations remain. Due to technical limitations, we could not evaluate drafts that may have been seen by a provider but discarded, and we were unable to definitively ascertain whether a draft was seen by a provider. This single-center study included only physicians, though many in-basket workflows involve non-physician clinical workers. Thus, findings may not reflect broader team-based use. Future studies should be multi-center and include diverse specialties and roles. 
Lastly, our study evaluated GPT-4 as implemented in Epic’s In Basket Art; future GPT iterations may produce different results and should be reassessed in a future study.
Conclusion
GPT-generated draft replies (GDRs) demonstrate strong semantic alignment with physician-edited messages and offer a promising foundation for enhancing patient-provider communication. Our findings highlight both the utility and limitations of LLM metrics such as BERTScore, which aligned with physician judgment only when GDRs were actively used. However, by combining quantitative metrics with structured human evaluation, this study reflects evolving potential best practices for LLM assessment in healthcare, and provides a pragmatic framework for health systems to assess, monitor, and refine AI-powered patient messaging tools.
Declarations
Competing Interests
All authors declare no financial or non-financial competing interests.
Human Ethics and Consent to Participate
Human Ethics and Consent to Participate declarations: not applicable.
Clinical Trial Number
Not applicable.
Funding
This study received no funding.
Author Contribution
T.M., P.J.L., L.P., and G.H. conceived and designed the study. S.S.V. performed the primary statistical analysis, with additional statistical input from L.P. L.P. also conducted the natural language processing analyses, including BERTScore, BLEU, and related metrics. S.M.Y., H.E.W., and J.K. served as provider reviewers for the survey data. All authors contributed to the interpretation of results, revised the manuscript critically for important intellectual content, and approved the final version for submission.
Acknowledgement
The authors thank the UCLA Health Clinical Informatics team and participating physicians for their support of this project.
Data Availability
Data supporting the findings of this study are available from UCLA Health, but restrictions apply to protect patient privacy.
De-identified data may be available from the corresponding author upon reasonable request.
References
1. Topol, E. J. High-performance medicine: The convergence of human and artificial intelligence. Nat. Med. 25, 44–56 (2019).
2. Yu, K. H., Beam, A. L. & Kohane, I. S. Artificial intelligence in healthcare. Nat. Biomed. Eng. 2, 719–731 (2018).
3. Adler-Milstein, J. et al. Electronic health record adoption in US hospitals: The emergence of a digital "advanced use" divide. J. Am. Med. Inform. Assoc. 24, 1142–1148 (2017).
4. Shanafelt, T. D., Dyrbye, L. N. & West, C. P. Addressing physician burnout: The way forward. JAMA 317, 901–902 (2017).
5. Sinsky, C. et al. Allocation of physician time in ambulatory practice: A time and motion study in 4 specialties. Ann. Intern. Med. 165, 753–760 (2016).
6. McPeek-Hinz, E. et al. Clinician burnout associated with sex, clinician type, work culture, and use of electronic health records. JAMA Netw. Open 4, e2114066 (2021).
7. Rotenstein, L. S. et al. Differences in clinician electronic health record use across adult and pediatric primary care specialties. JAMA Netw. Open 4, e2117244 (2021).
8. Akbar, F. et al. Physicians’ electronic inbox work patterns and factors associated with high inbox work duration. J. Am. Med. Inform. Assoc. 28, 923–930 (2021).
9. Nath, B. et al. Trends in electronic health record inbox messaging during the COVID-19 pandemic in an ambulatory practice network in New England. JAMA Netw. Open 4, e2131490 (2021).
10. Maddox, T. M., Rumsfeld, J. S. & Payne, P. R. O. Questions for artificial intelligence in health care. JAMA 321, 31–32 (2019).
11. Davenport, T. & Kalakota, R. The potential for artificial intelligence in healthcare. Future Healthc. J. 6, 94–98 (2019).
12. Diane, A., Gencarelli, P. J. Jr, Lee, J. M. Jr & Mittal, R. Utilizing ChatGPT to streamline the generation of prior authorization letters and enhance clerical workflow in orthopedic surgery practice: A case report. Cureus 15, e49680 (2023).
13. Brin, D. et al. Comparing ChatGPT and GPT-4 performance in USMLE soft skill assessments. Sci. Rep. 13, 16492 (2023).
14. Günay, S., Öztürk, A., Özerol, H. et al. Comparison of emergency medicine specialist, cardiologist, and ChatGPT in electrocardiography assessment. Am. J. Emerg. Med. 80, 51–60 (2024).
15. Ayers, J. W. et al. Comparing physician and artificial intelligence chatbot responses to patient questions posted to a public social media forum. JAMA Intern. Med. 183, 589–596 (2023).
16. Howe, P. D. L. et al. ChatGPT’s advice is perceived as better than that of professional advice columnists. Front. Psychol. 14, 1281255 (2023).
17. Garcia, P. et al. Artificial intelligence-generated draft replies to patient inbox messages. JAMA Netw. Open 7, e243201 (2024).
18. Tai-Seale, M. et al. AI-generated draft replies integrated into health records and physicians’ electronic communication. JAMA Netw. Open 7, e246565 (2024).
19. Papineni, K., Roukos, S., Ward, T. & Zhu, W.-J. BLEU: A method for automatic evaluation of machine translation. In Proc. 40th Annu. Meet. Assoc. Comput. Linguist. 311–318 (ACL, 2002).
20. Lin, C.-Y. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out: Proc. ACL Workshop 74–81 (ACL, 2004).
21. Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q. & Artzi, Y. BERTScore: Evaluating text generation with BERT. arXiv preprint arXiv:1904.09675 (2019).
22. Flesch, R. A new readability yardstick. J. Appl. Psychol. 32, 221–233 (1948).
23. Adams, C. et al. The ultimate question? Evaluating the use of Net Promoter Score in healthcare: A systematic review. Health Expect. 25, 2328–2339 (2022).
24. Cotugna, N., Vickery, C. E. & Carpenter-Haefele, K. M. Evaluation of literacy level of patient education pages in health-related journals. J. Community Health 30, 213–219 (2005).
25. Yan, S. et al. Prompt engineering on leveraging large language models in generating response to InBasket messages. J. Am. Med. Inform. Assoc. 31, 2263–2270 (2024).
26. AMA Augmented Intelligence Research. American Medical Association (2023). (Accessed 31 May 2024).
27. Bedi, S. et al. Testing and evaluation of health care applications of large language models: A systematic review. JAMA preprint (2024). doi: 10.1101/2024.04.15.24305869
28. Awasthi, R. et al. HumanELY: Human evaluation of LLM yield using a novel web-based evaluation tool. medRxiv preprint (2023). doi: 10.1101/2023.12.22.23300458v2
29. Garcia, P. et al. MedHELM: A human evaluation benchmark for large language models in medicine. NPJ Digit. Med. 7, 113 (2024).
30. Zhang, Y. et al. PDSQI-9: Clinician-aligned evaluation for patient-directed summaries. arXiv preprint arXiv:2505.23802 (2025).
Tables
Table 1. Natural language processing metrics comparing GPT-generated draft replies and final physician messages
| Metric | Baseline: Mean (95% CI) | Intervention: Mean (95% CI) | p-value |
| --- | --- | --- | --- |
| Draft Token Length | 29.24 (28.24–30.23) | 32.56 (30.14–34.98) | 0.015 |
| Final Message Token Length | 31.86 (26.20–37.52) | 51.96 (21.97–81.95) | 0.189 |
| Draft Tokens Retained (%) | 13.61 (12.58–14.63) | 62.22 (55.97–68.46) | <0.001 |
| Cosine Similarity | 0.27 (0.25–0.28) | 0.74 (0.71–0.77) | <0.001 |
| BLEU Score | 0.00 (0.00–0.00) | 0.49 (0.43–0.56) | <0.001 |
| ROUGE1 Score | 0.17 (0.17–0.18) | 0.62 (0.57–0.68) | <0.001 |
| ROUGE2 Score | 0.02 (0.02–0.03) | 0.55 (0.49–0.61) | <0.001 |
| ROUGE-L Score | 0.12 (0.11–0.12) | 0.60 (0.54–0.66) | <0.001 |
| BERTScore F1 | 0.83 (0.83–0.84) | 0.93 (0.91–0.94) | <0.001 |
| Flesch-Kincaid Reading Ease – Draft | 41.76 (41.23–42.28) | 41.42 (40.85–42.00) | 0.393 |
| Flesch-Kincaid Reading Ease – Final Message | 46.70 (42.02–51.37) | 45.51 (43.91–47.10) | 0.632 |
Table 2.
Physician survey results stratified by BERTScore tertiles in baseline and intervention groups Metric Scale Baseline Group (Mean [95% CI]) Intervention Group (Mean [95% CI]) How well GPT answered the question 1 = Very ineffective, 5 = Very effective 2.97 (2.69–3.24) (Neutral) 3.95 (95% CI : 3.75–4.15 ) (Effective) Potential for misunderstanding 1 = No potential for misunderstanding, 5 = Very high potential for misunderstanding 2.25 (2.02–2.48) (Low to Moderate) 1.45 (95% CI: 1.31–1.58 ) (Very Low) Potential for harm 1 = No potential for harm, 5 = Very high potential for harm 1.43 (1.29–1.58) (Very Low) 1.21 (95% CI: 1.11–1.30 ) (Very Low) Preference for GPT or doctor 1 = Strongly prefer GPT-generated messages, 5 = Strongly prefer doctor’s final messages 4.24 (4.03–4.45) (Strong Preference for Doctor) 3.99 (95% CI: 3.81–4.17 ) (Slight Preference for Doctor) Table 3. Natural language processing metrics and message length across intervention prompt refinement phases (Phases 1–4) Metric Phase 1: Mean (95% CI) Phase 2: Mean (95% CI) Phase 3: Mean (95% CI) Phase 4: Mean (95% CI) p-values Draft Token Length 52.60 (48.94–56.26) 32.84 (30.93–34.75) 29.87 (29.25–30.49) 29.46 (28.13–30.78) <0.001 Final Message Token Length 71.06 (23.90–118.22) 41.30 (23.35–59.25) 53.02 (20.75–85.28) 48.68 (17.86–79.50) <0.001 Draft Tokens Retained (%) 67.64 (59.38–75.89) 64.59 (57.43–71.74) 61.35 (53.11–69.59) 60.25 (54.45–66.06) 0.026 Cosine Similarity 0.71 (0.65–0.77) 0.75 (0.71–0.80) 0.75 (0.72–0.79) 0.73 (0.71–0.75) 0.038 BLEU Score 0.48 (0.39–0.57) 0.51 (0.44–0.59) 0.49 (0.41–0.58) 0.48 (0.42–0.54) 0.982 ROUGE1 Score 0.62 (0.54–0.70) 0.64 (0.58–0.70) 0.63 (0.56–0.69) 0.61 (0.56–0.66) 0.651 ROUGE2 Score 0.54 (0.47–0.62) 0.57 (0.50–0.64) 0.56 (0.48–0.64) 0.54 (0.48–0.60) 0.911 ROUGE-L Score 0.59 (0.51–0.67) 0.61 (0.55–0.68) 0.60 (0.53–0.67) 0.58 (0.53–0.64) 0.916 BERTScore F1 0.92 (0.91–0.94) 0.93 (0.92–0.94) 0.93 (0.91–0.94) 0.92 (0.91–0.93) 0.979 Flesch-Kincaid Reading Ease – Draft 41.36 
(39.33–43.39) 40.92 (39.37–42.47) 41.77 (41.22–42.31) 41.10 (40.36–41.85) 0.797 Flesch-Kincaid Reading Ease – Final 44.10 (40.26–47.93) 45.14 (43.83–46.45) 46.03 (44.22–47.84) 45.30 (43.04–47.56) 0.555 Additional Declarations No competing interests reported. Cite Share Download PDF Status: Under Review Version 1 posted Editorial decision: Revision requested 21 Dec, 2025 Reviews received at journal 19 Dec, 2025 Reviews received at journal 11 Dec, 2025 Reviewers agreed at journal 10 Dec, 2025 Reviewers agreed at journal 08 Dec, 2025 Reviewers agreed at journal 08 Dec, 2025 Reviewers agreed at journal 08 Dec, 2025 Reviewers agreed at journal 08 Dec, 2025 Reviewers agreed at journal 08 Dec, 2025 Reviewers invited by journal 08 Dec, 2025 Editor assigned by journal 28 Oct, 2025 Submission checks completed at journal 28 Oct, 2025 First submitted to journal 20 Oct, 2025 You are reading this latest preprint version Research Square lets you share your work early, gain feedback from the community, and start making changes to your manuscript prior to peer review in a journal. As a division of Research Square Company, we’re committed to making research communication faster, fairer, and more useful. We do this by developing innovative software and high quality services for the global research community. Our growing team is made up of researchers and industry professionals working together to solve the most critical problems facing scientific publishing. 
Also discoverable on Platform About Our Team In Review Editorial Policies Advisory Board Help Center Resources Author Services Accessibility API Access RSS feed Manage Cookie Preferences © Research Square 2026 | ISSN 2693-5015 (online) Privacy Policy Terms of Service Do Not Sell My Personal Information {"props":{"pageProps":{"initialData":{"identity":"rs-7909272","acceptedTermsAndConditions":true,"allowDirectSubmit":false,"archivedVersions":[],"articleType":"Article","associatedPublications":[],"authors":[{"id":558398803,"identity":"b08d7eef-1254-4d7f-84ee-3e9274f53a47","order_by":0,"name":"Gavin Hui","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAAA/0lEQVRIie3NMWsCMRTA8YSCU1LXyLX6FSIBETJ07Ne44+BuOdBuDg4pB1ldFb9Ix5MHumQuwk2l0FkolA4i3kXoUHjYsUP+Q0hCfnmEhEL/tIoawrrNRlzON6ZdrpOe8UQ2Z3qd+Dey+isZPLohPL3AnXqFTT2dHYncbZ4PZKYTg5ChKySsHLBRnaV66SSRLikFcTlOTBYDty0pRhG3DdlTI6gFnCw+LkStJ18RP3lSftMTTgYirTyRUdGJuPHECmpwIsV7S3Im6kxptlWs5xI7jre5QqcskvKTW/3QXadvNZv3+7c7gP1hru/RKdWvC+bXGHnup2B/hUKhUOinM3yNWnvaIBfXAAAAAElFTkSuQmCC","orcid":"","institution":"University of California, Los Angeles","correspondingAuthor":true,"prefix":"","firstName":"Gavin","middleName":"","lastName":"Hui","suffix":""},{"id":558398804,"identity":"5bde64d0-6064-4d6a-bdad-a9ff2a0525eb","order_by":1,"name":"Laura Prichard","email":"","orcid":"","institution":"UCLA Health Information Technology, UCLA Health, University of California, Los Angeles","correspondingAuthor":false,"prefix":"","firstName":"Laura","middleName":"","lastName":"Prichard","suffix":""},{"id":558398806,"identity":"a857fadc-8f2c-43e7-9309-3a791bf56391","order_by":2,"name":"Taylor Martin","email":"","orcid":"","institution":"Gillett Health","correspondingAuthor":false,"prefix":"","firstName":"Taylor","middleName":"","lastName":"Martin","suffix":""},{"id":558398808,"identity":"e2cb306d-8f14-4d2e-bc7f-a5acefdabbab","order_by":3,"name":"Sitaram 
Vangala","email":"","orcid":"","institution":"University of California, Los Angeles","correspondingAuthor":false,"prefix":"","firstName":"Sitaram","middleName":"","lastName":"Vangala","suffix":""},{"id":558398814,"identity":"6e98a14b-3120-409d-84ab-1ba3f1e306c5","order_by":4,"name":"Joshua Khalili","email":"","orcid":"","institution":"University of California, Los Angeles","correspondingAuthor":false,"prefix":"","firstName":"Joshua","middleName":"","lastName":"Khalili","suffix":""},{"id":558398819,"identity":"98c17449-8179-4d38-a16a-52289563ffff","order_by":5,"name":"Sun M. Yoo","email":"","orcid":"","institution":"University of California, Los Angeles","correspondingAuthor":false,"prefix":"","firstName":"Sun","middleName":"M.","lastName":"Yoo","suffix":""},{"id":558398820,"identity":"759a3635-d2ad-40ab-8351-5c1c07654c44","order_by":6,"name":"Hawkin E. Woo","email":"","orcid":"","institution":"University of California, Los Angeles","correspondingAuthor":false,"prefix":"","firstName":"Hawkin","middleName":"E.","lastName":"Woo","suffix":""},{"id":558398823,"identity":"2bb592bd-be1b-44ae-9044-ed3ad5bfbb80","order_by":7,"name":"Paul J. 
Lukac","email":"","orcid":"","institution":"UCLA Health Information Technology, UCLA Health, University of California, Los Angeles","correspondingAuthor":false,"prefix":"","firstName":"Paul","middleName":"J.","lastName":"Lukac","suffix":""}],"badges":[],"createdAt":"2025-10-21 00:53:15","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-7909272/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-7909272/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":97988194,"identity":"c9618b8a-69e5-4c87-881d-15bd8d8cf13e","added_by":"auto","created_at":"2025-12-11 14:10:00","extension":"docx","order_by":0,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":185698,"visible":true,"origin":"","legend":"","description":"","filename":"PoeticorProsaicEvaluatingtheLinguisticQualityofAIGeneratedDraftRepliestoPatientPortalMessages.docx","url":"https://assets-eu.researchsquare.com/files/rs-7909272/v1/f4842755550310b7c07c04dd.docx"},{"id":97988190,"identity":"370989b2-e90d-42f2-81c6-71b49ec491df","added_by":"auto","created_at":"2025-12-11 14:10:00","extension":"json","order_by":1,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":9771,"visible":true,"origin":"","legend":"","description":"","filename":"2d2887d58bc744cf8d46bf2cb95d7d3d.json","url":"https://assets-eu.researchsquare.com/files/rs-7909272/v1/68cc660d1910897253107180.json"},{"id":97988186,"identity":"daf8a409-9ada-4625-8418-65bde553fc22","added_by":"auto","created_at":"2025-12-11 14:10:00","extension":"xml","order_by":2,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":103030,"visible":true,"origin":"","legend":"","description":"","filename":"2d2887d58bc744cf8d46bf2cb95d7d3d1enriched.xml","url":"https://assets-eu.researchsquare.com/files/rs-7909272/v1/f9615b916964dfbcdf83d545.xml"},{"id":97988188,"identity":"a1b61951-f9c5-4264-868d-914c96a6b1af","added_by":"auto","created_at":"2025-12-11 
14:10:00","extension":"jpeg","order_by":3,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":346313,"visible":true,"origin":"","legend":"","description":"","filename":"floatimage1.jpeg","url":"https://assets-eu.researchsquare.com/files/rs-7909272/v1/ec48bb6cfc12cab9e5933162.jpeg"},{"id":97988196,"identity":"dc16720a-65cc-4551-ab19-47cb5edd3f06","added_by":"auto","created_at":"2025-12-11 14:10:01","extension":"png","order_by":4,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":69286,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage1.png","url":"https://assets-eu.researchsquare.com/files/rs-7909272/v1/5a0ebd64334e9f4ce923f833.png"},{"id":97988198,"identity":"b34e07aa-eae0-4a5c-8be0-bdba7f852b67","added_by":"auto","created_at":"2025-12-11 14:10:01","extension":"xml","order_by":5,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":102180,"visible":true,"origin":"","legend":"","description":"","filename":"2d2887d58bc744cf8d46bf2cb95d7d3d1structuring.xml","url":"https://assets-eu.researchsquare.com/files/rs-7909272/v1/87b57c5337ff4cf8f5943bde.xml"},{"id":97988195,"identity":"2c2e545c-41ac-4d21-919f-58312f1d1c1d","added_by":"auto","created_at":"2025-12-11 14:10:01","extension":"html","order_by":6,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":117617,"visible":true,"origin":"","legend":"","description":"","filename":"earlyproof.html","url":"https://assets-eu.researchsquare.com/files/rs-7909272/v1/11a871967f9fbe22c7f7499a.html"},{"id":98424275,"identity":"e7055597-31c3-4a2c-b928-e87992837607","added_by":"auto","created_at":"2025-12-17 16:33:07","extension":"png","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":96542,"visible":true,"origin":"","legend":"\u003cp\u003eExample Epic electronic health record system prompt used to generate draft 
replies\u003c/p\u003e","description":"","filename":"1.png","url":"https://assets-eu.researchsquare.com/files/rs-7909272/v1/f9faf89b94c3cc3f19d6c43e.png"},{"id":97988189,"identity":"2fa0f51a-75aa-4dd2-9454-d2e239864ccc","added_by":"auto","created_at":"2025-12-11 14:10:00","extension":"png","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":118028,"visible":true,"origin":"","legend":"\u003cp\u003eMonthly usage rates of GPT-generated draft replies\u003c/p\u003e","description":"","filename":"2.png","url":"https://assets-eu.researchsquare.com/files/rs-7909272/v1/15197126219d8ff9d1c426a4.png"},{"id":97988192,"identity":"3798c400-766a-45e3-aa53-6959f2f4c0e9","added_by":"auto","created_at":"2025-12-11 14:10:00","extension":"png","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":72890,"visible":true,"origin":"","legend":"\u003cp\u003eClinician survey items evaluating accuracy, tone, prompt changes, time savings, cognitive load, and overall helpfulness of GPT-generated draft replies\u003c/p\u003e","description":"","filename":"3.png","url":"https://assets-eu.researchsquare.com/files/rs-7909272/v1/43167d4b8d1e3356615d4b49.png"},{"id":97988191,"identity":"d415eac8-6130-4557-ab2b-ce08b7198973","added_by":"auto","created_at":"2025-12-11 14:10:00","extension":"png","order_by":4,"title":"Figure 4","display":"","copyAsset":false,"role":"figure","size":70088,"visible":true,"origin":"","legend":"\u003cp\u003ePhysician ratings of GPT-generated draft replies versus physician-finalized messages across BERTScore tertiles\u003c/p\u003e","description":"","filename":"4.png","url":"https://assets-eu.researchsquare.com/files/rs-7909272/v1/d2f324d9ee6df096f4311365.png"},{"id":98774721,"identity":"d1790c32-bc64-4aee-9a79-2aa9f5b3c08f","added_by":"auto","created_at":"2025-12-22 
12:12:03","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":1392416,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-7909272/v1/0fddb02e-a751-4bb9-a68f-41d24df3dea7.pdf"}],"financialInterests":"No competing interests reported.","formattedTitle":"Poetic or Prosaic? Evaluating the Linguistic Quality of AI-Generated Draft Replies to Patient Portal Messages","fulltext":[{"header":"Introduction","content":"\u003cp\u003eThe integration of artificial intelligence (AI) into healthcare has ushered in a new era of innovation, offering promising solutions to longstanding challenges faced by healthcare providers.[\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e, \u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e] As healthcare systems digitized, electronic health records (EHRs) have become indispensable tools for managing patient information and facilitating communication.[\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e] However, EHRs have increased administrative tasks, contributing to clinician burnout and patient care challenges.[4\u0026ndash;7\u003csup\u003e]\u003c/sup\u003e Patient portal messages, a primary contributor to increased \u0026lsquo;pajama time\u0026rsquo;, have risen substantially with the expansion of telehealth during the global pandemic.[\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e, \u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e] In recent years, AI technologies have emerged as transformative tools, offering opportunities to streamline workflows and enhance communication efficiency.[\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e, \u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e]\u003c/p\u003e\u003cp\u003eIn this context, 
the advent of genAI, utilizing large language models (LLMs) and natural language processing (NLP), presents a promising opportunity to alleviate the burden of administrative tasks and enhance the overall efficiency of healthcare delivery. Powered by transformer neural network architecture, LLMs have already demonstrated astonishing capabilities in healthcare. They can now complete documentation, such as prior authorizations, outperform humans on medical licensing exams and interpret electrocardiograms.[\u003cspan additionalcitationids=\"CR13\" citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e] In non-clinical scenarios, ChatGPT responses have demonstrated superior empathy and better quality than physician responses to patient messages posted to a social media forum, and more \u0026ldquo;balanced, complete, empathetic, and helpful\u0026rdquo; counseling than widely known professional advice columnists.[\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e, \u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e]\u003c/p\u003e\u003cp\u003eHowever, early real-world experiences with Epic\u0026rsquo;s Augmented Response Technology have shown mixed results \u0026ndash; no significant time savings, but paradoxically, a subjective sense of relief from EHR charting burden and an appreciation for the potential value of the tool.[\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e, \u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e]\u003c/p\u003e\u003cp\u003eGiven early reports of low usage and limited efficiency gains, but suggestions of qualitative value of GDRs, we sought to illuminate what constitutes an \u0026ldquo;acceptable\u0026rdquo; AI-generated message. We aimed to provide a comprehensive perspective on evaluating genAI tools in clinical practice. 
By integrating traditional NLP metrics with human evaluation, including physician surveys, stratified review, and end-user feedback, we sought to inform real-world implementation strategies.\u003c/p\u003e"},{"header":"Methods","content":"\u003cdiv id=\"Sec3\" class=\"Section2\"\u003e\u003ch2\u003eStudy Design\u003c/h2\u003e\u003cp\u003eUCLA Health was an early adopter of an AI-generated draft response (GDR) pilot tool. Starting September 27th, 2023, nine primary care physicians (PCPs) across six outpatient clinics were sequentially added as pilot users, and all had access by November 2023. Initial users were selected based on EHR aptitude (i.e., physician informaticists) and message volume. Since GDRs are generated for all PCPs in a specific clinic (whether made visible or not), the expansion users included the remaining physicians in each pilot clinic. On February 22, 2024, 21 expansion users were added; two later left practice. Education consisted of live and recorded webinars and tip sheets. Once activated, providers began receiving GDRs for online portal questions. All providers gave verbal consent. This study was reviewed by the UCLA Institutional Review Board (Office of the Human Research Protection Program) and deemed exempt under institutional policy IRB# 24-001342. All procedures were conducted in accordance with relevant guidelines and regulations, including the Declaration of Helsinki and its later amendments.\u003c/p\u003e\u003cp\u003eGDRs were generated if the message fell into one of four Epic-managed categories: Medication, Paperwork, Results, and General. Questioners had to be 18 or older and not a proxy (e.g., a caregiver). UCLA Health had the ability to draft and edit four separate prompts, each tailored to a category. We used Epic\u0026rsquo;s suggested starter prompts, offered to all participating systems. An example prompt is shown in \u003cb\u003eFig.\u0026nbsp;1\u003c/b\u003e. 
Four prompt edits occurred during the study period:\u003c/p\u003e\u003cp\u003e\u003col\u003e\u003cspan\u003e\u003cli\u003e\u003cp\u003e10/02/23: \u0026ldquo;Do not include a signoff. Let the provider end the response.\u0026rdquo;\u003c/p\u003e\u003c/li\u003e\u003c/span\u003e\u003cspan\u003e\u003cli\u003e\u003cp\u003e11/06/23: \u0026ldquo;Limit the response to a maximum of 100 words.\u0026rdquo;\u003c/p\u003e\u003c/li\u003e\u003c/span\u003e\u003cspan\u003e\u003cli\u003e\u003cp\u003e01/28/24: Removed a SmartLink (native Epic function) mapped to patient specific appointment data and deleted text in the prompt referencing the word \u0026lsquo;appointment.\u0026rsquo;\u003c/p\u003e\u003c/li\u003e\u003c/span\u003e\u003cspan\u003e\u003cli\u003e\u003cp\u003e05/09/24: Added a SmartLink to include the text of the message \u0026lsquo;Subject\u0026rsquo; line in the prompt.\u003c/p\u003e\u003c/li\u003e\u003c/span\u003e\u003c/ol\u003e\u003c/p\u003e\u003cp\u003eWe analyzed LLM/NLP metrics, token count, and Flesch reading-ease across four phases defined by prompt changes (Phases 1\u0026ndash;4). This observational study was deemed exempt from UCLA IRB review.\u003c/p\u003e\u003c/div\u003e\n\u003ch3\u003eData Source and Characteristics\u003c/h3\u003e\n\u003cp\u003eData included the patient message, the unedited GDR, and the final physician message sent to the patient. Due to tool design, GDRs were generated for all physicians in a clinic, regardless of individual activation. We extracted a dataset of GDRs not shown to providers, representing the \u0026lsquo;baseline\u0026rsquo; group. In this group, GPT and physician responses were independently generated for the same message, allowing for a comparison between unsupervised, unedited AI drafts and physician-written messages. 
A separate \u0026lsquo;intervention\u0026rsquo; group included GDRs that were shown, edited, and sent by providers.\u003c/p\u003e\n\u003ch3\u003eData Pre-processing, Cleaning, and Manipulation\u003c/h3\u003e\n\u003cdiv id=\"Sec6\" class=\"Section2\"\u003e\u003ch2\u003eText Pre-Processing\u003c/h2\u003e\u003cp\u003eWe applied text pre-processing steps from the Python Natural Language Toolkit (\u003cem\u003enltk\u003c/em\u003e library) and custom functions. For bag-of-words and cosine similarity, we split the messages up into cleaned tokens using the following steps: removal of common signatures and rich text; expansion of contracted words (e.g., I\u0026rsquo;ll\u0026thinsp;\u0026gt;\u0026thinsp;I will); conversion to lowercase; and removal of punctuation and stop words. Lemmatization was not used, as word forms were relevant for comparison. We calculated the percentage of GDR tokens used and the percentage of final messages composed of GDR tokens. Cosine similarity was computed by vectorizing the tokens using the term frequency-inverse document frequency (TF-IDF) vectorizer from the \u003cem\u003esklearn\u003c/em\u003e Python library.\u003c/p\u003e\u003c/div\u003e\n\u003ch3\u003eMetric Calculation\u003c/h3\u003e\n\u003cp\u003eFor calculating the BLEU, ROUGE and BERTScore, we applied basic cleaning (removal of signatures and rich text) and no other pre-processing. We derived the BLEU and ROUGE scores using the Python \u003cem\u003eevaluate\u003c/em\u003e library, and BERTScore precision, recall and F1 metrics using the \u003cem\u003ebert_score\u003c/em\u003e library. 
For Flesch-Kincaid readability, we used basic cleaning and calculated average syllables per word and words per sentence using nltk functions.\u003c/p\u003e\u003cdiv id=\"Sec8\" class=\"Section2\"\u003e\u003ch2\u003eMessage Comparison Metrics\u003c/h2\u003e\u003cp\u003eTo assess lexical and syntactic overlap between GDRs and final patient messages, we used a combination of traditional NLP and LLM-based metrics: cosine similarity, BLEU, ROUGE, and BERTScore.\u003c/p\u003e\u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003eCosine similarity with term frequency-inverse document frequency (TF-IDF) analyzes and weighs individual terms within texts and can be used to highlight the importance of a word in a document relative to a collection of documents in the source data.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eBLEU (Bilingual Evaluation Understudy) prioritizes precision by evaluating how many n-grams (common word strings) from the GDR appear in the final response.[\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e]\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eROUGE (Recall-Oriented Understudy for Gisting Evaluation) measures the overlap of n-grams and the longest common subsequences between the GDR and the final response, and quantifies how much of the GDR was retained for the final response.[\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e]\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eBERTScore assesses semantic similarity between the generated and reference texts.[\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e] This metric offers insight into the alignment of the core meaning between GDRs and physician responses.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e\u003c/div\u003e\n\u003ch3\u003eReadability\u003c/h3\u003e\n\u003cp\u003eWe compared the readability of the GDRs against the physician-sent messages using Flesch reading-ease test 
and length by calculating average token count.[\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e]\u003c/p\u003e\n\u003ch3\u003eQualitative Evaluations\u003c/h3\u003e\n\u003cp\u003eWe performed two qualitative evaluations for this study. First, we distributed a user survey (Supplement 3) with 5-point Likert scale questions and a Net Promoter Score to gauge subjective impressions of the AI tool.[\u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e23\u003c/span\u003e]\u003c/p\u003e\u003cp\u003e Second, we evaluated how well BERTScores reflect semantic alignment and message quality using physician review. Three board-certified physicians assessed 120 randomized message pairs, 60 with GPT-generated drafts (intervention) and 60 without (baseline), sampled from BERTScore tertiles. Each pair included an unedited GDR and the corresponding final provider message (either a physician-edited GDR or fully physician-composed response). Reviewers, blinded to group, rated each pair on a 5-point scale across four criteria:\u003c/p\u003e\u003cp\u003e\u003col\u003e\u003cspan\u003e\u003cli\u003e\u003cp\u003eEffectiveness of the GPT draft in answering the patient\u0026rsquo;s question.\u003c/p\u003e\u003c/li\u003e\u003c/span\u003e\u003cspan\u003e\u003cli\u003e\u003cp\u003ePotential for misunderstanding.\u003c/p\u003e\u003c/li\u003e\u003c/span\u003e\u003cspan\u003e\u003cli\u003e\u003cp\u003ePotential for harm.\u003c/p\u003e\u003c/li\u003e\u003c/span\u003e\u003cspan\u003e\u003cli\u003e\u003cp\u003eOverall preference for the GPT-generated draft versus the physician's final message.\u003c/p\u003e\u003c/li\u003e\u003c/span\u003e\u003c/ol\u003e\u003c/p\u003e\u003cdiv id=\"Sec11\" class=\"Section2\"\u003e\u003ch2\u003eStatistical Analysis\u003c/h2\u003e\u003cp\u003eFor each metric, we report mean scores by group (baseline vs. intervention) and by intervention phase, with 95% confidence intervals. 
Estimates were obtained using linear regression with standard errors clustered at the provider level. For the qualitative evaluation of BERTScores and physician ratings, we computed Spearman correlation coefficients to assess the relationship between semantic similarity and physician Likert ratings. Statistical significance was defined as two-sided \u003cem\u003ep\u003c/em\u003e\u0026thinsp;\u0026lt;\u0026thinsp;0.05.\u003c/p\u003e\u003c/div\u003e"},{"header":"Results","content":"\u003cdiv id=\"Sec13\" class=\"Section2\"\u003e\u003ch2\u003eOverview of GDR Usage\u003c/h2\u003e\u003cp\u003eFrom September 27, 2023, to August 4, 2024, 66,200 GDRs were generated. The baseline group (unseen drafts) included 45,127 messages. The intervention group (GDRs shown to providers) totaled 21,073, of which 2,264 (11%) were used in final responses. Trends in GDR generation and usage over time are shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e2\u003c/span\u003e.\u003c/p\u003e\u003cp\u003e\u003cb\u003eIntervention: Used GPT-generated Draft Responses (GDRs)\u003c/b\u003e\u003c/p\u003e\u003cp\u003e\u003cem\u003e1. LLM Metrics\u003c/em\u003e\u003c/p\u003e\u003cp\u003eIn the intervention group, a substantial proportion of AI-generated tokens were retained in the final physician-sent messages, with an average token retention rate of 62.22% (95% CI, 55.97\u0026ndash;68.46). Cosine similarity ratio was 0.74 (0.71\u0026ndash;0.77). BLEU and ROUGE scores reflected consistent lexical overlap (Table\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e). BERTScore values were uniformly high, with F1 score 0.93 (0.91\u0026ndash;0.94) suggesting strong semantic similarity between AI-generated drafts and the finalized messages.\u003c/p\u003e\u003cp\u003e\u003cem\u003e2. Token Count\u003c/em\u003e\u003c/p\u003e\u003cp\u003eThe final physician-edited messages were consistently longer compared to the initial GPT-draft responses (GDRs). 
GDRs averaged 32.56 tokens (30.14\u0026ndash;34.98), whereas final sent messages contained 51.96 tokens (21.97\u0026ndash;81.95), reflecting a 59.5% increase in length following physician edits.\u003c/p\u003e\u003cp\u003e\u003cem\u003e3. Flesch Reading-Ease Analysis\u003c/em\u003e\u003c/p\u003e\u003cp\u003eThe average readability score of GPT-generated drafts was 41.42 (40.85\u0026ndash;42.00). After physician review and editing, final messages had improved readability, with an average Flesch Reading Ease score of 45.51 (43.91\u0026ndash;47.10). Despite the improvement, both draft and final messages remained in the college-level readability range.\u003c/p\u003e\u003cp\u003e\u003cb\u003eBaseline: Unseen \u0026amp; Unused GPT-generated Draft Responses (GDRs)\u003c/b\u003e\u003c/p\u003e\u003cp\u003e\u003cem\u003e1. LLM Metrics\u003c/em\u003e\u003c/p\u003e\u003cp\u003e In this group, both GDR and the physician-written responses were independently crafted in response to an identical patient message. Thus, as expected, the percentage of AI-generated tokens in the final messages was significantly lower. Cosine similarity ratio was 0.27. BLEU was 0, while ROUGE1 (0.17), ROUGE2 (0.02), and ROUGE-L (0.12) likewise remained low. BERTScore F1 was moderate at 0.83.\u003c/p\u003e\u003cp\u003e\u003cem\u003e2. Token Count\u003c/em\u003e\u003c/p\u003e\u003cp\u003eThe average GDR token count was 29.24 (28.24\u0026ndash;30.24) and the average token count in the final physician-sent messages was 31.86 (26.20-37.52). The GPT-generated responses were similar in length to physician independently written messages.\u003c/p\u003e\u003cp\u003e\u003cem\u003e3. Flesch Reading Ease\u003c/em\u003e\u003c/p\u003e\u003cp\u003eThe average Flesch reading ease score for GDRs was 41.76 (41.23\u0026ndash;42.28). 
The scores for the final sent messages were higher, with an average score of 46.70 (42.02\u0026ndash;51.37).\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec14\" class=\"Section2\"\u003e\u003ch2\u003eIntervention vs. Baseline Statistical Comparisons\u003c/h2\u003e\u003cp\u003eSeveral LLM metrics showed statistically significant differences between groups (Table\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e. The BLEU score differed by 0.49 (0.42\u0026ndash;0.55, p\u0026thinsp;\u0026lt;\u0026thinsp;0.001), and ROUGE scores also showed notable differences, reflecting lexical and structural alignment in the intervention group compared to baseline. Measures of semantic similarity, including BERTScore precision, recall, and F1, were each higher by approximately 0.10 points (all p\u0026thinsp;\u0026lt;\u0026thinsp;0.001), indicating greater semantic alignment in the intervention group.\u003c/p\u003e\u003cp\u003eThe number of tokens in GPT-generated drafts (GDRs) was significantly higher in the intervention group, with a difference of 3.32 tokens (0.65\u0026ndash;5.99, p\u0026thinsp;=\u0026thinsp;0.015), suggesting modestly longer drafts when providers had access to GPT assistance. The final physician-sent messages were also longer in the intervention group by 20.10 tokens, but this difference did not reach statistical significance (\u0026ndash;9.86 to 50.07, p\u0026thinsp;=\u0026thinsp;0.189). In contrast, Flesch Reading Ease scores did not differ significantly between groups. For final sent messages, the difference was \u0026minus;\u0026thinsp;1.19 (-6.05 to 3.68, p\u0026thinsp;=\u0026thinsp;0.632), suggesting that while message content improved in structure and semantic fidelity, overall readability remained comparable between intervention and baseline.\u003c/p\u003e\u003cp\u003e\u003cb\u003eResults of Qualitative Evaluations\u003c/b\u003e\u003c/p\u003e\u003cp\u003e\u003cem\u003e1. 
Physician Survey Feedback\u003c/em\u003e\u003c/p\u003e\u003cp\u003eIn a survey of 16 clinicians, the GDR tool received mixed but generally positive feedback. Clinicians found the tool helpful overall, with a mean rating of 3.76 (95% CI: 3.07\u0026ndash;4.46) for being \u0026ldquo;more of a help than a burden\u0026rdquo; and offering a \u0026ldquo;time-saving benefit\u0026rdquo; (mean 3.53, 95% CI: 2.90\u0026ndash;4.16). Respondents also reported a reduction in cognitive load, with an average rating of 3.59 (2.88\u0026ndash;4.29).When rating how well the tool responded across message types, General messages were rated the highest (3.29, 2.64\u0026ndash;3.94), followed by Results (2.82, 2.19\u0026ndash;3.46), Paperwork (2.59, 1.96\u0026ndash;3.22), and Medication queries (2.29, 1.70\u0026ndash;2.89). Clinicians\u0026rsquo; likelihood of recommending the tool to a colleague had a Net Promoter Score (NPS) of +\u0026thinsp;6.3, indicating a slightly positive overall recommendation tendency (NPS calculated as the percentage of promoters minus the percentage of detractors). User-facing survey questions are shown in \u003cb\u003eFig.\u0026nbsp;3\u003c/b\u003e.\u003c/p\u003e\u003cp\u003e\u003cem\u003e2. Physician Evaluation by BERTScore Tertiles (\u003c/em\u003eTable\u0026nbsp;\u003cspan refid=\"Tab3\" class=\"InternalRef\"\u003e3\u003c/span\u003e\u003cem\u003e)\u003c/em\u003e\u003c/p\u003e\u003cp\u003eIn the baseline group, GPT drafts were rated neutral in answering questions (mean, 2.97) with low to moderate potential for misunderstanding (2.25) and very low potential for harm (1.43), but there was a strong preference for physician-crafted messages (4.24). In the intervention group, there was still a slight preference for physician-edited messages (3.99), and GPT drafts received high ratings for answering questions and maintaining low risks of misunderstanding or harm. 
Survey questions are shown in \u003cb\u003eFig.\u0026nbsp;4\u003c/b\u003e.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec15\" class=\"Section2\"\u003e\u003ch2\u003eCorrelation Analysis\u003c/h2\u003e\u003cp\u003eSpearman correlation showed strong positive relationships between BERTScores and physician Likert ratings in the intervention group across all metrics (Table\u0026nbsp;\u003cspan refid=\"Tab3\" class=\"InternalRef\"\u003e3\u003c/span\u003e). In the baseline group, there was minimal to no correlation.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec16\" class=\"Section2\"\u003e\u003ch2\u003eEvaluation of Prompt Changes\u003c/h2\u003e\u003cp\u003eWe conducted a stratified analysis of the intervention group by prompt phase to assess changes following iterative prompt modifications. All statistical comparisons were clustered by provider.\u003c/p\u003e\u003cp\u003eBLEU, ROUGE-1, ROUGE-2, and ROUGE-L remained stable across phases, with no statistically significant differences. Cosine similarity showed a modest decline in Phase 4 compared to Phase 1. BERTScore F1 remained consistent across phases, with no significant differences observed.\u003c/p\u003e\u003cp\u003eDraft message length declined significantly after Phase 1 (mean: 52.60 tokens), when the prompt was adjusted to produce shorter drafts. Compared to Phase 1, the reduction in draft length was statistically significant in all subsequent phases (Table\u0026nbsp;\u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e2\u003c/span\u003e). Final message length also declined from Phase 1 (71.06 tokens), with a significant drop in Phase 2 (Δ: \u0026minus;29.76, \u0026minus;57.89 to \u0026minus;1.63, \u003cem\u003ep\u003c/em\u003e\u0026thinsp;=\u0026thinsp;0.038); later phases showed no significant difference. 
In all phases, final physician-sent messages were consistently longer than the GPT-generated drafts, indicating that physicians tended to expand upon the initial AI-generated content. Readability scores (Flesch Reading Ease) for both drafts and final messages were stable across phases, with no significant changes.\u003c/p\u003e\u003c/div\u003e"},{"header":"Discussion","content":"\u003cp\u003eThis study presents a comprehensive evaluation of GPT-generated draft replies in patient care communication, offering one of the first real-world, multi-angle assessments of their semantic fidelity, readability, and perceived utility. By combining quantitative LLM metrics, human validation, and usage data, we highlight both the potential and the challenges of integrating artificial intelligence (AI) into clinical messaging.\u003c/p\u003e\u003cp\u003eIn the intervention group, where GDRs were used, we observed high lexical, syntactic, and semantic alignment between GDRs and final physician-edited messages. BERTScores consistently exceeded 0.9, suggesting strong semantic overlap and effective capture of core clinical intent. This may indicate that GDRs provided a solid communication foundation, reducing the need for major revisions. However, such alignment may also reflect automation bias, with clinicians potentially deferring to AI-generated content, raising concerns about over-reliance.\u003c/p\u003e\u003cp\u003eAnalysis by BERTScore tertiles revealed that alignment between BERTScores and physician survey ratings was only meaningful when drafts were used. In the intervention group, higher BERTScores correlated with favorable Likert ratings, including effectiveness, clarity, and low risk of harm. In contrast, the baseline group showed minimal correlation, underscoring that BERTScores primarily reflect similarity, not intrinsic quality. 
GPT drafts were independently rated as having low to moderate potential for misunderstanding and very low potential for harm across both groups. Still, a clear preference for physician-written or -edited messages underscores the continued importance of maintaining a human touch, even as AI aims to enhance efficiency and reduce burden.\u003c/p\u003e\u003cp\u003eOur study also highlighted the complementary role of AI in crafting messages. When providers wrote their own message independently, the average token length was similar to that of the draft. However, when GDRs were used, final message lengths were consistently longer, suggesting that providers added clinical details or stylistic changes. While we did not survey perceived empathy of the messages, prior research has shown an association between longer messages and increased perceived empathy.[\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e] This finding underscores the importance of viewing GDRs as a collaborative tool. GDRs served as a valuable starting point that providers could refine and personalize.\u003c/p\u003e\u003cp\u003eAreas for improvement remain. Both GPT drafts and physician-sent messages consistently scored at \u0026ldquo;college-level\u0026rdquo; readability, which is well above the 6th-grade reading level suggested for patient education materials.[\u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e24\u003c/span\u003e] Future iterations should incorporate prompt engineering that prioritizes readability to better serve patients with varying health literacy.\u003c/p\u003e\u003cp\u003eIn addition to improving readability, prompt engineering can also enhance the quality, tone, and acceptance of messages. 
For example, Yan et al. demonstrated that iterative prompt adjustments, informed by provider and patient feedback, improved message quality and tone.[\u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e25\u003c/span\u003e] In our implementation, iterative prompt updates shortened the message length. Targeted prompt modifications may improve physician uptake and integration of AI-generated content into their workflow.\u003c/p\u003e\u003cp\u003eDespite promising findings, our results revealed a steady decline in GDR usage rates throughout the study. Although usage declined, provider feedback was generally positive. Surveyed providers reported a sense of time savings and reduced cognitive load and found the tool more helpful than burdensome. The Net Promoter Score of 5.88 supports a somewhat positive reception. Our findings align with those reported in a study by Garcia et al., which noted significant reductions in physician task load and work exhaustion scores with implementation of AI-generated draft replies.[\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e]\u003c/p\u003e\u003cp\u003eThe discrepancy between perceived utility and actual usage highlights the complex nature of technology adoption in healthcare settings. These findings align with an American Medical Association (AMA) survey, in which 65% of physicians viewed AI as advantageous to patient care, yet only 21% actively used it.[\u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e26\u003c/span\u003e] This gap between perceived potential and actual implementation underscores the need for strategies to improve adoption rates and integration into clinical workflows. 
Addressing barriers to adoption, such as enhancing workflow integration, increasing AI literacy, improving tone and readability of genAI text, and personalizing AI outputs, will be critical to bridging this gap and ensuring sustained use of these tools.\u003c/p\u003e\u003cp\u003eFurthermore, evaluating genAI implementations remains challenging due to reliance on human review, which is both resource- and time-intensive.[\u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e27\u003c/span\u003e, \u003cspan citationid=\"CR28\" class=\"CitationRef\"\u003e28\u003c/span\u003e] Our study sought to address this by combining quantitative LLM metrics with qualitative provider evaluations to provide a comprehensive and intuitive framework for comparing GDRs and final responses. While BERTScore is a common proxy for semantic similarity, it has limitations in healthcare-specific contexts, where clinical nuance is critical. Emerging frameworks like \u003cem\u003eMedHELM\u003c/em\u003e and \u003cem\u003ePDSQI-9\u003c/em\u003e emphasize the need for structured, clinician-aligned evaluations that capture clinical relevance, safety, and patient-centered communication. These are areas where LLM metrics alone fall short.[\u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e29\u003c/span\u003e, \u003cspan citationid=\"CR30\" class=\"CitationRef\"\u003e30\u003c/span\u003e] Our use of clinician Likert ratings for assessment of semantic fidelity, risk, and helpfulness aligns with this emerging consensus. These frameworks underscore the growing recognition that human-centered, task-specific evaluations are essential for robust assessment of LLMs in clinical workflows.\u003c/p\u003e\u003cdiv id=\"Sec18\" class=\"Section2\"\u003e\u003ch2\u003eLimitations\u003c/h2\u003e\u003cp\u003eWhile we evaluated this AI tool with quantitative LLM metrics and qualitative surveys, limitations remain. 
Due to technical limitations, we could not evaluate drafts that may have been seen by a provider but discarded, and we were unable to definitively ascertain whether a draft was seen by a provider. This single-center study included only physicians, though many in-basket workflows involve non-physician clinical workers. Thus, findings may not reflect broader team-based use. Future studies should be multi-center and include diverse specialties and roles. Lastly, our study evaluated GPT-4 as implemented in Epic\u0026rsquo;s In Basket Art; future GPT iterations may produce different results and should be reassessed.\u003c/p\u003e\u003c/div\u003e"},{"header":"Conclusion","content":"\u003cp\u003eGPT-generated draft replies (GDRs) demonstrate strong semantic alignment with physician-edited messages and offer a promising foundation for enhancing patient-provider communication. Our findings highlight both the utility and limitations of LLM metrics such as BERTScore, which aligned with physician judgment only when GDRs were actively used. 
However, by combining quantitative metrics with structured human evaluation, this study reflects evolving potential best practices for LLM assessment in healthcare, and provides a pragmatic framework for health systems to assess, monitor, and refine AI-powered patient messaging tools.\u003c/p\u003e"},{"header":"Declarations","content":"\u003ch2\u003eCompeting Interests\u003c/h2\u003e\n\u003cp\u003eAll authors declare no financial or non-financial competing interests.\u003c/p\u003e\n\u003ch2\u003eHuman Ethics and Consent to Participate\u003c/h2\u003e\n\u003cp\u003eHuman Ethics and Consent to Participate declarations: not applicable.\u003c/p\u003e\n\u003ch2\u003eClinical Trial Number\u003c/h2\u003e\n\u003cp\u003eNot Applicable\u003c/p\u003e\n\u003ch2\u003eFunding\u003c/h2\u003e\n\u003cp\u003eThis study received no funding.\u003c/p\u003e\n\u003ch2\u003eAuthor Contribution\u003c/h2\u003e\n\u003cp\u003eT.M., P.J.L., L.P., and G.H. conceived and designed the study. S.S.V. performed the primary statistical analysis, with additional statistical input from L.P. L.P. also conducted the natural language processing analyses, including BERTScore, BLEU, and related metrics. S.M.Y., H.E.W., and J.K. served as provider reviewers for the survey data. All authors contributed to the interpretation of results, revised the manuscript critically for important intellectual content, and approved the final version for submission.\u003c/p\u003e\n\u003ch2\u003eAcknowledgement\u003c/h2\u003e\n\u003cp\u003eThe authors thank the UCLA Health Clinical Informatics team and participating physicians for their support of this project.\u003c/p\u003e\n\u003ch2\u003eData Availability\u003c/h2\u003e\n\u003cp\u003eData supporting the findings of this study are available from UCLA Health but restrictions apply to protect patient privacy. 
De-identified data may be available from the corresponding author upon reasonable request.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eTopol, E. J. High-performance medicine: The convergence of human and artificial intelligence. \u003cem\u003eNat. Med.\u003c/em\u003e 25, 44\u0026ndash;56 (2019).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eYu, K. H., Beam, A. L. \u0026amp; Kohane, I. S. Artificial intelligence in healthcare. \u003cem\u003eNat. Biomed. Eng.\u003c/em\u003e 2, 719\u0026ndash;731 (2018).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eAdler-Milstein, J. \u003cem\u003eet al.\u003c/em\u003e Electronic health record adoption in US hospitals: The emergence of a digital \"advanced use\" divide. \u003cem\u003eJ. Am. Med. Inform. Assoc.\u003c/em\u003e 24, 1142\u0026ndash;1148 (2017).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eShanafelt, T. D., Dyrbye, L. N. \u0026amp; West, C. P. Addressing physician burnout: The way forward. \u003cem\u003eJAMA\u003c/em\u003e 317, 901\u0026ndash;902 (2017).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eSinsky, C. \u003cem\u003eet al.\u003c/em\u003e Allocation of physician time in ambulatory practice: A time and motion study in 4 specialties. \u003cem\u003eAnn. Intern. Med.\u003c/em\u003e 165, 753\u0026ndash;760 (2016).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eMcPeek-Hinz, E. \u003cem\u003eet al.\u003c/em\u003e Clinician burnout associated with sex, clinician type, work culture, and use of electronic health records. \u003cem\u003eJAMA Netw. Open\u003c/em\u003e 4, e2114066 (2021).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eRotenstein, L. S. \u003cem\u003eet al.\u003c/em\u003e Differences in clinician electronic health record use across adult and pediatric primary care specialties. \u003cem\u003eJAMA Netw. 
Open\u003c/em\u003e 4, e2117244 (2021).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eAkbar, F. \u003cem\u003eet al.\u003c/em\u003e Physicians\u0026rsquo; electronic inbox work patterns and factors associated with high inbox work duration. \u003cem\u003eJ. Am. Med. Inform. Assoc.\u003c/em\u003e 28, 923\u0026ndash;930 (2021).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eNath, B. \u003cem\u003eet al.\u003c/em\u003e Trends in electronic health record inbox messaging during the COVID-19 pandemic in an ambulatory practice network in New England. \u003cem\u003eJAMA Netw. Open\u003c/em\u003e 4, e2131490 (2021).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eMaddox, T. M., Rumsfeld, J. S. \u0026amp; Payne, P. R. O. Questions for artificial intelligence in health care. \u003cem\u003eJAMA\u003c/em\u003e 321, 31\u0026ndash;32 (2019).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eDavenport, T. \u0026amp; Kalakota, R. The potential for artificial intelligence in healthcare. \u003cem\u003eFuture Healthc. J.\u003c/em\u003e 6, 94\u0026ndash;98 (2019).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eDiane, A., Gencarelli, P. J. Jr, Lee, J. M. Jr \u0026amp; Mittal, R. Utilizing ChatGPT to streamline the generation of prior authorization letters and enhance clerical workflow in orthopedic surgery practice: A case report. \u003cem\u003eCureus\u003c/em\u003e 15, e49680 (2023).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eBrin, D. \u003cem\u003eet al.\u003c/em\u003e Comparing ChatGPT and GPT-4 performance in USMLE soft skill assessments. \u003cem\u003eSci. Rep.\u003c/em\u003e 13, 16492 (2023).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eG\u0026uuml;nay, S., \u0026Ouml;zt\u0026uuml;rk, A., \u0026Ouml;zerol, H. \u003cem\u003eet al.\u003c/em\u003e Comparison of emergency medicine specialist, cardiologist, and ChatGPT in electrocardiography assessment. \u003cem\u003eAm. J. Emerg. 
Med.\u003c/em\u003e 80, 51\u0026ndash;60 (2024).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eAyers, J. W. \u003cem\u003eet al.\u003c/em\u003e Comparing physician and artificial intelligence chatbot responses to patient questions posted to a public social media forum. \u003cem\u003eJAMA Intern. Med.\u003c/em\u003e 183, 589\u0026ndash;596 (2023).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eHowe, P. D. L. \u003cem\u003eet al.\u003c/em\u003e ChatGPT\u0026rsquo;s advice is perceived as better than that of professional advice columnists. \u003cem\u003eFront. Psychol.\u003c/em\u003e 14, 1281255 (2023).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eGarcia, P. \u003cem\u003eet al.\u003c/em\u003e Artificial intelligence-generated draft replies to patient inbox messages. \u003cem\u003eJAMA Netw. Open\u003c/em\u003e 7, e243201 (2024).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eTai-Seale, M. \u003cem\u003eet al.\u003c/em\u003e AI-generated draft replies integrated into health records and physicians\u0026rsquo; electronic communication. \u003cem\u003eJAMA Netw. Open\u003c/em\u003e 7, e246565 (2024).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003ePapineni, K., Roukos, S., Ward, T. \u0026amp; Zhu, W.-J. BLEU: A method for automatic evaluation of machine translation. In Proc. 40th Annu. Meet. Assoc. Comput. Linguist. 311\u0026ndash;318 (ACL, 2002).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eLin, C.-Y. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out: Proc. ACL Workshop 74\u0026ndash;81 (ACL, 2004).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eZhang, T., Kishore, V., Wu, F., Weinberger, K. Q. \u0026amp; Artzi, Y. BERTScore: Evaluating text generation with BERT. arXiv preprint arXiv:1904.09675 (2019).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eFlesch, R. A new readability yardstick. \u003cem\u003eJ. Appl. 
Psychol.\u003c/em\u003e 32, 221\u0026ndash;233 (1948).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eAdams, C. \u003cem\u003eet al.\u003c/em\u003e The ultimate question? Evaluating the use of Net Promoter Score in healthcare: A systematic review. \u003cem\u003eHealth Expect.\u003c/em\u003e 25, 2328\u0026ndash;2339 (2022).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eCotugna, N., Vickery, C. E. \u0026amp; Carpenter-Haefele, K. M. Evaluation of literacy level of patient education pages in health-related journals. \u003cem\u003eJ. Community Health\u003c/em\u003e 30, 213\u0026ndash;219 (2005).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eYan, S. \u003cem\u003eet al.\u003c/em\u003e Prompt engineering on leveraging large language models in generating response to InBasket messages. \u003cem\u003eJ. Am. Med. Inform. Assoc.\u003c/em\u003e 31, 2263\u0026ndash;2270 (2024).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eAMA Augmented Intelligence Research. American Medical Association (2023). (Accessed 31 May 2024).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eBedi, S. et al. Testing and evaluation of health care applications of large language models: A systematic review. JAMA preprint (2024). doi:\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1101/2024.04.15.24305869\u003c/span\u003e\u003cspan address=\"10.1101/2024.04.15.24305869\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eAwasthi, R. et al. HumanELY: Human evaluation of LLM yield using a novel web-based evaluation tool. medRxiv preprint (2023). 
doi:\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1101/2023.12.22.23300458v2\u003c/span\u003e\u003cspan address=\"10.1101/2023.12.22.23300458v2\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eGarcia, P. \u003cem\u003eet al.\u003c/em\u003e MedHELM: A human evaluation benchmark for large language models in medicine. \u003cem\u003eNPJ Digit. Med.\u003c/em\u003e 7, 113 (2024).\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eZhang, Y. et al. PDSQI-9: Clinician-aligned evaluation for patient-directed summaries. arXiv preprint arXiv:2505.23802 (2025).\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"},{"header":"Tables","content":"\u003cp\u003e\u003cstrong\u003eTable 1.\u0026nbsp;\u003c/strong\u003eNatural language processing metrics comparing GPT-generated draft replies and final physician messages\u003c/p\u003e\n\u003ctable border=\"0\" cellspacing=\"0\" cellpadding=\"0\" width=\"759\"\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\" style=\"width: 23.9789%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eMetric\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 24.3742%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eBaseline\u003c/strong\u003e\u003cstrong\u003e: Mean (95% CI)\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 25.8235%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eIntervention: Mean (95% CI)\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 25.8235%;\"\u003e\n \u003cp\u003e\u003cstrong\u003ep\u003c/strong\u003e\u003cstrong\u003e-values\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\" style=\"width: 23.9789%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eDraft Token Length\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n 
\u003ctd valign=\"bottom\" style=\"width: 24.3742%;\"\u003e\n \u003cp\u003e29.24 (28.24\u0026ndash;30.23)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 25.8235%;\"\u003e\n \u003cp\u003e32.56 (30.14\u0026ndash;34.98)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 25.8235%;\"\u003e\n \u003cp\u003e0.015\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\" style=\"width: 23.9789%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eFinal Message Token Length\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 24.3742%;\"\u003e\n \u003cp\u003e31.86 (26.20\u0026ndash;37.52)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 25.8235%;\"\u003e\n \u003cp\u003e51.96 (21.97\u0026ndash;81.95)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 25.8235%;\"\u003e\n \u003cp\u003e0.189\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\" style=\"width: 23.9789%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eDraft Tokens Retained (%)\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 24.3742%;\"\u003e\n \u003cp\u003e13.61 (12.58\u0026ndash;14.63)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 25.8235%;\"\u003e\n \u003cp\u003e62.22 (55.97\u0026ndash;68.46)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 25.8235%;\"\u003e\n \u003cp\u003e\u0026lt;0.001\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\" style=\"width: 23.9789%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eCosine Similarity\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 24.3742%;\"\u003e\n \u003cp\u003e0.27 (0.25\u0026ndash;0.28)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 25.8235%;\"\u003e\n 
\u003cp\u003e0.74 (0.71\u0026ndash;0.77)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 25.8235%;\"\u003e\n \u003cp\u003e\u0026lt;0.001\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\" style=\"width: 23.9789%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eBLEU Score\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 24.3742%;\"\u003e\n \u003cp\u003e0.00 (0.00\u0026ndash;0.00)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 25.8235%;\"\u003e\n \u003cp\u003e0.49 (0.43\u0026ndash;0.56)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 25.8235%;\"\u003e\n \u003cp\u003e\u0026lt;0.001\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\" style=\"width: 23.9789%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eROUGE1 Score\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 24.3742%;\"\u003e\n \u003cp\u003e0.17 (0.17\u0026ndash;0.18)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 25.8235%;\"\u003e\n \u003cp\u003e0.62 (0.57\u0026ndash;0.68)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 25.8235%;\"\u003e\n \u003cp\u003e\u0026lt;0.001\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\" style=\"width: 23.9789%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eROUGE2 Score\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 24.3742%;\"\u003e\n \u003cp\u003e0.02 (0.02\u0026ndash;0.03)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 25.8235%;\"\u003e\n \u003cp\u003e0.55 (0.49\u0026ndash;0.61)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 25.8235%;\"\u003e\n \u003cp\u003e\u0026lt;0.001\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n 
\u003ctd valign=\"bottom\" style=\"width: 23.9789%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eROUGE-L Score\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 24.3742%;\"\u003e\n \u003cp\u003e0.12 (0.11\u0026ndash;0.12)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 25.8235%;\"\u003e\n \u003cp\u003e0.60 (0.54\u0026ndash;0.66)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 25.8235%;\"\u003e\n \u003cp\u003e\u0026lt;0.001\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\" style=\"width: 23.9789%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eBERTScore F1\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 24.3742%;\"\u003e\n \u003cp\u003e0.83 (0.83\u0026ndash;0.84)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 25.8235%;\"\u003e\n \u003cp\u003e0.93 (0.91\u0026ndash;0.94)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 25.8235%;\"\u003e\n \u003cp\u003e\u0026lt;0.001\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\" style=\"width: 23.9789%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eFlesch Reading Ease \u0026ndash; Draft\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 24.3742%;\"\u003e\n \u003cp\u003e41.76 (41.23\u0026ndash;42.28)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 25.8235%;\"\u003e\n \u003cp\u003e41.42 (40.85\u0026ndash;42.00)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 25.8235%;\"\u003e\n \u003cp\u003e0.393\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\" style=\"width: 23.9789%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eFlesch Reading Ease \u0026ndash; Final Message\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n 
\u003ctd valign=\"bottom\" style=\"width: 24.3742%;\"\u003e\n \u003cp\u003e46.70 (42.02\u0026ndash;51.37)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 25.8235%;\"\u003e\n \u003cp\u003e45.51 (43.91\u0026ndash;47.10)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 25.8235%;\"\u003e\n \u003cp\u003e0.632\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n\u003c/table\u003e\n\u003cp\u003e\u003cbr\u003e\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eTable 2.\u0026nbsp;\u003c/strong\u003ePhysician survey results stratified by BERTScore tertiles in baseline and intervention groups\u003c/p\u003e\n\u003ctable border=\"1\" cellspacing=\"3\" cellpadding=\"0\"\u003e\n \u003cthead\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003eMetric\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\"\u003e\n \u003cp\u003e\u003cstrong\u003eScale\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003eBaseline Group\u003c/strong\u003e\u003cstrong\u003e\u0026nbsp;(Mean [95% CI])\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003eIntervention Group\u003c/strong\u003e\u003cstrong\u003e\u0026nbsp;(Mean [95% CI])\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/thead\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003eHow well GPT answered the question\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\"\u003e\n \u003cp\u003e1 = Very ineffective, 5 = Very effective\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e2.97 (2.69\u0026ndash;3.24) (Neutral)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003e3.95\u003c/strong\u003e\u003cstrong\u003e\u0026nbsp;\u003c/strong\u003e(95% 
CI\u003cstrong\u003e:\u0026nbsp;\u003c/strong\u003e\u003cstrong\u003e3.75\u0026ndash;4.15\u003c/strong\u003e) \u003cem\u003e(Effective)\u003c/em\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003ePotential for misunderstanding\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\"\u003e\n \u003cp\u003e1 = No potential for misunderstanding, 5 = Very high potential for misunderstanding\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e2.25 (2.02\u0026ndash;2.48) (Low to Moderate)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u0026nbsp;\u003cstrong\u003e1.45\u003c/strong\u003e (95% CI: \u003cstrong\u003e1.31\u0026ndash;1.58\u003c/strong\u003e) \u003cem\u003e(Very Low)\u003c/em\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003ePotential for harm\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\"\u003e\n \u003cp\u003e1 = No potential for harm, 5 = Very high potential for harm\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e1.43 (1.29\u0026ndash;1.58) (Very Low)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003e1.21\u003c/strong\u003e (95% CI: \u003cstrong\u003e1.11\u0026ndash;1.30\u003c/strong\u003e) \u003cem\u003e(Very Low)\u003c/em\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003ePreference for GPT or doctor\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\"\u003e\n \u003cp\u003e1 = Strongly prefer GPT-generated messages, 5 = Strongly prefer doctor\u0026rsquo;s final messages\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e4.24 (4.03\u0026ndash;4.45) (Strong Preference for Doctor)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n 
\u003cp\u003e\u003cstrong\u003e3.99\u003c/strong\u003e (95% CI: \u003cstrong\u003e3.81\u0026ndash;4.17\u003c/strong\u003e) \u003cem\u003e(Slight Preference for Doctor)\u003c/em\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n\u003c/table\u003e\n\u003cp\u003e\u003cbr\u003e\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eTable 3.\u0026nbsp;\u003c/strong\u003eNatural language processing metrics and message length across intervention prompt refinement phases (Phases 1\u0026ndash;4)\u003c/p\u003e\n\u003ctable border=\"0\" cellspacing=\"0\" cellpadding=\"0\" width=\"759\" class=\"fr-table-selection-hover\"\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\" style=\"width: 168px;\"\u003e\n \u003cp\u003e\u003cstrong\u003eMetric\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 126px;\"\u003e\n \u003cp\u003e\u003cstrong\u003ePhase 1: Mean (95% CI)\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 126px;\"\u003e\n \u003cp\u003e\u003cstrong\u003ePhase 2: Mean (95% CI)\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 120px;\"\u003e\n \u003cp\u003e\u003cstrong\u003ePhase 3: Mean (95% CI)\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 132px;\"\u003e\n \u003cp\u003e\u003cstrong\u003ePhase 4: Mean (95% CI)\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 87px;\"\u003e\n \u003cp\u003e\u003cstrong\u003ep-values\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\" style=\"width: 168px;\"\u003e\n \u003cp\u003e\u003cstrong\u003eDraft Token Length\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 126px;\"\u003e\n \u003cp\u003e52.60 (48.94\u0026ndash;56.26)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd 
valign=\"bottom\" style=\"width: 126px;\"\u003e\n \u003cp\u003e32.84 (30.93\u0026ndash;34.75)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 120px;\"\u003e\n \u003cp\u003e29.87 (29.25\u0026ndash;30.49)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 132px;\"\u003e\n \u003cp\u003e29.46 (28.13\u0026ndash;30.78)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 87px;\"\u003e\n \u003cp\u003e\u0026lt;0.001\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\" style=\"width: 168px;\"\u003e\n \u003cp\u003e\u003cstrong\u003eFinal Message Token Length\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 126px;\"\u003e\n \u003cp\u003e71.06 (23.90\u0026ndash;118.22)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 126px;\"\u003e\n \u003cp\u003e41.30 (23.35\u0026ndash;59.25)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 120px;\"\u003e\n \u003cp\u003e53.02 (20.75\u0026ndash;85.28)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 132px;\"\u003e\n \u003cp\u003e48.68 (17.86\u0026ndash;79.50)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 87px;\"\u003e\n \u003cp\u003e\u0026lt;0.001\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\" style=\"width: 168px;\"\u003e\n \u003cp\u003e\u003cstrong\u003eDraft Tokens Retained (%)\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 126px;\"\u003e\n \u003cp\u003e67.64 (59.38\u0026ndash;75.89)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 126px;\"\u003e\n \u003cp\u003e64.59 (57.43\u0026ndash;71.74)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 120px;\"\u003e\n \u003cp\u003e61.35 (53.11\u0026ndash;69.59)\u003c/p\u003e\n 
\u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 132px;\"\u003e\n \u003cp\u003e60.25 (54.45\u0026ndash;66.06)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 87px;\"\u003e\n \u003cp\u003e0.026\u003c/p\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\" style=\"width: 168px;\"\u003e\n \u003cp\u003e\u003cstrong\u003eCosine Similarity\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 126px;\"\u003e\n \u003cp\u003e0.71 (0.65\u0026ndash;0.77)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 126px;\"\u003e\n \u003cp\u003e0.75 (0.71\u0026ndash;0.80)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 120px;\"\u003e\n \u003cp\u003e0.75 (0.72\u0026ndash;0.79)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 132px;\"\u003e\n \u003cp\u003e0.73 (0.71\u0026ndash;0.75)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 87px;\"\u003e\n \u003cp\u003e0.038\u003c/p\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\" style=\"width: 168px;\"\u003e\n \u003cp\u003e\u003cstrong\u003eBLEU Score\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 126px;\"\u003e\n \u003cp\u003e0.48 (0.39\u0026ndash;0.57)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 126px;\"\u003e\n \u003cp\u003e0.51 (0.44\u0026ndash;0.59)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 120px;\"\u003e\n \u003cp\u003e0.49 (0.41\u0026ndash;0.58)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 132px;\"\u003e\n \u003cp\u003e0.48 (0.42\u0026ndash;0.54)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 87px;\"\u003e\n 
\u003cp\u003e0.982\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\" style=\"width: 168px;\"\u003e\n \u003cp\u003e\u003cstrong\u003eROUGE1 Score\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 126px;\"\u003e\n \u003cp\u003e0.62 (0.54\u0026ndash;0.70)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 126px;\"\u003e\n \u003cp\u003e0.64 (0.58\u0026ndash;0.70)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 120px;\"\u003e\n \u003cp\u003e0.63 (0.56\u0026ndash;0.69)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 132px;\"\u003e\n \u003cp\u003e0.61 (0.56\u0026ndash;0.66)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 87px;\"\u003e\n \u003cp\u003e0.651\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\" style=\"width: 168px;\"\u003e\n \u003cp\u003e\u003cstrong\u003eROUGE2 Score\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 126px;\"\u003e\n \u003cp\u003e0.54 (0.47\u0026ndash;0.62)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 126px;\"\u003e\n \u003cp\u003e0.57 (0.50\u0026ndash;0.64)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 120px;\"\u003e\n \u003cp\u003e0.56 (0.48\u0026ndash;0.64)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 132px;\"\u003e\n \u003cp\u003e0.54 (0.48\u0026ndash;0.60)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 87px;\"\u003e\n \u003cp\u003e0.911\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\" style=\"width: 168px;\"\u003e\n \u003cp\u003e\u003cstrong\u003eROUGE-L Score\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 126px;\"\u003e\n \u003cp\u003e0.59 
(0.51\u0026ndash;0.67)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 126px;\"\u003e\n \u003cp\u003e0.61 (0.55\u0026ndash;0.68)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 120px;\"\u003e\n \u003cp\u003e0.60 (0.53\u0026ndash;0.67)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 132px;\"\u003e\n \u003cp\u003e0.58 (0.53\u0026ndash;0.64)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 87px;\"\u003e\n \u003cp\u003e0.916\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\" style=\"width: 168px;\"\u003e\n \u003cp\u003e\u003cstrong\u003eBERTScore F1\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 126px;\"\u003e\n \u003cp\u003e0.92 (0.91\u0026ndash;0.94)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 126px;\"\u003e\n \u003cp\u003e0.93 (0.92\u0026ndash;0.94)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 120px;\"\u003e\n \u003cp\u003e0.93 (0.91\u0026ndash;0.94)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 132px;\"\u003e\n \u003cp\u003e0.92 (0.91\u0026ndash;0.93)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 87px;\"\u003e\n \u003cp\u003e0.979\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\" style=\"width: 168px;\"\u003e\n \u003cp\u003e\u003cstrong\u003eFlesch-Kincaid Reading Ease \u0026ndash; Draft\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 126px;\"\u003e\n \u003cp\u003e41.36 (39.33\u0026ndash;43.39)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 126px;\"\u003e\n \u003cp\u003e40.92 (39.37\u0026ndash;42.47)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 120px;\"\u003e\n \u003cp\u003e41.77 
(41.22\u0026ndash;42.31)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 132px;\"\u003e\n \u003cp\u003e41.10 (40.36\u0026ndash;41.85)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 87px;\"\u003e\n \u003cp\u003e0.797\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"bottom\" style=\"width: 168px;\"\u003e\n \u003cp\u003e\u003cstrong\u003eFlesch-Kincaid Reading Ease \u0026ndash; Final\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 126px;\"\u003e\n \u003cp\u003e44.10 (40.26\u0026ndash;47.93)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 126px;\"\u003e\n \u003cp\u003e45.14 (43.83\u0026ndash;46.45)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 120px;\"\u003e\n \u003cp\u003e46.03 (44.22\u0026ndash;47.84)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 132px;\"\u003e\n \u003cp\u003e45.30 (43.04\u0026ndash;47.56)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"bottom\" style=\"width: 87px;\"\u003e\n \u003cp\u003e0.555\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n\u003c/table\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":false,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"
[email protected]","identity":"npj-digital-medicine","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"npjdigitalmed","sideBox":"Learn more about [npj Digital Medicine](http://www.nature.com/npjdigitalmed/)","snPcode":"41746","submissionUrl":"https://submission.springernature.com/new-submission/41746/3","title":"npj Digital Medicine","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"stoa","reportingPortfolio":"NPJ","inReviewEnabled":true,"inReviewRevisionsEnabled":true},"keywords":"Generative AI, Electronic health records, Clinical communication, Natural language processing","lastPublishedDoi":"10.21203/rs.3.rs-7909272/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-7909272/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003ch2\u003eBackground\u003c/h2\u003e\u003cp\u003eThe use of generative artificial intelligence (genAI) in healthcare is increasing, including the use of GPT-generated draft replies (GDRs) to patient messages via Epic Systems\u0026rsquo; electronic health record (EHR). We evaluated GDR use, quality, and impact in a large academic health system.\u003c/p\u003e\u003ch2\u003eMethods\u003c/h2\u003e\u003cp\u003eThirty primary care physicians received GDRs from September 2023 to August 2024 during a staged rollout. Messages were grouped into baseline (GDRs not shown) and intervention (GDRs used). We evaluated messages using BLEU, ROUGE, cosine similarity, BERTScore, token counts and Flesch Reading Ease. We compared baseline and intervention groups, and across prompt refinement phases (Phases 2\u0026ndash;4 vs. Phase 1). Blinded evaluations of message quality were conducted via surveys, and BERTScores were correlated with physician evaluations on effectiveness, misunderstanding, and harm.\u003c/p\u003e\u003ch2\u003eResults\u003c/h2\u003e\u003cp\u003eOf 66,200 GDRs generated, 21,073 were presented, and 2,264 (11%) were used. 
Used GDRs showed alignment with final messages [(BLEU 0.49 (95% CI: 0.43\u0026ndash;0.56), ROUGE-L 0.60 (0.54\u0026ndash;0.66)], with high BERTScores (F1\u0026thinsp;\u0026gt;\u0026thinsp;0.9). Final messages were longer and more readable. Prompt refinements increased token retention. GDR usage declined over time, yet providers reported time savings and reduced cognitive load. BERTScores correlated strongly with physician feedback on effectiveness and safety in the intervention group.\u003c/p\u003e\u003ch2\u003eConclusions\u003c/h2\u003e\u003cp\u003eGPT-generated drafts show strong semantic alignment with physician messages and may support efficient communication. However, usage trends and readability challenges underscore the need for improved prompt design and better workflow integration. Quantitative metrics like BERTScore, when paired with physician feedback, offer a scalable framework for evaluating AI-assisted messaging in healthcare.\u003c/p\u003e","manuscriptTitle":"Poetic or Prosaic? 
Evaluating the Linguistic Quality of AI-Generated Draft Replies to Patient Portal Messages","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2025-12-11 14:09:44","doi":"10.21203/rs.3.rs-7909272/v1","editorialEvents":[{"type":"communityComments","content":0},{"type":"decision","content":"Revision requested","date":"2025-12-21T18:10:05+00:00","index":"","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2025-12-19T22:16:32+00:00","index":"hide","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2025-12-11T20:41:01+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"316038610932037264818084908926068111991","date":"2025-12-10T15:44:10+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"138307134167382190688520884615756247697","date":"2025-12-08T18:18:18+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"66316785492980871118077586858975031951","date":"2025-12-08T14:53:17+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"259040898578856862872351764564736057313","date":"2025-12-08T14:53:09+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"295115372631635152650856242955911825098","date":"2025-12-08T14:37:01+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"118403349453656077802342408115942611297","date":"2025-12-08T14:16:48+00:00","index":"hide","fulltext":""},{"type":"reviewersInvited","content":"","date":"2025-12-08T14:09:17+00:00","index":"","fulltext":""},{"type":"editorAssigned","content":"","date":"2025-10-28T14:16:12+00:00","index":"","fulltext":""},{"type":"checksComplete","content":"","date":"2025-10-28T04:56:31+00:00","index":"","fulltext":""},{"type":"submitted","content":"npj Digital Medicine","date":"2025-10-21T00:47:25+00:00","index":"","fulltext":""}],"status":"published","journal":{"display":true,"email":"
[email protected]","identity":"npj-digital-medicine","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"npjdigitalmed","sideBox":"Learn more about [npj Digital Medicine](http://www.nature.com/npjdigitalmed/)","snPcode":"41746","submissionUrl":"https://submission.springernature.com/new-submission/41746/3","title":"npj Digital Medicine","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"stoa","reportingPortfolio":"NPJ","inReviewEnabled":true,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"4e3a0696-e017-4178-b740-ff4cad941be2","owner":[],"postedDate":"December 11th, 2025","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"under-review","subjectAreas":[{"id":59441907,"name":"Biological sciences/Computational biology and bioinformatics"},{"id":59441908,"name":"Health sciences/Health care"},{"id":59441909,"name":"Physical sciences/Mathematics and computing"},{"id":59441910,"name":"Health sciences/Medical research"}],"tags":[],"updatedAt":"2026-05-01T17:08:42+00:00","versionOfRecord":[],"versionCreatedAt":"2025-12-11 14:09:44","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-7909272","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-7909272","identity":"rs-7909272","version":["v1"]},"buildId":"8U1c8b4HqxoKbykW_rLl7","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}