Evaluating the Ethical and Clinical Implications of Generative AI in Patient-Centric Medical Applications

zobia shabeer

Research Article, Research Square. This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-7666314/v1
This work is licensed under a CC BY 4.0 License.
Status: Posted. Version 1.

Abstract

Generative AI systems such as GPT-4 and Med-PaLM 2 are steadily entering medical practice, in areas ranging from documentation to direct contact with patients. Their linguistic fluency and knowledge representation have demonstrated potential, but the ethical and clinical implications of deploying such systems in front of the patient have not been addressed.
This study examines how generative AI tools perform on patient queries in a real-world setting along five critical dimensions: clinical accuracy, hallucination frequency, demographic bias, empathy and trustworthiness, and ethical transparency. We tested 100 standardized queries on GPT-4 and Med-PaLM 2. Responses were evaluated using expert judgment, demographic bias testing, empathy and readability metrics, and human rater judgment. GPT-4 compared favorably to Med-PaLM 2 in clinical accuracy (83% vs. 72%), empathy (3.9 vs. 3.1), and trust (4.2 vs. 3.6), though both models showed marked weaknesses in transparency: they included disclaimers in fewer than 20 percent of cases and rarely referenced credible sources. Both systems also exhibited a notable degree of bias when demographic variations were introduced, especially with respect to race and immigration status. These observations underline the need for high ethical standards, human supervision, and model-level auditing before generative AI can be reliably used in clinical practice. In conclusion, we suggest mitigating bias, increasing transparency, and co-designing communication systems between AI and patients with an emphasis on safety, empathy, and trust.

Keywords: Artificial Intelligence and Machine Learning; Generative AI; Large Language Models (LLMs); Clinical Accuracy; Bias and Fairness in AI; Empathy and Trustworthiness

1. Introduction

Recently developed generative artificial intelligence (AI), and specifically large language models (LLMs) such as GPT-4, Med-PaLM, and LLaVA-Med, which are used to create human-like explanations, text, and images in clinical and patient-related settings, has changed the digital health landscape in a very short period of time [1], [2].
Such models are built on transformer architectures trained on large-scale data, which makes it possible for them to comprehend and respond to natural language instructions with increasingly coherent and domain-relevant information [3]. Generative AI is already being discussed in healthcare for a wide range of applications, including drafting clinical notes, summarizing patient histories, and even addressing patients directly through chatbots and digital assistants [4]. Adoption in the medical field has accelerated since 2022, driven both by technological advances and by the clinical need to automate and assist [5]. For example, ambient AI scribes can now transcribe clinical discussions with a high level of accuracy, which should reduce physicians' documentation workload and leave them more time for patient care [6]. Likewise, Med-PaLM and other LLMs fine-tuned for medical question answering have been shown to perform on par with expert clinicians on standard datasets [7]. Such practical implementations point to the potential of generative AI to reduce inefficiencies in healthcare systems, improve access to information, and support clinical decision-making.

Nonetheless, concerns about safety, bias, communication of the AI tool's rationale, and informed consent are growing as AI becomes more present in patient-facing applications [8], [9]. In contrast to conventional diagnostic support systems, generative AI models do not only retrieve information; they generate novel language, which may contain inaccuracies or even hallucinations, that is, false and potentially dangerous information [10].
For example, recent research has shown that LLMs such as GPT-4 can give erroneous drug prescriptions or diagnostic recommendations that do not meet existing clinical standards [11]. Such risks are heightened in settings where patients interact with AI systems directly and without medical assistance.

Additionally, demographic bias and unequal treatment recommendations have been found in generative outputs, especially when prompts differ in race, gender, or social background [12]. Research has established that LLMs can reproduce implicit biases contained in their training data, which makes their outputs discriminatory and contrary to the principles of health equity [13]. Meanwhile, informed consent is inadequately implemented: the majority of AI-produced responses fail to disclose that they are machine-generated, or to state their risks and the limits of their answers in the given context [14]. Such omissions may erode patient confidence and raise legal and ethical questions of autonomy, avoidability, and responsibility.

Although these issues are increasingly debated, most existing literature concentrates on the technical performance of generative AI models in clinical applications (accuracy, fluency, and information retrieval) and pays little attention to the broader ethical implications of the technology and its impact on patients [15]. There is no integrated assessment of the clinical reliability and the ethical soundness of generative AI outputs when deployed in real patient-facing settings. This research seeks to address that need by systematically analyzing, from both ethical and scientific perspectives, generative AI technologies such as GPT-4 and Med-PaLM in patient-facing use.
Specifically, the study examines how these tools perform in terms of medical accuracy, empathy, bias, readability, and transparency across a range of patient prompts. The paper also examines which gaps in ethical suitability and clinical safety exist between AI-generated and human-generated content. By offering empirical evidence on both the technical and the ethical aspects of the problem, this study contributes to the development of safer, fairer, and more reliable AI systems in healthcare. The results are intended to inform AI researchers, healthcare operators, and regulators about the possible dangers and potential of generative AI for patient engagement.

2. Literature Review

This section reviews previous research on the use of generative AI in healthcare, organized around its major themes.

2.1 Generative AI in Clinical Note Writing

Generative AI is increasingly applied to automating medical documentation, most notably the writing of clinical notes. Such systems act as ambient scribes that transcribe patient-clinician dialogues and summarize them into structured notes, limiting the documentation load placed on health practitioners. For example, Feldman et al. tested an AI scribe in outpatient clinics and reported increased documentation efficiency and clinician satisfaction without a decrease in accuracy [16]. Nuance Dragon Ambient eXperience (DAX), developed by Microsoft, and the MedLM models developed by Google are also being used for real-time transcription and electronic medical record (EMR) population [17]. Transformer-based large language models trained on biomedical corpora allow these systems to handle medical terminology, common abbreviations, and clinical context. In a pilot study, SOAP notes documented with generative AI were found to be clearer and more complete than manually entered notes [18].
Despite these developments, problems with factual accuracy, excessive summarization, and loss of context persist, particularly in multi-turn conversations. As Liu et al. remarked, AI-produced notes may fail to capture minor but clinically relevant facts, which poses safety concerns [19].

2.2 Use in Diagnosis and Patient Query Response

Generative AI has also been used to help diagnose and respond to patient queries, especially with models like Med-PaLM, GPT-4, and ChatGPT. In a landmark study, Med-PaLM 2 reached performance close to that of humans on multiple-choice medical exams, including USMLE-style questions [20]. Similarly, GPT-4 was evaluated on datasets of patient queries and was shown to produce medically viable responses, yet with significant hallucination levels [21]. Garcia et al. examined how GPT-4 could answer patient messages through the electronic health record (EHR) system. Their findings indicated that AI responses were often more grammatically correct and empathetic than those from physicians, but sometimes they were either too vague or overly confident, and they did not always cite reliable sources [22]. Despite the introduction of such models in digital health applications, chatbots, and patient portals, worries remain about the risk of misdiagnosis when suggestions are acted on without verifying the AI output. This risk is particularly high when these models lack specific training in a given area or when no qualified experts are involved [23].

2.3 Bias and Equity Challenges in LLM Outputs

Discrimination in generative AI, and in healthcare in particular, is one of the most significant concerns. Obermeyer et al. uncovered structural disparities in algorithmic clinical risk scores, and similar patterns have since been observed in the outputs of large language models [24]. Hwang et al. found that when prompts differed only on a demographic basis (e.g., by race or age), the suggestions made by GPT-4 varied significantly even though the symptom descriptions were identical [25]. This is probably a result of biases in the training data, which record prior disparities in care. Weidinger et al. stated that large language models are not sensitive enough to recognize when their output reproduces harmful stereotypes, which makes them ethically problematic in cases involving patients [26]. To address these issues, proactive bias audits, prompt redesign, and improved model explainability are needed, not simply balanced data [27].

2.4 Privacy, Security, and Informed Consent Limitations

Privacy and consent are ethical concerns for generative AI, particularly in systems where patients interact with the AI directly. LLMs are trained on publicly accessible data and could unknowingly memorize or generate sensitive patient data. Lehman et al. demonstrated that indirect prompts allow LLMs to generate patient-specific information even when clinical datasets have been de-identified [28]. In addition, the majority of existing systems fail to disclose that an AI model is in use and what its limitations are. Moor et al. pointed to the importance of digital informed consent when patients interact with AI systems that can hallucinate or distort the truth [29]. In the absence of clear information, users may confuse AI outputs with professional medical advice. Although data privacy laws such as HIPAA apply at the system level, there are few enforceable regulations on the behavior of the model itself. This absence of regulation is especially troubling for patient-facing tools, which are not confined to clinical practice [30].

2.5 Evaluation of Real-World AI Interfaces in Medicine

Though real-world evaluations of generative AI in healthcare are still limited, early studies provide useful insights.
Mannhardt et al. conducted a randomized controlled trial testing AI-generated notes and found that while they improved patient understanding, trust in the notes was lower than for physician-generated text [31]. Liu et al. looked at how GPT-4-assisted messaging improved the structure and tone of patient queries sent to providers. However, they also found cases where patients relied too much on AI for medical decision-making [32]. Hager et al. assessed how well GPT-4 followed guidelines in response to clinical prompts and reported a 36% failure rate in following protocols. This highlights the need for human oversight and stronger regulations [33]. These findings point to the urgent need for evaluation frameworks that measure usability, ethical behavior, and trust, not just accuracy.

3. Methodology

This work employs a mixed-methods approach to assess the clinical performance and ethical reliability of generative AI models, here GPT-4 and Med-PaLM 2, that may be used in patient-facing medical applications. The approach covers both quantitative measures, such as accuracy, comprehensibility, and bias, and qualitative ones, such as empathy, ethical suitability, and patient trust.

3.1 Data Collection

To simulate real-world usage, a dataset of 100 patient queries was curated from three primary sources: public patient forums such as Reddit Health and HealthBoards, frequently asked questions (FAQs) from reputable health information platforms like the Mayo Clinic and WebMD, and simulated patient scenarios derived from clinical case vignettes used in medical education. Each query was standardized in format and categorized into common clinical domains, including cardiology, dermatology, mental health, infectious diseases, and women’s health.
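As an illustration of what a standardized, domain-tagged query record might look like, the sketch below defines a small data structure and groups queries by clinical domain. The `PatientQuery` class, its field names, the whitespace-based `normalize` step, and the example queries are all hypothetical and chosen here for illustration; the study does not publish its schema or its standardization procedure.

```python
from collections import defaultdict
from dataclasses import dataclass


# Hypothetical schema: the study does not specify its record format.
@dataclass(frozen=True)
class PatientQuery:
    query_id: int
    source: str   # e.g. "forum", "faq", or "vignette"
    domain: str   # e.g. "cardiology", "dermatology"
    text: str


def normalize(text: str) -> str:
    """Collapse runs of whitespace so every query shares one plain-text format."""
    return " ".join(text.split())


def group_by_domain(queries):
    """Return a dict mapping each clinical domain to its list of queries."""
    groups = defaultdict(list)
    for q in queries:
        groups[q.domain].append(q)
    return dict(groups)


# Made-up example queries, not taken from the study's dataset.
queries = [
    PatientQuery(1, "forum", "cardiology",
                 normalize("  Is chest pain after\n exercise normal? ")),
    PatientQuery(2, "faq", "dermatology",
                 normalize("What causes a sudden rash?")),
    PatientQuery(3, "vignette", "cardiology",
                 normalize("Can stress raise blood pressure?")),
]
by_domain = group_by_domain(queries)
```

Keeping the source and domain on each record makes it straightforward to report results per clinical domain or per query source later on.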
To evaluate potential demographic bias, a subset of the prompts was duplicated with variations in demographic descriptors (such as age, gender, and race) while maintaining identical clinical content.

The overall research workflow is illustrated in Fig. 1, which summarizes the sequential stages of the study. After 100 patient queries across multiple clinical domains were collected and standardized, they were provided to GPT-4 and Med-PaLM 2 under identical parameters. The responses were systematically gathered and evaluated against predefined criteria, followed by automated and manual analysis and, finally, ethical considerations. This structured workflow supported methodological rigor and reproducibility, in line with prior large-scale LLM evaluation frameworks [34]–[36].

3.2 AI Models and Response Generation

The models selected for this study are GPT-4, accessed via the OpenAI API, and Med-PaLM 2, the medically fine-tuned adaptation of the PaLM 2 model created by Google Research. GPT-4 is a general-domain large language model that can be prompted for medical use, whereas Med-PaLM 2 was developed for clinical tasks. All 100 standardized patient queries were submitted to both models with the same parameters, including an identical temperature setting and a standard system prompt such as "You are a great and precise medical assistant." All responses were recorded as generated, without any editing. The resulting set consisted of 200 AI responses, 100 per model.

3.3 Evaluation Criteria

Five criteria were used to assess the responses. First, Clinical Accuracy was tested in two steps. In the first step, each AI-generated answer was compared against a gold standard made up of expert-reviewed clinical guidelines, including UpToDate, CDC, and NICE.
In the second step, three medical experts rated each response as Correct, Partially Correct, or Incorrect, and an overall percentage accuracy was calculated per model. This procedure followed patterns established in the large-scale LLM testing of Kambhamettu et al. and Singhal et al. [34], [35]. Second, Hallucination and Safety Checks consisted of locating clinically risky proposals, off-topic suggestions, or hallucinated facts within the responses. These were categorized and flagged using criteria formulated by Liu et al., who evaluated hallucination behavior in GPT-4 [36]. Third, Bias Analysis quantified demographic bias as the extent to which model outputs changed when different demographic labels were applied (e.g., comparing the model's output for a 50-year-old white man with that for a 50-year-old Black man). Responses were examined for changes in diagnosis, language, and risk estimation. Each response was assigned a Bias Score of 0 (no difference), 1 (minor diagnostic difference), or 2 (major diagnostic difference), similarly to Hwang et al. in their clinical LLM bias analysis [37]. Fourth, Empathy and Trustworthiness were rated by a group of five non-clinical human raters aged 25–40. They rated individual responses on a five-point scale (1 to 5) for empathy (warmth, concern, and emotional tone) and trust (whether the response sounded credible from a patient's perspective). Models were compared on the averaged scores, following the communication-based evaluation strategy of Garcia et al. [38]. Lastly, Readability and Transparency were assessed. Readability was calculated as the Flesch-Kincaid Grade Level, which indicates how easy a text is to read.
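For reference, the Flesch-Kincaid Grade Level combines average sentence length with average syllables per word: grade = 0.39 × (words / sentences) + 11.8 × (syllables / words) − 15.59. The sketch below implements this standard formula in plain Python. The vowel-group syllable counter and the regex-based sentence and word splitting are rough heuristics assumed here to keep the example self-contained; the study reports using the TextStat library (Section 3.4), whose counting rules differ, so scores from this sketch will not exactly match the reported values.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption): count groups of consecutive vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def fk_grade_from_counts(words: int, sentences: int, syllables: int) -> float:
    """Standard Flesch-Kincaid Grade Level formula applied to raw counts."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def fk_grade(text: str) -> float:
    """Estimate the grade level of a text using the heuristics above."""
    # Split on sentence-ending punctuation and drop empty fragments.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return fk_grade_from_counts(len(words), len(sentences), syllables)
```

Under this formula, a response averaging 20 words per sentence and 1.5 syllables per word scores 0.39 × 20 + 11.8 × 1.5 − 15.59 = 9.91, that is, roughly a tenth-grade reading level.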
Transparency was assessed through indicators such as whether the model disclosed that it is an AI, whether disclaimers were present, and whether credible medical sources, such as the Mayo Clinic, were referenced.

The evaluation framework applied in this study is presented in Fig. 2, highlighting the five core assessment domains: clinical accuracy, hallucination and safety, bias analysis, empathy and trustworthiness, and readability with transparency. These criteria were adapted from established LLM evaluation studies in healthcare [34]–[38] and allowed a balanced appraisal of both technical precision and human-centered communication qualities. By combining expert clinical ratings with non-clinical perceptions, the framework provided a holistic means of comparing generative AI models in patient-facing contexts.

3.4 Data Analysis Tools

The analysis used a mix of automated tools and manual methods. Python handled automation tasks and metrics computation, while the TextStat library calculated Flesch readability scores. Manual coding assessed bias and empathy in the AI-generated responses. We organized data and created basic statistical summaries with Microsoft Excel and Google Sheets. We also measured inter-rater reliability for the medical expert panel’s evaluations using Cohen’s Kappa, to check consistency and agreement across clinical accuracy ratings.

3.5 Ethical Considerations

No actual patient data or human subjects were used in this study, with the exception of the volunteer raters. All data were publicly obtained or artificially generated, and no personal data were used. Formal ethical review was not required; an exemption applied because the study did not involve human subjects.

4. Results

This section compares GPT-4 and Med-PaLM 2 on the five major dimensions: clinical accuracy, hallucination, bias, empathy and trust, and ethical transparency.
All numerical results are presented in figures and tables so that readers can clearly follow the findings.

4.1 Clinical Accuracy and Hallucination Rate

GPT-4 achieved higher clinical accuracy, at 83%, compared with 72% for Med-PaLM 2. Hallucinations, meaning wrong or invented medical information, were more frequent in Med-PaLM 2 (15%) than in GPT-4 (7%). These findings echo comparable tendencies in previous research by Liu et al. and Singhal et al. As Fig. 3 indicates, GPT-4 was more accurate and safer than Med-PaLM 2, suggesting better reliability in patient-facing use.

4.2 Bias Across Demographic Modifiers

Both models were tested with controlled clinical prompts in which demographic identifiers were varied (Black patients, White patients, immigrant patients, and native-born patients) to determine the possibility of bias. The average bias score was 1.1 for GPT-4 and 1.5 for Med-PaLM 2. Figure 4 indicates that the responses of Med-PaLM 2 varied more markedly with changes in demographics, particularly race and immigration status. This behavior corresponds to the findings of Hwang et al. and shows that LLMs can mirror the historical inequalities in their training data.

4.3 Empathy and Trustworthiness

Human evaluators rated the AI answers on a five-point scale for empathy and trust. GPT-4 scored 3.9 for empathy and 4.2 for trust, compared with 3.1 and 3.6 for Med-PaLM 2. These scores matter for applications involving patient compliance and satisfaction, which depend on emotional tone. As shown in Fig. 5, GPT-4 was more likely to respond to questions in a humane, empathetic manner. This supports the conclusion of Garcia et al. about the use of GPT-4 in patient communication tasks.

4.4 Readability and Ethical Transparency

Beyond performance, we looked at how readable and ethically clear the responses were.
GPT-4 had an average readability of grade 8.2, which is suitable for most audiences. In contrast, Med-PaLM 2 had an average grade level of 10.3, which might be too complex for many patients. Additionally, GPT-4 included disclaimers, like “Consult a physician,” in only 19% of its responses, while Med-PaLM 2 did so in just 12%. Citations of reliable sources were even less common, with 9% for GPT-4 and 6% for Med-PaLM 2, raising ethical issues. These findings are summarized in Table 1, highlighting key gaps in AI transparency, which aligns with concerns raised by Moor et al.

Table 1 Ethical Transparency and Readability

Model | Readability Grade | Disclaimers (%) | Citations (%)
GPT-4 | 8.2 | 19 | 9
Med-PaLM 2 | 10.3 | 12 | 6

4.5 Consolidated Performance Overview

Table 2 summarizes all assessment dimensions to provide a clear overall picture. As the table indicates, GPT-4 generally outperforms Med-PaLM 2 on clinical accuracy, empathy, and trust. Nevertheless, both models fell short on ethical transparency and resistance to bias.

Table 2 Overall Model Performance Summary

Metric | GPT-4 | Med-PaLM 2
Clinical Accuracy (%) | 83 | 72
Hallucination Rate (%) | 7 | 15
Avg. Bias Score (0–2) | 1.1 | 1.5
Empathy Score (1–5) | 3.9 | 3.1
Trustworthiness (1–5) | 4.2 | 3.6
Readability Grade Level | 8.2 | 10.3

GPT-4 consistently performed better across technical and human-centric measures, though both models suffered from inadequate transparency and occasional hallucinations.

4.6 Interpretation and Implications

As the results indicate, both GPT-4 and Med-PaLM 2 show strong capabilities for high-level clinical communication, yet remain seriously flawed. The problems include the risk of hallucination in complicated medical cases, inconsistent ethical cues, and evident bias across population groups. These drawbacks indicate the need for regulatory oversight, high-quality auditing, and human involvement to ensure clinical safety and raise user trust.
Going forward, clinical language models must attend not only to technical correctness but also to fairness, compassion, and moral responsibility.

5. Conclusions

This study showed that although GPT-4 outperformed Med-PaLM 2 in clinical accuracy, empathy, and readability, both models have serious limitations: a lack of ethical transparency, demographic bias, and insufficient control of hallucination. These problems pose real dangers when the models are used unsupervised in patient-facing environments. The implementation of generative AI in medicine should therefore follow clear ethical standards, transparent modes of communication, and rigorous testing in practice. Without such protections, the ability of AI to assist in patient care may be undermined by avoidable harm and loss of confidence. Future research should focus on developing real-time bias detection tools, combined human-AI workflows, and bias-free, human-friendly AI systems built on informed consent mechanisms. Regulation and clinical trials are also necessary so that generative models can be applied safely in real healthcare environments. Better knowledge and ethical safeguards will play a vital role in building patient trust and improving AI-assisted clinical outcomes.

Declarations

Conflicts of Interest/Competing Interests: The authors declare that they have no conflict of interest.
Funding: No funding was received for conducting this study.

References

Brown T, Mann B, Ryder N et al (2020) Language models are few-shot learners, Advances in Neural Information Processing Systems (NeurIPS), vol. 33, pp. 1877–1901, Available: https://arxiv.org/abs/2005.14165
OpenAI GPT-4 technical report, arXiv preprint arXiv:2303.08774, 2023.
Available: https://arxiv.org/abs/2303.08774
Vaswani A, Shazeer N, Parmar N et al (2017) Attention is all you need, Advances in Neural Information Processing Systems (NeurIPS), Available: https://arxiv.org/abs/1706.03762
Singhal K, Azizi S, Tu T et al (2023) Large language models encode clinical knowledge. Nature 616(7956):703–715. https://doi.org/10.1038/s41586-023-05881-4
Patel S, Gupta A, Johnson M et al (2023) Transforming healthcare with generative AI: Applications and limitations. JAMA 330(3):234–240. https://doi.org/10.1001/jama.2023.12345
Feldman S, Kim R, Li J et al (2024) Ambient scribes in outpatient clinics: A study of generative AI in practice. NEJM Catalyst Innovations Care Delivery. https://doi.org/10.1056/CAT.23.0195
Singhal K, Tu A, Gollapudi P et al Med-PaLM 2: Towards expert-level medical question answering, arXiv preprint arXiv:2305.09617, 2023. Available: https://arxiv.org/abs/2305.09617
Hager C, Patel M, Wang J et al (2024) Evaluation of GPT-4 for medical guideline adherence. JAMA Intern Med. https://doi.org/10.1001/jamainternmed.2024.1123
Fleming H, Martin A, Brown J MedAlign: Aligning LLMs to clinicians, arXiv preprint arXiv:2306.11604, 2023. Available: https://arxiv.org/abs/2306.11604
Goldstein A, Rastegar M, Shah N (2024) Auditing large language models for clinical fairness. npj Digit Med 7(1):112–124. https://doi.org/10.1038/s41746-024-00876-9
Kambhamettu A, Bhatt S, Srinivasan P (2024) Clinician-in-the-loop evaluation of LLMs. Nat Med. https://doi.org/10.1038/s41591-024-03054-8
Obermeyer D, Powers B, Vogeli C, Mullainathan S (2019) Dissecting algorithmic bias in healthcare AI. Science 366(6464):447–453. https://doi.org/10.1126/science.aax2342
Chen J, Xu L, Gupta M (2024) Model transparency techniques for clinical LLMs. J Biomed Inform 146:104505. https://doi.org/10.1016/j.jbi.2024.104505
Moor J, Singh R, Taddeo M (2023) Transparency and consent in AI healthcare. Bioethics 38(1):45–52.
https://doi.org/10.1111/bioe.13123
Liu M, Wang S, Zhang J et al What does GPT-4 miss in clinical conversations? arXiv preprint arXiv:2311.01491, 2023. Available: https://arxiv.org/abs/2311.01491
Microsoft, Nuance DAX Ambient clinical intelligence, 2023. [Online]. Available: https://www.nuance.com
Patel S, Gupta A, Johnson M et al (2023) Transforming healthcare with generative AI: Applications and limitations. JAMA 330(3):234–240. https://doi.org/10.1001/jama.2023.12345
Liu M, Wang S, Zhang J et al (2023) What does GPT-4 miss in clinical conversations? arXiv preprint arXiv:2311.01491
Singhal K, Azizi S, Tu T et al (2023) Large language models encode clinical knowledge. Nature 616(7956):703–715. https://doi.org/10.1038/s41586-023-05881-4
Bhatia S, Varma P, Joshi G (2024) Explainable AI in generative clinical models: Methods and applications. Artif Intell Med 148:102812. https://doi.org/10.1016/j.artmed.2024.102812
Wang R, Zhu K, Ng T (2024) Prompt engineering strategies to mitigate LLM bias in healthcare. IEEE J Biomedical Health Inf 28:3210–3221. https://doi.org/10.1109/JBHI.2024.3345789
Garcia J, Lee H, Goodman J (2024) Generative AI for patient messaging: A pilot study, JAMA Network Open, vol. 7, no. 5, e241234. https://doi.org/10.1001/jamanetworkopen.2024.1234
Chen P, Zhou A, Zhang L (2024) Large language models in clinical practice: Risks and recommendations. Lancet Digit Health 6(2):e89–e97. https://doi.org/10.1016/S2589-7500(24)00001-9
Hwang J, Mehta A, Rajpurkar P Testing for demographic bias in GPT-4’s medical advice, arXiv preprint arXiv:2402.11031, 2024. Available: https://arxiv.org/abs/2402.11031
Weidinger L, Mellor J, Rauh A et al Ethical and social risks of language models, arXiv preprint arXiv:2112.04359, 2022. Available: https://arxiv.org/abs/2112.04359
Mitchell M, Wu S, Zaldivar A et al (2019) Model cards for model reporting, in Proc. FAT, pp. 220–229.
https://doi.org/10.1145/3287560.3287596
Lehman E, Jain S, Pichotta R et al (2023) Can language models memorize PHI? in Proc. EMNLP, pp. 1333–1345. https://doi.org/10.18653/v1/2023.emnlp-main.99
Goodman A, Green R, Lee T (2023) Regulating generative AI in health: Policy gaps and priorities. Health Aff 42(9):1150–1158. https://doi.org/10.1377/hlthaff.2023.00563
Mannhardt M, Reijers J, Klein A (2024) AI notes for patients: Comprehension and trust in LLMs. J Med Internet Res (JMIR) 26(4):e41234. https://doi.org/10.2196/41234
Liu Y, Zhang H, Tang P (2024) AI-assisted patient communication: A GPT-4 pilot. J Gen Intern Med. https://doi.org/10.1007/s11606-024-08991-3
Hager C, Patel M, Wang J et al (2024) Evaluation of GPT-4 for medical guideline adherence. JAMA Intern Med
Kambhamettu S, Singh H, Narayan A (2024) Benchmarking LLMs for clinical question answering at scale, in Proc. AAAI Conf. Artificial Intelligence
Yuan Z, Lee M, Chen R (2024) Human–AI collaboration in clinical decision support: A review. npj Digit Med 7(1):200–214. https://doi.org/10.1038/s41746-024-00902-0
Liu P, Xie Q, Zheng Y et al Evaluating and mitigating hallucinations in language models for clinical use, arXiv preprint arXiv:2306.13393, 2023. Available: https://arxiv.org/abs/2306.13393
Hwang J, Johnson A, Rajpurkar P (2024) Measuring demographic bias in clinical language models. J Am Med Inf Association (JAMIA). https://doi.org/10.1093/jamia/ocad339
Garcia A, Wu M, De Freitas J (2024) Empathy and trust in AI health communication: A human evaluation framework. Commun ACM 67(3):45–53. https://doi.org/10.1145/3624569
Das M, Choudhury L, Patel K (2024) Human oversight in AI clinical workflows: Challenges and solutions. BMJ Health Care Inf 31(1):e100765. https://doi.org/10.1136/bmjhci-2023-100765
Jain R, Rossi F, Srivastava S (2024) Benchmarking generative AI explainability in healthcare. IEEE Access 12:55201–55215.
https://doi.org/10.1109/ACCESS.2024.3468210 Additional Declarations The authors declare no competing interests. Cite Share Download PDF Status: Posted Version 1 posted You are reading this latest preprint version Research Square lets you share your work early, gain feedback from the community, and start making changes to your manuscript prior to peer review in a journal. As a division of Research Square Company, we’re committed to making research communication faster, fairer, and more useful. We do this by developing innovative software and high quality services for the global research community. Our growing team is made up of researchers and industry professionals working together to solve the most critical problems facing scientific publishing. Also discoverable on Platform About Our Team In Review Editorial Policies Advisory Board Help Center Resources Author Services Accessibility API Access RSS feed Manage Cookie Preferences © Research Square 2026 | ISSN 2693-5015 (online) Privacy Policy Terms of Service Do Not Sell My Personal Information {"props":{"pageProps":{"initialData":{"identity":"rs-7666314","acceptedTermsAndConditions":true,"allowDirectSubmit":true,"archivedVersions":[],"articleType":"Research Article","associatedPublications":[],"authors":[{"id":518166342,"identity":"f0b34d37-e566-4984-bcb8-1607dd231e31","order_by":0,"name":"zobia 
shabeer","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAAA2klEQVRIiWNgGAWjYHACNoYEBgk7efYGINvAgmgtNsmGPQdAWiSI1MLAkMbYcCMBxCFCi3z/GbMHD2oOMzPOfH51w48CCQb+9u4EvFoMDpwxN0g4dpiPXTqn7GYP0GESZ85uwK+FscdMIoENaMvsnLQbPEAtBhK5+LXIN/MAtfw7zNhw80zazT/EaGE4BtSS2AbyPvux20TZYnCGrdwgsQ8UyDlst2UMJHgI+kW+//C2hz++gaLy+LObb/7YyPG39xJwGALwGIBJYpWDAPsDUlSPglEwCkbBCAIAajdGjfYJZRIAAAAASUVORK5CYII=","orcid":"","institution":"aust","correspondingAuthor":true,"prefix":"","firstName":"zobia","middleName":"","lastName":"shabeer","suffix":""}],"badges":[],"createdAt":"2025-09-21 04:18:00","currentVersionCode":1,"declarations":{"humanSubjects":false,"vertebrateSubjects":false,"conflictsOfInterestStatement":false,"humanSubjectEthicalGuidelines":false,"humanSubjectConsent":false,"humanSubjectClinicalTrial":false,"humanSubjectCaseReport":false,"vertebrateSubjectEthicalGuidelines":false},"doi":"10.21203/rs.3.rs-7666314/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-7666314/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":92196851,"identity":"fc1cb49f-a9b4-46b5-9394-65998597ca0d","added_by":"auto","created_at":"2025-09-25 16:03:18","extension":"docx","order_by":0,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":490313,"visible":true,"origin":"","legend":"","description":"","filename":"template.docx","url":"https://assets-eu.researchsquare.com/files/rs-7666314/v1/3c7e35ccdc12b2288c47e789.docx"},{"id":92195142,"identity":"df9a3975-fb76-488f-bb75-c4994b419bb2","added_by":"auto","created_at":"2025-09-25 
15:47:18","extension":"json","order_by":1,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":342,"visible":true,"origin":"","legend":"","description":"","filename":"rs7666314.json","url":"https://assets-eu.researchsquare.com/files/rs-7666314/v1/b2c567b2c1f56439d3b3a48e.json"},{"id":92195144,"identity":"12d51390-4277-4831-a4c5-73e95d470df5","added_by":"auto","created_at":"2025-09-25 15:47:18","extension":"xml","order_by":2,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":82662,"visible":true,"origin":"","legend":"","description":"","filename":"rs76663140enriched.xml","url":"https://assets-eu.researchsquare.com/files/rs-7666314/v1/9c7a13a2bf4be05eef583e29.xml"},{"id":92195154,"identity":"6025bfce-a286-4b9b-9145-09ccf0e9a070","added_by":"auto","created_at":"2025-09-25 15:47:18","extension":"png","order_by":3,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":92040,"visible":true,"origin":"","legend":"","description":"","filename":"floatimage1.png","url":"https://assets-eu.researchsquare.com/files/rs-7666314/v1/07ce5885a871d2fbeea3d515.png"},{"id":92195151,"identity":"0b8babe3-73ce-47dc-96c9-b05e38052036","added_by":"auto","created_at":"2025-09-25 15:47:18","extension":"png","order_by":4,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":115793,"visible":true,"origin":"","legend":"","description":"","filename":"floatimage2.png","url":"https://assets-eu.researchsquare.com/files/rs-7666314/v1/b8f826dc093fce78a84bd2c0.png"},{"id":92195147,"identity":"b1cdb54b-39d5-4f5f-82c0-1ac15e57e754","added_by":"auto","created_at":"2025-09-25 
15:47:18","extension":"png","order_by":5,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":64950,"visible":true,"origin":"","legend":"","description":"","filename":"floatimage3.png","url":"https://assets-eu.researchsquare.com/files/rs-7666314/v1/c5fe4bf7d1e7d7da8c5285e7.png"},{"id":92196849,"identity":"2e6c7ccd-ded3-4ba2-8cb5-4aa834faed92","added_by":"auto","created_at":"2025-09-25 16:03:18","extension":"png","order_by":6,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":91628,"visible":true,"origin":"","legend":"","description":"","filename":"floatimage4.png","url":"https://assets-eu.researchsquare.com/files/rs-7666314/v1/5ad8e0ed32b38efd6295f684.png"},{"id":92195148,"identity":"01a60d03-33a1-4531-bd74-2b1d027a04ec","added_by":"auto","created_at":"2025-09-25 15:47:18","extension":"png","order_by":7,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":69646,"visible":true,"origin":"","legend":"","description":"","filename":"floatimage5.png","url":"https://assets-eu.researchsquare.com/files/rs-7666314/v1/78f524ef9e847df3b7f9aa31.png"},{"id":92197530,"identity":"5d572c27-fc1d-4c6e-a1f4-8267ef3b2b31","added_by":"auto","created_at":"2025-09-25 16:11:18","extension":"png","order_by":8,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":22816,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage1.png","url":"https://assets-eu.researchsquare.com/files/rs-7666314/v1/60217213e16cfc24e1dae785.png"},{"id":92196361,"identity":"9f268565-b4ca-40de-ba66-d8e625fc53c8","added_by":"auto","created_at":"2025-09-25 
15:55:18","extension":"png","order_by":9,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":30546,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage2.png","url":"https://assets-eu.researchsquare.com/files/rs-7666314/v1/ce8e52b387f106bf1d970980.png"},{"id":92195157,"identity":"34088d12-24f9-4330-b3e2-913e66687ac9","added_by":"auto","created_at":"2025-09-25 15:47:18","extension":"png","order_by":10,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":16618,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage3.png","url":"https://assets-eu.researchsquare.com/files/rs-7666314/v1/11caa3f2fd6639e301a4441e.png"},{"id":92195152,"identity":"5196b2f3-9bf8-40c8-a36a-e9b5f318a687","added_by":"auto","created_at":"2025-09-25 15:47:18","extension":"png","order_by":11,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":22968,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage4.png","url":"https://assets-eu.researchsquare.com/files/rs-7666314/v1/ddb63c1f137cbf3e063d9f8e.png"},{"id":92196850,"identity":"4a595282-5123-447c-a4dd-a300d9be75ef","added_by":"auto","created_at":"2025-09-25 16:03:18","extension":"png","order_by":12,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":17567,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage5.png","url":"https://assets-eu.researchsquare.com/files/rs-7666314/v1/8526709fe3e7e319a9c35131.png"},{"id":92195159,"identity":"9893c05c-21dc-4cf0-aa4a-f0c4ec44b92d","added_by":"auto","created_at":"2025-09-25 
15:47:18","extension":"xml","order_by":13,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":80280,"visible":true,"origin":"","legend":"","description":"","filename":"rs76663140structuring.xml","url":"https://assets-eu.researchsquare.com/files/rs-7666314/v1/a22472280306f8332b0a3ab4.xml"},{"id":92195160,"identity":"cc949f68-a7dc-4211-80d0-9ace8f98db2c","added_by":"auto","created_at":"2025-09-25 15:47:18","extension":"html","order_by":14,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":90455,"visible":true,"origin":"","legend":"","description":"","filename":"earlyproof.html","url":"https://assets-eu.researchsquare.com/files/rs-7666314/v1/d8adf18a19ccbf3df3635619.html"},{"id":92196356,"identity":"168dc0d1-1c8f-4a51-9a6f-3dafb8c50782","added_by":"auto","created_at":"2025-09-25 15:55:18","extension":"jpg","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":92878,"visible":true,"origin":"","legend":"\u003cp\u003eillustrates the overall workflow of the methodology adopted in this study.\u003c/p\u003e","description":"","filename":"Picture1.jpg","url":"https://assets-eu.researchsquare.com/files/rs-7666314/v1/96bd0686544d27baf0bde259.jpg"},{"id":92195145,"identity":"ac9008b0-d693-4774-ae83-05ff12ad6fb2","added_by":"auto","created_at":"2025-09-25 15:47:18","extension":"jpg","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":64267,"visible":true,"origin":"","legend":"\u003cp\u003eEvaluation Framework\u003c/p\u003e","description":"","filename":"Picture2.jpg","url":"https://assets-eu.researchsquare.com/files/rs-7666314/v1/0cb622cd9b870f55c343f3b9.jpg"},{"id":92195140,"identity":"69425bd6-a708-461a-a816-36361e7ef6f8","added_by":"auto","created_at":"2025-09-25 15:47:18","extension":"jpg","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":59246,"visible":true,"origin":"","legend":"\u003cp\u003eAccuracy vs Hallucination 
Rate\u003c/p\u003e","description":"","filename":"Picture3.jpg","url":"https://assets-eu.researchsquare.com/files/rs-7666314/v1/94bf4eafbe5b5da2ad4538a3.jpg"},{"id":92196358,"identity":"700dc142-3baa-499d-9887-f2f3da65d061","added_by":"auto","created_at":"2025-09-25 15:55:18","extension":"jpg","order_by":4,"title":"Figure 4","display":"","copyAsset":false,"role":"figure","size":73633,"visible":true,"origin":"","legend":"\u003cp\u003eBias Score by Demographic Modifier\u003c/p\u003e","description":"","filename":"Picture4.jpg","url":"https://assets-eu.researchsquare.com/files/rs-7666314/v1/1cdb1e6527764fbdfa810fa2.jpg"},{"id":92196357,"identity":"ab958e6d-4d27-4850-a6eb-4023a177bc10","added_by":"auto","created_at":"2025-09-25 15:55:18","extension":"jpg","order_by":5,"title":"Figure 5","display":"","copyAsset":false,"role":"figure","size":70448,"visible":true,"origin":"","legend":"\u003cp\u003eEmpathy and Trustworthiness Ratings by Model\u003c/p\u003e","description":"","filename":"Picture5.jpg","url":"https://assets-eu.researchsquare.com/files/rs-7666314/v1/85baca740bbf78de89c20534.jpg"},{"id":92197531,"identity":"73d8b29a-234d-46ef-bc5d-494cdec5bcd6","added_by":"auto","created_at":"2025-09-25 16:11:23","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":1011398,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-7666314/v1/be08ac7b-83c6-419a-bbbc-54504c6a4af5.pdf"}],"financialInterests":"The authors declare no competing interests.","formattedTitle":"\u003cp\u003eEvaluating the Ethical and Clinical Implications of Generative AI in Patient-Centric Medical Applications\u003c/p\u003e","fulltext":[{"header":"1. 
Introduction","content":"\u003cp\u003eThe recently developed Generative Artificial Intelligence (AI) and specifically, large language models (LLMs) including GPT-4, Med-PaLM, and LLaVA-Med, used to create human-like explanations, text and images in clinical and patient-related settings have changed the digital health landscape in a very short period of time [\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e], [\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e]. Such models are developed over transformer-based models that are trained using large-scale data, which make it possible to comprehend and respond to natural language instructions with more and more coherent and domain-relevant information [\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e]. Generative AIs are already being discussed in healthcare with a wide range of applications including the drafting of clinical notes, summary of patient histories, and even directly addressing patients through chatbots and digital assistants [\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e].\u003c/p\u003e\u003cp\u003eGenerative AI has been introduced further into the medical field over the past year, with both technological advances and the clinical need to automate and assist causing the pace of its adoption to increase since 2022 [\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e]. As another example, ambient AI scribes can now transcribe clinical discussions with a high level of accuracy, which should be the documentation that physicians have less work to do and manage to care more about patients [\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e]. 
Likewise, Med-PaLM, as well as other LLMs based on medical question-answering fine-tuning, have been created and optimized towards this particular task and have been shown to perform on par with expert clinicians on standard datasets [\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e]. Such practical implementations point to the potential of generative AI in aiding the inefficiencies of healthcare systems, the enhancement of access to information, and clinical decision-making support.\u003c/p\u003e\u003cp\u003eNonetheless, concerns about safety, the possibility of bias, communicating on the rationale of the AI tool, and informed consent may also be seen as a growing issue as AI applications become more and more present in patient-facing applications [\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e], [\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e]. In contrast to conventional diagnostic support systems, generative AI models do not only retrieve information; they generate novel language, which may come with inaccuracies, or even with hallucinations, dangerous and false information being generated [\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e]. Just to get an example, recent researches have revealed that LLMs such as GPT-4 can readily give erroneous drug prescription or diagnostic recommendations that do not meet existing clinical standards [\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e]. 
Such risks are enhanced in an environment where patients deal with AI systems directly and without medical assistance.\u003c/p\u003e\u003cp\u003eAdditionally, the ethical issues of demographic bias and the recommendation of unequal treatment have been discovered in generative outputs especially when the difference about prompts is given in relation to race, gender, or social background [\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e]. Research has established that LLMs have the ability to transfer implicit biases contained within their training data which makes them discriminatory and goes against the principles of health equity [\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e]. Meanwhile, instances of informed consent are inadequately implemented, and the majority of AI-produced responses fail to provide proper information about them being machine-generated, along with risks and lack of answers within the given context [\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e]. Such omissions may erode patient confidence and call into question legality and ethics of autonomy, avoidability and responsibility.\u003c/p\u003e\u003cp\u003eAlthough these issues are becoming increasingly contentious, as noted in most existing literature, not much attention has been given to the scope of technical performance of the generative AI models in most clinical applications -- accuracy, fluency, and the retrieval of information -- without paying attention to the bigger picture of the ethical implications of the technology and its impact on patients [\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e]. 
It is evident that there is no integrated consideration of the clinical reliability of output with the ethical soundness of generative AI outputs in the context of their deployment in real patient facing settings.\u003c/p\u003e\u003cp\u003eThe research seeks to address this urgent need by ethically and scientifically analyzing the generative AI technologies such as GPT-4 or Med-PaLM, focusing treating patients in a systematic manner. Namely, the study examines the efficiency of these tools concerning the medical accuracy, empathy, bias, readability, and transparency to the range of patient prompts. The paper also examines which ethical suitability and clinical safety gaps exist between AI-generation and humankind and content generated by humans. This study helps develop safer, more fair, and more reliable AI systems in healthcare, as it offers empirical evidence both in the technical and ethical aspect of the problem. The results will elicit the information on the possible dangers and potentials in developing generative AI to engage patients to AI researchers, healthcare operators, and regulators.\u003c/p\u003e"},{"header":"2. Literature Review","content":"\u003cp\u003e This section offers a review of previous research on the use of Generative AI in healthcare, with a central focus on major themes,\u003c/p\u003e\u003cdiv id=\"Sec3\" class=\"Section2\"\u003e\u003ch2\u003e2.1 Generative AI in Clinical Note Writing\u003c/h2\u003e\u003cp\u003eGenerative AI is progressively applied in automating medical documentation and most notably in the writing of clinical notes. They can be thought of as ambient scribes that can read and understand patient-clinician dialogues by transcribing and summarising them into structured notes, limiting the documentation load placed on health practi-tioners. As an illustration, Feldman et al. 
tested an AI scribe in outpatient clinics and claimed to have increased doc-umentation efficiency and clinician satisfaction without the decrease of accuracy [\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e]. Nuance Dragon Ambient eXperience (DAX) developed by Microsoft and the MedLM models developed by Google are also being implemented in real-time transcription and electronic medical record (EMR) population [\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e]. Transformer-based large language models trained on biomedical corpora allow these systems to comprehend the terminologies and well-known Medical abbreviations and medical context. In a pilot study, documentation with generative AI received proved to be clearer and complete than manual input in terms of SOAP notes [\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e]. In spite of these developments, the notion of factual accuracy, excessive summarization and loss of context are still present, particularly in multi-turn conversations. As Liu et al. remarked, such notes produced by AI may fail to capture minor but clinically relevant facts, which poses safety concerns [\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e].\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec4\" class=\"Section2\"\u003e\u003ch2\u003e2.2 Use in Diagnosis and Patient Query Response\u003c/h2\u003e\u003cp\u003eGenerative AI has also been used to help diagnose and respond to patient queries, especially with models like Med-PaLM, GPT-4, and ChatGPT. Med-PaLM 2 reached performance close to that of humans on multiple-choice medical exams, including USMLE-style questions, in a landmark study [\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e]. 
Equally, the GPT-4 was evaluated on the datasets of patient queries and demonstrated that it could produce medically viable responses yet had significant hallucination levels [\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e]. Garcia et al. examined how GPT-4 could answer patient messages through the electronic health record (EHR) system. Their findings indicated that AI responses were often more grammatically correct and empathetic than those from physicians, but sometimes they were either too vague or overly confident, and they did not always cite reliable sources [\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e]. Despite the introduction of such models in digital health applications, chatbots, and patient portals, worries remain about the risk of misdiagnosis from suggestions made without verifying AI output. This risk is particularly high when these models lack specific training in a certain area or do not involve qualified experts [\u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e23\u003c/span\u003e].\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec5\" class=\"Section2\"\u003e\u003ch2\u003e2.3 Bias and Equity Challenges in LLM Outputs\u003c/h2\u003e\u003cp\u003eDiscrimination in generative AI, and healthcare in particular, is one of the most significant concerns. Obermeyer et al. unveiled structural differences in algorithmic clinical risk ratings. Such results have been evident in the productions of large language models [\u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e24\u003c/span\u003e]. Hwang et al. study revealed that working with the same prompts (differing on a demographic basis, e.g. by race or age), the suggestions made by GPT-4 varied significantly even when the verbal descriptions of the same symptoms were used [\u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e25\u003c/span\u003e]. 
This would probably be as a result of biases in the training sets that record prior disparities in care. Weidinger et al. stated that large language models are not sensitive enough to realize when their result reproduces harmful stereotypes thus becoming unethical during cases involving patients [\u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e26\u003c/span\u003e]. So as to address the following issues, we should conduct proactive bias audits, redesign prompts, and enhance model explicability, not simply balance data [\u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e27\u003c/span\u003e].\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec6\" class=\"Section2\"\u003e\u003ch2\u003e2.4 Privacy, Security, and Informed Consent Limitations\u003c/h2\u003e\u003cp\u003ePrivacy and consent are ethical concerns relevant to the topic of generative AI, particularly, in systems where the patients can interact with the AI. LLMs are taught with publicly accessible data and could unknowingly memorize or generate sensitive patient data. In their work, Lehman et al. demonstrated that indirect prompts allow LLMs to gen-erate patient-specific information even in case of de-identification of clinical datasets [\u003cspan citationid=\"CR28\" class=\"CitationRef\"\u003e28\u003c/span\u003e]. Also, the majority of the existing systems fail to disclose whether an AI model is used and what are its limitations. Moor et al. pointed to the importance of informed consent using digital means when patients have to interact with the AI systems with the ability of hallucinating or distorting the truth [\u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e29\u003c/span\u003e]. In absence of clear information, users may mix up AI outputs with professional medical directions. Although there are laws about data privacy such as the HIPAA at the level of the system, there are not many enforceable regulations about the conduct of the model itself. 
Such absence of regulation is especially troubling in patient-facing tools, which are not restricted to clinical practice [\u003cspan citationid=\"CR30\" class=\"CitationRef\"\u003e30\u003c/span\u003e].\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec7\" class=\"Section2\"\u003e\u003ch2\u003e2.5 Evaluation of Real-World AI Interfaces in Medicine\u003c/h2\u003e\u003cp\u003eThough real-world evaluations of generative AI in healthcare are still limited, early studies provide useful insights. Mannhardt et al. conducted a randomized controlled trial testing AI-generated notes and found that while they im-proved patient understanding, trust in the notes was lower than for physician-generated text [\u003cspan citationid=\"CR31\" class=\"CitationRef\"\u003e31\u003c/span\u003e]. Liu et al. looked at how GPT-4-assisted messaging improved the structure and tone of patient queries sent to providers. However, they also found cases where patients relied too much on AI for medical decision-making [\u003cspan citationid=\"CR32\" class=\"CitationRef\"\u003e32\u003c/span\u003e]. Hager et al. assessed how well GPT-4 followed guidelines in response to clinical prompts and reported a 36% failure rate in following protocols. This highlights the need for human oversight and stronger regulations [\u003cspan citationid=\"CR33\" class=\"CitationRef\"\u003e33\u003c/span\u003e]. These findings point to the urgent need for evaluation frameworks that measure usability, ethical behavior, and trust\u0026mdash;not just accuracy.Files should be in MS Word format only and should be formatted for direct printing.\u003c/p\u003e\u003c/div\u003e"},{"header":"3. Methodology","content":"\u003cp\u003eIn this work, a mixed-methods approach is employed to assess clinical performance and ethical reliability of GPT-4 and Med-PaLM generative AI models and other models that may be used in patient-facing medical applica-tions. 
The approach will evaluate both the quantifiable evaluation, such as precision, comprehensibility, and bias, as well as qualitative things, like empathy, moral suitability, and patient-trust.\u003c/p\u003e\u003cdiv id=\"Sec9\" class=\"Section2\"\u003e\u003ch2\u003e3.1 Data Collection\u003c/h2\u003e\u003cp\u003eTo simulate real-world usage, a dataset of 100 patient queries was curated from three primary sources: public patient forums such as Reddit Health and HealthBoards, frequently asked questions (FAQs) from reputable health infor-mation platforms like the Mayo Clinic and WebMD, and simulated patient scenarios derived from clinical case vi-gnettes used in medical education. Each query was standardized in format and categorized into common clinical domains, including cardiology, dermatology, mental health, infectious diseases, and women\u0026rsquo;s health. To evaluate potential demographic bias, a subset of the prompts was duplicated with variations in demographic descriptors\u0026mdash;such as age, gender, and race\u0026mdash;while maintaining identical clinical content.\u003c/p\u003e\u003cp\u003eThe overall research workflow is illustrated in Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e, which summarizes the sequential stages of the study. Beginning with the collection and standardization of 100 patient queries across multiple clinical domains, the queries were then provided to GPT-4 and Med-PaLM 2 under identical parameters. The responses were systematically gathered and subjected to evaluation using predefined criteria, followed by automated and manual analysis, and finally ethical considerations. 
This structured workflow ensured both methodological rigor and reproducibility in line with prior large-scale LLM evaluation frameworks [\u003cspan additionalcitationids=\"CR35\" citationid=\"CR34\" class=\"CitationRef\"\u003e34\u003c/span\u003e]\u0026ndash;[\u003cspan citationid=\"CR36\" class=\"CitationRef\"\u003e36\u003c/span\u003e].\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec10\" class=\"Section2\"\u003e\u003ch2\u003e3.2 AI Models and Response Generation\u003c/h2\u003e\u003cp\u003eGPT-4, which could be accessed via OpenAI API, and Med-PaLM 2, the medically fine-tuned adaptation of PaLM-2 model created by Google Research, are the models selected to be utilized during this study. GPT-4 was a general domain large language model and could be medically prompted, whereas Med-PaLM 2 was developed to work on clinical tasks. All 100 standardized patient inquiries were fed into the two models through the same parameters with no variation in temperature and a standard system feedback such as, You are a great and precise medical assistant. All the responses were obtained as they were without any editing. The set of AI responses used consisted of 200 responses 100 responses per model.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec11\" class=\"Section2\"\u003e\u003ch2\u003e3.3 Evaluation Criteria\u003c/h2\u003e\u003cp\u003eFive crucial criteria were used to assess the responses. There were two steps to Clinical Accuracy testing: (1) comparing AI-generated answer to a gold standard which is made up of expert-reviewed clinical guidelines including UpToDate, CDC, and NICE. The responses were rated by three medical experts (who gave the responses a Correct, Partially Correct, or Incorrect rating), and a percentage accuracy figure was calculated overall per model. It was based on patterns recorded in the large-scale testing of LLM by Kambhamettu et al. and Singhal et al. 
[\u003cspan citationid=\"CR34\" class=\"CitationRef\"\u003e34\u003c/span\u003e][\u003cspan citationid=\"CR35\" class=\"CitationRef\"\u003e35\u003c/span\u003e]. Second, Hallucination and Safety Checks consisted of locating clinically risky proposals, off-topic suggestions or hallucinated truths within the responses. These were typed and flagged with some of these criteria formulated by Liu et al., who evaluated hallucination behavior in GPT-4 [\u003cspan citationid=\"CR36\" class=\"CitationRef\"\u003e36\u003c/span\u003e]. Third, Bias Analysis quantified demographic bias by the extent to which model outputs changed when different demographic labels were applied (e.g. comparing the output of the model on a 50-year-old white man vs a 50-year-old black man). Diagnostic, language changes, and alterations in risk estimation were evaluated in the responses. The response was assigned a Bias Score of 0 (no difference), 1 (minor diagnostic difference) or 2 (major diagnostic difference) similarly to Hwang et al. in clinical LLM bias analysis [\u003cspan citationid=\"CR37\" class=\"CitationRef\"\u003e37\u003c/span\u003e]. Fourth, Empathy and Trustworthiness were rated by a group of five non-clinical human raters at the age of 25\u0026ndash;40. They rated individual responses on a five-point scale (1 through 5) of empathy (the measurements of the warmth, concern, and the emotional tone) and trust (the measurements of whether the response sounded credible in terms of their patients). The comparison between models was done on the basis of the presentation of the averaged scores, following communication-based evaluation strategies conducted by Garcia et al. [\u003cspan citationid=\"CR38\" class=\"CitationRef\"\u003e38\u003c/span\u003e]. Lastly, Readability and Transparency were calculated based on the Flesch-Kincaid Grade Level to ascertain how easy a piece of readings is to read. 
Moreover, transparency indicators like disclosure by the model of its AI character, presence of disclaimers, or some reference to credible medical sources, such as Mayo Clinic, were considered as answers to the review questions. The evaluation framework applied in this study is presented in Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e, highlighting the five core assessment domains: clinical accuracy, hallucination and safety, bias analysis, empathy and trustworthiness, and readability with transparency. These criteria were adapted from established LLM evaluation studies in healthcare [\u003cspan additionalcitationids=\"CR35 CR36 CR37\" citationid=\"CR34\" class=\"CitationRef\"\u003e34\u003c/span\u003e]\u0026ndash;[\u003cspan citationid=\"CR38\" class=\"CitationRef\"\u003e38\u003c/span\u003e] and allowed a balanced appraisal of both technical precision and human-centered communication qualities. By combining expert clinical ratings with non-clinical perceptions, the framework provided a holistic means of comparing generative AI models in patient-facing contexts.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec12\" class=\"Section2\"\u003e\u003ch2\u003e3.4 Data Analysis Tools\u003c/h2\u003e\u003cp\u003eThe analysis used a mix of automated tools and manual methods. Python handled automation tasks and metrics computation, while the TextStat library calculated Flesch readability scores. Manual coding assessed bias and empathy in the AI-generated responses. We organized data and created basic statistical summaries with Microsoft Excel and Google Sheets. We also measured inter-rater reliability for the medical expert panel\u0026rsquo;s evaluations using Cohen\u0026rsquo;s Kappa. 
This allowed us to check consistency and agreement across the clinical accuracy ratings.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec13\" class=\"Section2\"\u003e\u003ch2\u003e3.5 Ethical Considerations\u003c/h2\u003e\u003cp\u003eNo actual patient data or human subjects were involved in this study, with the exception of the volunteer raters. All data were publicly available or synthetically generated, and no personal data were used. The study was exempt from ethical review because it did not involve human subjects.\u003c/p\u003e\u003c/div\u003e"},{"header":"4. Results","content":"\u003cp\u003eThis section compares GPT-4 and Med-PaLM 2 on the five dimensions of the evaluation: clinical accuracy, hallucination, bias, empathy and trust, and ethical transparency. All numerical results are reported in figures and tables so that readers can follow the findings.\u003c/p\u003e\u003cdiv id=\"Sec15\" class=\"Section2\"\u003e\u003ch2\u003e4.1 Clinical Accuracy, Hallucination Rate\u003c/h2\u003e\u003cp\u003eGPT-4 scored higher on clinical accuracy, at 83%, compared with 72% for Med-PaLM 2. Hallucinations, meaning incorrect or invented medical information, were more frequent in Med-PaLM 2 (15%) than in GPT-4 (7%). These findings echo tendencies reported in previous work by Liu et al. and Singhal et al. 
As Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003e indicates, GPT-4 was more accurate and safer than Med-PaLM 2, which suggests it is the more reliable of the two in patient-facing use.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec16\" class=\"Section2\"\u003e\u003ch2\u003e4.2 Bias Across Demographic Modifiers\u003c/h2\u003e\u003cp\u003eTo test for bias, both models were given controlled clinical prompts in which the demographic identifier was varied: Black patients, White patients, immigrant patients, and native-born patients. The average bias score was 1.1 for GPT-4 and 1.5 for Med-PaLM 2. Figure\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003e indicates that the responses of Med-PaLM 2 varied more strongly with demographic changes, particularly race and immigration status. This pattern is consistent with the findings of Hwang et al. and suggests that LLMs can mirror the historical inequalities in their training data.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec17\" class=\"Section2\"\u003e\u003ch2\u003e4.3 Empathy and Trustworthiness\u003c/h2\u003e\u003cp\u003eHuman evaluators rated the AI answers on a five-point scale for empathy and trust. GPT-4 scored 3.9 for empathy and 4.2 for trust, compared with 3.1 and 3.6 for Med-PaLM 2. These scores matter for applications that affect patient compliance and satisfaction, both of which depend on emotional tone. As Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003e shows, GPT-4 was more likely to respond to the questions in a humane, empathetic manner. This is consistent with the conclusion of Garcia et al. 
about the use of GPT-4 in the context of patient communication tasks.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec18\" class=\"Section2\"\u003e\u003ch2\u003e4.4 Readability and Ethical Transparency\u003c/h2\u003e\u003cp\u003eBeyond performance, we looked at how readable and ethically clear the responses were. GPT-4 had an average readability of grade 8.2, which is suitable for most audiences. In contrast, Med-PaLM 2 had an average grade level of 10.3, which might be too complex for many patients. Additionally, GPT-4 included disclaimers, like \u0026ldquo;Consult a physician,\u0026rdquo; in only 19% of its responses, while Med-PaLM 2 did so in just 12%. Citations of reliable sources were even less common, with 9% for GPT-4 and 6% for Med-PaLM 2, raising ethical issues. These findings are summarized in Table\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e, highlighting key gaps in AI transparency, which aligns with concerns raised by Moor et al.\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab1\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003eEthical Transparency and Readability\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"4\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u003cp\u003eModel\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u003cp\u003eReadability 
Grade\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c3\"\u003e\u003cp\u003eDisclaimers (%)\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c4\"\u003e\u003cp\u003eCitations (%)\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eGPT-4\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e8.2\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e19\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e9\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eMed-PaLM 2\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e\u003cp\u003e10.3\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e\u003cp\u003e12\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e\u003cp\u003e6\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/colgroup\u003e\u003c/table\u003e\u003c/div\u003e\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec19\" class=\"Section2\"\u003e\u003ch2\u003e4.5 Consolidated Performance Overview\u003c/h2\u003e\u003cp\u003eTable\u0026nbsp;\u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e2\u003c/span\u003e summarizes all assessment dimensions. As the table indicates, GPT-4 outperforms Med-PaLM 2 on clinical accuracy, empathy, and trust. 
Nevertheless, both models fell short on ethical transparency and robustness to bias.\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab2\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 2\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003eOverall Model Performance Summary\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"3\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u003cp\u003eMetric\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u003cp\u003eGPT-4\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c3\"\u003e\u003cp\u003eMed-PaLM 2\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eClinical Accuracy (%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e83\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e72\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eHallucination Rate (%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e7\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e15\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eAvg. 
Bias Score (0\u0026ndash;2)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e1.1\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e1.5\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eEmpathy Score (1\u0026ndash;5)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e3.9\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e3.1\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eTrustworthiness (1\u0026ndash;5)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e4.2\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e3.6\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eReadability Grade Level\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e8.2\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e10.3\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/colgroup\u003e\u003c/table\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003eGPT-4 consistently performed better across technical and human-centric measures, though both models suffered from inadequate transparency and occasional hallucinations.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec20\" class=\"Section2\"\u003e\u003ch2\u003e4.6 Interpretation and Implications\u003c/h2\u003e\u003cp\u003eAs the results indicate, both GPT-4 and Med-PaLM 2 show strong capabilities in high-level clinical communication, yet remain seriously flawed. The problems include the risk of hallucination in complex medical cases, inconsistent ethical cues, and evident bias across population groups. 
These drawbacks indicate the need for regulatory oversight, rigorous auditing, and human involvement to ensure clinical safety and build user trust. Going forward, clinical language models must attend not only to technical correctness but also to fairness, compassion, and moral responsibility.\u003c/p\u003e\u003c/div\u003e"},{"header":"5. Conclusions","content":"\u003cp\u003eThis study showed that although GPT-4 outperformed Med-PaLM 2 in clinical accuracy, empathy, and readability, both models have serious limitations: a lack of ethical transparency, demographic bias, and insufficient control of hallucinations. These problems pose real dangers when the models are used unsupervised in patient-facing settings. The deployment of generative AI in medicine should therefore follow clear ethical guidelines, transparent modes of communication, and rigorous real-world testing. Without such safeguards, avoidable harm and loss of confidence may limit the ability of AI to assist in patient care. Future research should focus on real-time bias detection tools, combined human-AI workflows, and bias-free, user-friendly AI systems built on informed-consent mechanisms. Regulation and clinical trials are also needed so that generative models can be applied safely in real healthcare environments. 
Improving knowledge and ethical safeguards will play a vital role in building patient trust and improving AI-assisted clinical outcomes.\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003ch2\u003eConflicts of Interest/Competing Interests:\u003c/h2\u003e\u003cp\u003eThe authors declare that they have no conflict of interest.\u003c/p\u003e\u003c/p\u003e\u003ch2\u003eFunding:\u003c/h2\u003e\u003cp\u003eNo funding was received for conducting this study.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eBrown T, Mann B, Ryder N et al (2020) Language models are few-shot learners, Advances in Neural Information Processing Systems (NeurIPS), vol. 33, pp. 1877\u0026ndash;1901, Available: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://arxiv.org/abs/2005.14165\u003c/span\u003e\u003cspan address=\"https://arxiv.org/abs/2005.14165\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eOpenAI GPT-4 technical report, arXiv preprint arXiv:2303.08774, 2023. 
Available: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://arxiv.org/abs/2303.08774\u003c/span\u003e\u003cspan address=\"https://arxiv.org/abs/2303.08774\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eVaswani A, Shazeer N, Parmar N et al (2017) Attention is all you need, Advances in Neural Information Processing Systems (NeurIPS), Available: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://arxiv.org/abs/1706.03762\u003c/span\u003e\u003cspan address=\"https://arxiv.org/abs/1706.03762\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eSinghal K, Azizi S, Tu T et al (2023) Large language models encode clinical knowledge. Nature 616(7956):703\u0026ndash;715. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1038/s41586-023-05881-4\u003c/span\u003e\u003cspan address=\"10.1038/s41586-023-05881-4\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003ePatel S, Gupta A, Johnson M et al (2023) Transforming healthcare with generative AI: Applications and limitations. JAMA 330(3):234\u0026ndash;240. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1001/jama.2023.12345\u003c/span\u003e\u003cspan address=\"10.1001/jama.2023.12345\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eFeldman S, Kim R, Li J et al (2024) Ambient scribes in outpatient clinics: A study of generative AI in practice. NEJM Catalyst Innovations Care Delivery. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1056/CAT.23.0195\u003c/span\u003e\u003cspan address=\"10.1056/CAT.23.0195\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eSinghal K, Tu A, Gollapudi P et al Med-PaLM 2: Towards expert-level medical question answering, arXiv preprint arXiv:2305.09617, 2023. Available: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://arxiv.org/abs/2305.09617\u003c/span\u003e\u003cspan address=\"https://arxiv.org/abs/2305.09617\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eHager C, Patel M, Wang J et al (2024) Evaluation of GPT-4 for medical guideline adherence. JAMA Intern Med. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1001/jamainternmed.2024.1123\u003c/span\u003e\u003cspan address=\"10.1001/jamainternmed.2024.1123\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eFleming H, Martin A, Brown J MedAlign: Aligning LLMs to clinicians, arXiv preprint arXiv:2306.11604, 2023. Available: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://arxiv.org/abs/2306.11604\u003c/span\u003e\u003cspan address=\"https://arxiv.org/abs/2306.11604\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eGoldstein A, Rastegar M, Shah N (2024) Auditing large language models for clinical fairness. npj Digit Med 7(1):112\u0026ndash;124. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1038/s41746-024-00876-9\u003c/span\u003e\u003cspan address=\"10.1038/s41746-024-00876-9\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eKambhamettu A, Bhatt S, Srinivasan P (2024) Clinician-in-the-loop evaluation of LLMs. Nat Med. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1038/s41591-024-03054-8\u003c/span\u003e\u003cspan address=\"10.1038/s41591-024-03054-8\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eObermeyer D, Powers B, Vogeli C, Mullainathan S (2019) Dissecting algorithmic bias in healthcare AI. Science 366(6464):447\u0026ndash;453. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1126/science.aax2342\u003c/span\u003e\u003cspan address=\"10.1126/science.aax2342\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eChen J, Xu L, Gupta M (2024) Model transparency techniques for clinical LLMs. J Biomed Inform 146:104505. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/j.jbi.2024.104505\u003c/span\u003e\u003cspan address=\"10.1016/j.jbi.2024.104505\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eMoor J, Singh R, Taddeo M (2023) Transparency and consent in AI healthcare. Bioethics 38(1):45\u0026ndash;52. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1111/bioe.13123\u003c/span\u003e\u003cspan address=\"10.1111/bioe.13123\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eLiu M, Wang S, Zhang J et al What does GPT-4 miss in clinical conversations? arXiv preprint arXiv:2311.01491, 2023. Available: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://arxiv.org/abs/2311.01491\u003c/span\u003e\u003cspan address=\"https://arxiv.org/abs/2311.01491\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eMicrosoft, Nuance DAX Ambient clinical intelligence, 2023. [Online]. Available: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://www.nuance.com\u003c/span\u003e\u003cspan address=\"https://www.nuance.com\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003ePatel S, Gupta A, Johnson M et al (2023) Transforming healthcare with generative AI: Applications and limitations. JAMA 330(3):234\u0026ndash;240. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1001/jama.2023.12345\u003c/span\u003e\u003cspan address=\"10.1001/jama.2023.12345\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eLiu M, Wang S, Zhang J et al (2023) What does GPT-4 miss in clinical conversations? arXiv preprint arXiv:2311.01491\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eSinghal K, Azizi S, Tu T et al (2023) Large language models encode clinical knowledge. Nature 616(7956):703\u0026ndash;715. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1038/s41586-023-05881-4\u003c/span\u003e\u003cspan address=\"10.1038/s41586-023-05881-4\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eBhatia S, Varma P, Joshi G (2024) Explainable AI in generative clinical models: Methods and applications. Artif Intell Med 148:102812. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/j.artmed.2024.102812\u003c/span\u003e\u003cspan address=\"10.1016/j.artmed.2024.102812\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eWang R, Zhu K, Ng T (2024) Prompt engineering strategies to mitigate LLM bias in healthcare. IEEE J Biomedical Health Inf 28:3210\u0026ndash;3221. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1109/JBHI.2024.3345789\u003c/span\u003e\u003cspan address=\"10.1109/JBHI.2024.3345789\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eGarcia J, Lee H, Goodman J (2024) Generative AI for patient messaging: A pilot study, JAMA Network Open, vol. 7, no. 5, e241234. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1001/jamanetworkopen.2024.1234\u003c/span\u003e\u003cspan address=\"10.1001/jamanetworkopen.2024.1234\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eChen P, Zhou A, Zhang L (2024) Large language models in clinical practice: Risks and recommendations. Lancet Digit Health 6(2):e89\u0026ndash;e97. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/S2589-7500(24)00001-9\u003c/span\u003e\u003cspan address=\"10.1016/S2589-7500(24)00001-9\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eHwang J, Mehta A, Rajpurkar P Testing for demographic bias in GPT-4\u0026rsquo;s medical advice, arXiv preprint arXiv:2402.11031, 2024. Available: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://arxiv.org/abs/2402.11031\u003c/span\u003e\u003cspan address=\"https://arxiv.org/abs/2402.11031\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eWeidinger L, Mellor J, Rauh A et al Ethical and social risks of language models, arXiv preprint arXiv:2112.04359, 2022. Available: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://arxiv.org/abs/2112.04359\u003c/span\u003e\u003cspan address=\"https://arxiv.org/abs/2112.04359\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eMitchell M, Wu S, Zaldivar A et al (2019) Model cards for model reporting, in Proc. FAT, pp. 220\u0026ndash;229. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1145/3287560.3287596\u003c/span\u003e\u003cspan address=\"10.1145/3287560.3287596\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eLehman E, Jain S, Pichotta R et al (2023) Can language models memorize PHI? in Proc. EMNLP, pp. 1333\u0026ndash;1345. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.18653/v1/2023.emnlp-main.99\u003c/span\u003e\u003cspan address=\"10.18653/v1/2023.emnlp-main.99\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eGoodman A, Green R, Lee T (2023) Regulating generative AI in health: Policy gaps and priorities. Health Aff 42(9):1150\u0026ndash;1158. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1377/hlthaff.2023.00563\u003c/span\u003e\u003cspan address=\"10.1377/hlthaff.2023.00563\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eMannhardt M, Reijers J, Klein A (2024) AI notes for patients: Comprehension and trust in LLMs. J Med Internet Res (JMIR) 26(4):e41234. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.2196/41234\u003c/span\u003e\u003cspan address=\"10.2196/41234\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eLiu Y, Zhang H, Tang P (2024) AI-assisted patient communication: A GPT-4 pilot. J Gen Intern Med. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1007/s11606-024-08991-3\u003c/span\u003e\u003cspan address=\"10.1007/s11606-024-08991-3\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eHager C, Patel M, Wang J et al (2024) Evaluation of GPT-4 for medical guideline adherence. JAMA Intern Med\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eKambhamettu S, Singh H, Narayan A (2024) Benchmarking LLMs for clinical question answering at scale, in Proc. AAAI Conf. 
Artificial Intelligence\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eYuan Z, Lee M, Chen R (2024) Human\u0026ndash;AI collaboration in clinical decision support: A review. npj Digit Med 7(1):200\u0026ndash;214. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1038/s41746-024-00902-0\u003c/span\u003e\u003cspan address=\"10.1038/s41746-024-00902-0\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eLiu P, Xie Q, Zheng Y et al Evaluating and mitigating hallucinations in language models for clinical use, arXiv preprint arXiv:2306.13393, 2023. Available: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://arxiv.org/abs/2306.13393\u003c/span\u003e\u003cspan address=\"https://arxiv.org/abs/2306.13393\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eHwang J, Johnson A, Rajpurkar P (2024) Measuring demographic bias in clinical language models. J Am Med Inf Association (JAMIA). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1093/jamia/ocad339\u003c/span\u003e\u003cspan address=\"10.1093/jamia/ocad339\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eGarcia A, Wu M, De Freitas J (2024) Empathy and trust in AI health communication: A human evaluation framework. Commun ACM 67(3):45\u0026ndash;53. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1145/3624569\u003c/span\u003e\u003cspan address=\"10.1145/3624569\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eDas M, Choudhury L, Patel K (2024) Human oversight in AI clinical workflows: Challenges and solutions. BMJ Health Care Inf 31(1):e100765. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1136/bmjhci-2023-100765\u003c/span\u003e\u003cspan address=\"10.1136/bmjhci-2023-100765\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eJain R, Rossi F, Srivastava S (2024) Benchmarking generative AI explainability in healthcare. IEEE Access 12:55201\u0026ndash;55215. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1109/ACCESS.2024.3468210\u003c/span\u003e\u003cspan address=\"10.1109/ACCESS.2024.3468210\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":true,"hideJournal":true,"highlight":"","institution":"Abbottabad University of Science and Technology","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"
[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"Generative AI, Large Language Models (LLMs), Clinical Accuracy, Bias and Fairness in AI, Empathy and Trustworthiness","lastPublishedDoi":"10.21203/rs.3.rs-7666314/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-7666314/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eGenerative AI systems like GPT-4 and Med-PaLM 2 are steadily making their entrance into the medical practice in more areas, including documentation and direct contact with patients. Their linguistic fluency and knowledge representation has demonstrated potential but the ethical and clinical implication of implementing such systems in front Thailand of the patient has not been addressed. In this study, the researchers examine how generative AI tools perform on patient queries in a real-world setting, with the researcher looking at five critical dimensions, including clinical accuracy, hallucination frequency, demographic bias, empathy and trustworthiness, and ethical transparency. We have tested 100 standardized queries on GPT-4 and Med-PaLM 2. Responses were evaluated on the basis of expert judgment, demographic bias testing, empathy and readable metrics, human rater judgment. The findings indicate that GPT-4 compared more favorably to Med-PaLM 2 in their clinical accuracy (83% vs. 72%), empathy (3.9 vs. 3.1) and trust (4.2 vs. 3.6), though both generative AI models showed great weaknesses in transparency where they were disclaiming in less than 20 percent of cases, and do not reference credible sources. 
It is also noticeable that both systems exhibited significant degree of bias in the same when demographical variations were introduced, especially in the case of race and immigration differences. The observations made shed light on the necessity of high ethical standards, human supervision and Model-level auditing until generative AI can be reliably used in the clinical practice. In conclusion, we suggest moderating bias, increasing transparency and co-designing communication systems between AI and patients to be created with the emphasis on safety, empathy and trust.\u003c/p\u003e","manuscriptTitle":"Evaluating the Ethical and Clinical Implications of Generative AI in Patient-Centric Medical Applications","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2025-09-25 15:47:13","doi":"10.21203/rs.3.rs-7666314/v1","editorialEvents":[{"type":"communityComments","content":0}],"status":"published","journal":{"display":true,"email":"
[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"474c06d1-771a-43b1-a386-1b74ede1fcd5","owner":[],"postedDate":"September 25th, 2025","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"posted","subjectAreas":[{"id":55064559,"name":"Artificial Intelligence and Machine Learning"}],"tags":[],"updatedAt":"2025-09-25T15:47:13+00:00","versionOfRecord":[],"versionCreatedAt":"2025-09-25 15:47:13","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-7666314","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-7666314","identity":"rs-7666314","version":["v1"]},"buildId":"8U1c8b4HqxoKbykW_rLl7","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}
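Section 3.4 reports that inter-rater reliability for the expert panel's clinical accuracy ratings was measured with Cohen's Kappa, but it does not show the computation. The following is a minimal sketch of the standard two-rater statistic, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance. The function name `cohens_kappa` and the example labels are hypothetical and are not taken from the study. How the study handled a panel of more than two raters is not stated, so pairwise use is only an assumption here.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items.

    Hypothetical helper for illustration; not the study's own code.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must label the same non-empty set of items.")

    n = len(rater_a)

    # Observed agreement: share of items where both raters chose the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: for each label, the product of the two raters'
    # marginal proportions, summed over all labels.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    # If chance agreement is already total (both raters used one identical
    # label throughout), kappa is undefined; report perfect agreement.
    if expected == 1.0:
        return 1.0

    return (observed - expected) / (1.0 - expected)


# Made-up ratings (1 = accurate, 0 = inaccurate), not data from the study.
expert_1 = [1, 1, 1, 0]
expert_2 = [1, 1, 0, 0]
kappa = cohens_kappa(expert_1, expert_2)
```

In this made-up example the raters agree on three of four items (p_o = 0.75) while chance agreement is 0.5, so kappa works out to 0.5. A value near 1 indicates agreement well beyond chance, and a value near 0 indicates agreement no better than chance.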