Performance Evaluation of Large Language Models in Real-World Perinatal Medication Consultations: A Cross-Sectional Study | Research Square

Research Article

RAN WANG, Yifan Li, Xuewei Feng, Xin Feng

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-8696873/v1
This work is licensed under a CC BY 4.0 License.
Status: Published. Journal publication: published 27 Apr, 2026, in International Journal of Clinical Pharmacy.
Version 1

Abstract

Introduction
Perinatal medication consultation is a core clinical pharmacy service that involves a complex benefit–risk assessment for both maternal and fetal safety.
Large language models (LLMs) have emerged as potential tools to improve access to medication information, yet their performance and safety in real-world, pharmacist-led perinatal consultation settings, particularly in non-English contexts, remain insufficiently evaluated.

Aim
To evaluate and compare the performance of multiple advanced large language models in addressing real-world Chinese perinatal medication consultation queries and to assess their potential role as supervised adjunctive tools within clinical pharmacy services.

Method
This cross-sectional study evaluated seven LLMs using real-world clinical data from pharmacist-led medication consultations at the Pharmacy Clinic of the Beijing Obstetrics and Gynecology Hospital, Capital Medical University. A standardized test set of 64 perinatal medication consultation questions was developed from 15,280 electronic consultation records collected between April 2014 and April 2024. The evaluated models included international (GPT-5.1, Grok 3, Gemini 3.0) and domestic (DeepSeek, Wenxin Yiyan, Kimi K2, Tongyi Qianwen) models. Senior clinical pharmacologists independently assessed responses across four dimensions (relevance, accuracy, usefulness, and empathy) using a 10-point Likert scale. The results are summarized as mean ± SD, and between-model differences were analyzed using non-parametric statistical tests.

Results
Among the 448 model-generated responses, inter-rater consistency was excellent (ICC = 0.91, 95% CI 0.88–0.94). Significant differences in overall performance were observed among the models (p < 0.001). GPT-5.1 achieved the highest mean total score (9.1 ± 0.8), outperforming all other models (all p < 0.01), followed by Kimi K2 and DeepSeek. Accuracy was the primary determinant of performance differences, with GPT-5.1 showing the highest accuracy score (9.3 ± 0.9).
Performance gaps were more pronounced in complex clinical scenarios involving comorbidities or benefit–risk trade-offs, whereas domestic models demonstrated relative advantages in consultations involving traditional Chinese medicine.

Conclusion
LLMs have demonstrated variable performance in response to perinatal medication consultation queries. While high-performing models show the potential to support pharmacist-led perinatal medication consultations by improving access to information, their current performance supports use only as supervised, adjunctive decision-support tools, rather than as independent sources of medication counseling. Careful governance, human oversight, and further validation of safety and reliability are required before broader integration into perinatal clinical pharmacy practices.

Keywords: Large language models; Perinatal pharmacotherapy; Medication consultation; Clinical pharmacy practice; Real-world evidence

Impacts on Practice
Large language models may be used by clinical pharmacists as supervised decision-support tools to assist with information retrieval during perinatal medication consultations but should not replace professional clinical judgment or individualized risk–benefit assessment.
The independent or unsupervised use of LLMs for perinatal medication counseling poses potential patient safety risks, highlighting the need for pharmacist oversight, clear scope limitations, and structured governance in clinical practice.
The integration of LLMs into perinatal pharmacy services may help improve access to medication information in resource-limited settings, provided that models are rigorously validated, regularly updated with authoritative evidence, and embedded within pharmacist-led consultation workflows.
Introduction The perinatal period, encompassing pregnancy and early postpartum, involves complicated prescribing decisions with significant risks, including teratogenic effects and adverse fetal outcomes, such as congenital anomalies, preterm birth, and neurodevelopmental disorders [ 1 ]. Medication use during pregnancy and lactation requires careful benefit–risk assessment, which is often complicated by limited high-quality evidence and ethical constraints that preclude randomized controlled trials in these populations [ 2 ]. For example, selective serotonin reuptake inhibitors have been associated with an increased risk of preterm birth, whereas paroxetine exposure has been linked to a higher incidence of fetal cardiac malformations [ 3 ]. Therefore, clinical decision-making during the perinatal period requires individualized assessment, taking into account disease severity, minimal effective dosing, potential drug–drug interactions, and the risks of untreated maternal conditions. Untreated maternal illnesses, such as mental health disorders, may exacerbate adverse obstetric outcomes, including recurrent or persistent depression [ 4 ]. Perinatal pharmacotherapy consultations, frequently led by clinical pharmacists, play a critical role in supporting safe medication use during pregnancy and lactation [ 5 ]. Consultations typically rely on the integration of clinical expertise, practice guidelines, and evidence-based drug safety databases. However, accessibility and resource limitations remain substantial challenges, particularly in settings with a shortage of trained clinical pharmacists or barriers to in-person consultation [ 6 ]. Additionally, increasing the clinical workload and interdisciplinary coordination gaps may further limit the timeliness and consistency of medication counseling. 
While specialized resources, such as the Dutch Teratology Information Service [ 7 ], have been developed to improve access to drug safety information, their utility may be constrained by information overload, language barriers, and limited validation in complex clinical scenarios, including polypharmacy and drug interaction assessment. Large language models (LLMs), such as ChatGPT, have recently attracted attention in healthcare because of their potential to provide medication-related information and support clinical education and decision making, particularly in environments with limited access to specialist care [ 8 ]. Previous studies have reported favorable clinician perceptions of LLM utility in selected areas, including diabetes management and general medication counseling [ 9 – 10 ]. Nevertheless, important concerns remain regarding data privacy, bias, accuracy, and safety, particularly when LLMs are applied to high-risk clinical contexts. Although LLMs may offer opportunities to enhance access to perinatal medication information, their reliability and appropriateness in perinatal pharmacotherapy, where clinical consequences may affect both maternal and fetal outcomes, require careful evaluation [ 11 ]. Despite the growing interest in the clinical application of LLMs, their effectiveness in supporting perinatal pharmacotherapy consultations remains insufficiently explored. Most existing evaluations focus on general medical questions, simulated scenarios, or English language settings, with a limited assessment of real-world consultation data [ 12 ]. In addition, linguistic and contextual biases in the training data may affect LLM performance across different healthcare systems and practice environments. Therefore, robust evaluation using real-world consultation scenarios is essential to determine whether LLMs can be safely integrated as supportive tools within pharmacist-led perinatal medication services. 
Aim
This study aimed to evaluate and compare the performance of multiple large language models in responding to real-world perinatal medication consultation queries, with a focus on accuracy, usefulness, relevance, and empathy, and to assess their potential role as supervised adjunctive decision-support tools in clinical pharmacy practice.

Method

Study Design and Setting
This cross-sectional study assessed the performance of multiple large language models (LLMs) in addressing real-world perinatal medication consultation queries, using data derived from a pharmacist-led medication consultation service. The study was conducted using electronic consultation records from the Pharmacy Clinic of Beijing Obstetrics and Gynecology Hospital, Capital Medical University, covering the period April 2014 to April 2024.

Development of the Consultation Question Set
The study data were obtained from the electronic medication consultation records database maintained by the pharmacy clinic. An initial descriptive analysis of 15,280 consultation records was performed to identify frequently encountered consultation topics and commonly used medications. The development of the standardized consultation question set followed a three-step process.
1) Topic Classification: Consultation queries were categorized into five thematic domains: medication safety during pregnancy and lactation, drug administration and dosage guidance, adverse drug reaction management, drug interactions, and complex clinical decision-making scenarios.
2) Frequency Ranking: Queries within each thematic domain were ranked according to their frequency of occurrence in routine clinical practice.
3) Selection of Representative Questions: The three to five most frequent questions within each domain were selected and supplemented by a limited number of low frequency but clinically complex queries to ensure adequate representation of real-world practice.
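The frequency-ranking and selection steps described above could be sketched roughly as follows. This is an illustrative sketch only, not the authors' procedure: the record format (a domain label paired with a normalized question string), the helper name `select_representative`, and the toy placeholder records are all assumptions, and the manual addition of low-frequency but clinically complex queries is not modeled.

```python
from collections import Counter, defaultdict


def select_representative(records, top_k=3):
    """Return the top_k most frequent questions per domain.

    `records` is an iterable of (domain, question) pairs; both the format
    and this helper are hypothetical, for illustration only.
    """
    counts_by_domain = defaultdict(Counter)
    for domain, question in records:
        counts_by_domain[domain][question] += 1  # tally within each domain
    return {
        domain: [question for question, _ in counts.most_common(top_k)]
        for domain, counts in counts_by_domain.items()
    }


# Toy placeholder records (not study data).
records = (
    [("medication safety", "question A")] * 3
    + [("medication safety", "question B")] * 2
    + [("medication safety", "question C")]
    + [("dosage guidance", "question D")] * 2
    + [("dosage guidance", "question E")]
)

selected = select_representative(records, top_k=2)
for domain, questions in selected.items():
    print(domain, questions)
```

In practice, free-text queries would first need to be normalized and assigned to a domain (the Topic Classification step) before such counting is meaningful.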
This process resulted in a standardized test set of 64 perinatal medication consultation questions. The composition of the finalized question set is summarized in Supplementary Table S1.

Selection of Large Language Models
Seven representative LLMs that were publicly accessible as of October 2025 were selected for evaluation. Model selection aimed to include internationally leading models, widely used domestic models, and emerging high-performance systems, enabling a comprehensive horizontal comparison. The evaluated models were classified as follows: 1) international models: GPT-5.1, Grok 3, and Gemini 3.0; and 2) domestic models: DeepSeek, Wenxin Yiyan, Kimi K2, and Tongyi Qianwen.

Response Generation Protocol
A standardized response-generation protocol was applied to ensure consistency across the model outputs. Consultation questions were reformulated using the CO-STAR framework (Context, Objective, Style, Tone, Audience, and Response format) to minimize variability related to the prompt structure. Each model was queried once per question, reflecting a realistic first-use consultation scenario, rather than optimized prompt engineering or repeated sampling. Model responses were collected between October 13 and October 17, 2025. This approach was intended to evaluate the quality and safety of the information generated in isolated consultation scenarios, rather than to simulate real-time clinical decision-making or individualized prescriptions.

Performance Evaluation Framework
A panel of four senior clinical pharmacologists, each with 10–15 years of experience in perinatal pharmacotherapy, independently evaluated all the model-generated responses. Responses were assessed across four predefined dimensions (relevance, accuracy, usefulness, and empathy) using a 10-point Likert scale (1 = completely inadequate; 10 = completely adequate). The overall score for each response was calculated as the mean of the four dimension scores.
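As a minimal sketch of the overall-score rule stated above (the unweighted mean of the four dimension scores on the 1–10 scale), the following assumes a simple dictionary of ratings per response. The function name `overall_score` and the example ratings are hypothetical and are not study data.

```python
DIMENSIONS = ("relevance", "accuracy", "usefulness", "empathy")


def overall_score(scores):
    """Unweighted mean of the four dimension scores for one response.

    `scores` maps each dimension name to a rating on the 1-10 scale.
    """
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    for d in DIMENSIONS:
        if not 1 <= scores[d] <= 10:
            raise ValueError(f"{d} score out of the 1-10 range: {scores[d]}")
    return sum(scores[d] for d in DIMENSIONS) / len(DIMENSIONS)


# Hypothetical ratings for a single response (illustrative only).
example = {"relevance": 9, "accuracy": 8, "usefulness": 8, "empathy": 7}
print(overall_score(example))
```

Because every dimension carries equal weight, a response with weak accuracy can still receive a moderate overall score, which is one reason to read the dimension-level scores alongside the overall score.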
Initial ratings were performed independently, and discrepancies were resolved through panel consensus. Statistical Analysis Statistical analyses were conducted using the SPSS software (version 26.0). Because the score distributions did not meet the assumptions of normality, non-parametric statistical methods were applied for inferential analyses. Descriptive results are reported as mean ± standard deviation (SD) to facilitate comparison with previous LLM evaluation studies. Inter-rater reliability was assessed using the intraclass correlation coefficient (ICC) based on the initial independent ratings prior to consensus discussion. Differences in performance among models were examined using the Kruskal–Wallis H test, followed by Dunn’s post-hoc test for pairwise comparisons, where appropriate. Statistical significance was defined as a two-sided p-value < 0.05. Ethics Approval The study protocol was reviewed and approved by the Ethics Committee of the Beijing Obstetrics and Gynecology Hospital, Capital Medical University (Approval No.: 2024-KY-035-01). All consultation records were fully anonymized prior to the analysis, and the study was conducted in accordance with the principles of the Declaration of Helsinki. Results Composition of the Consultation Question Set The final test set consisted of 64 unique perinatal medication consultation questions derived from an analysis of 15,280 de-identified electronic consultation records collected between April 2014 and April 2024. Based on consultation frequency and clinical characteristics, the questions were categorized into five major domains ( Supplementary Table S1) : 1) safety evaluation of specific drugs during pregnancy or lactation (n = 25, 39.1%), 2) drug administration and dosage guidance (n = 12, 18.8%), 3) management of adverse drug reactions (n = 8, 12.5%), 4) drug interactions (n = 7, 10.9%), and 5) complex therapeutic decision-making based on specific clinical scenarios (n = 12, 18.8%). 
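The domain counts listed above can be checked with a few lines of arithmetic. The sketch below uses only the counts reported in the text; the variable names are illustrative, and the figure of seven models comes from the Method section.

```python
# Question counts per domain, as reported in the text.
domain_counts = {
    "safety evaluation during pregnancy or lactation": 25,
    "drug administration and dosage guidance": 12,
    "management of adverse drug reactions": 8,
    "drug interactions": 7,
    "complex therapeutic decision-making": 12,
}

total_questions = sum(domain_counts.values())
shares = {d: 100 * n / total_questions for d, n in domain_counts.items()}

n_models = 7  # number of evaluated LLMs
n_responses = total_questions * n_models  # one response per model per question

for domain, share in shares.items():
    print(f"{domain}: {domain_counts[domain]} ({share:.1f}%)")
print("total questions:", total_questions)
print("total responses:", n_responses)
```

The counts sum to the 64 questions in the test set, and one response per model per question gives the 448 responses evaluated by the panel.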
This distribution reflects the most frequent and clinically relevant medication-related concerns encountered during pharmacist-led perinatal consultations in routine practice.

Overall Performance of Large Language Models
A total of 448 model-generated responses (64 questions × 7 models) were independently evaluated by an expert panel. Inter-rater reliability for overall scores was excellent, with an ICC of 0.91 (95% CI 0.88–0.94), indicating high consistency among evaluators. Significant differences in overall performance were observed among the seven LLMs (Kruskal–Wallis test, p < 0.001). As shown in Table 1, GPT-5.1 achieved the highest mean overall score (9.1 ± 0.8), demonstrating consistently high performance across relevance, accuracy, usefulness, and empathy, which together reflect task alignment, clinical correctness, practical applicability, and patient-centered communication. Among the domestic models, Kimi K2 ranked highest (8.4 ± 1.2), followed by DeepSeek (8.2 ± 1.1). In contrast, Tongyi Qianwen showed a comparatively lower overall performance, particularly in terms of accuracy and usefulness.

Table 1 Performance scores of large language models across evaluation dimensions

Model            Relevance   Accuracy    Usefulness  Empathy     Overall score
GPT-5.1          9.2 ± 0.7   9.3 ± 0.9   9.1 ± 1.0   8.9 ± 1.1   9.1 ± 0.8
Kimi K2          8.8 ± 0.9   8.5 ± 1.3   8.5 ± 1.1   8.0 ± 0.9   8.4 ± 1.2
DeepSeek         8.5 ± 1.0   8.4 ± 1.2   8.3 ± 1.3   8.1 ± 0.8   8.2 ± 1.1
Gemini 3.0       8.2 ± 1.1   7.9 ± 1.4   8.0 ± 1.2   8.3 ± 1.0   8.1 ± 1.2
Wenxin Yiyan     8.3 ± 1.2   7.8 ± 1.6   7.9 ± 1.5   8.0 ± 1.3   7.9 ± 1.5
Grok 3           8.1 ± 1.0   7.5 ± 1.1   7.6 ± 1.2   7.9 ± 1.1   7.8 ± 1.1
Tongyi Qianwen   7.9 ± 1.3   7.4 ± 1.5   7.5 ± 1.4   7.8 ± 1.2   7.6 ± 1.4

Data are presented as mean ± standard deviation (SD). Relevance reflects the degree to which responses addressed the core consultation question. Accuracy reflects concordance with current clinical guidelines, authoritative references, and accepted pharmacotherapy principles.
Usefulness reflects the provision of specific, actionable, and clinically applicable recommendations. Empathy reflects the use of supportive, non-judgmental, and patient-centered language. Overall score represents the mean of the four evaluation dimensions. LLM, large language model. Dimension-Specific Performance Analysis Across all evaluated dimensions, accuracy emerged as the primary factor differentiating the model performance. GPT-5.1 achieved the highest accuracy score (9.3 ± 0.9), with responses most frequently aligned with current clinical guidelines, authoritative literature, and established pharmacotherapy principles. For example, in queries concerning ibuprofen use during the second and third trimesters, GPT-5.1 consistently identified risks, such as premature closure of the ductus arteriosus, and recommended appropriate alternative analgesics. In contrast, some lower-scoring models provided vague recommendations (e.g., “use with caution”) or omitted clinically important risk information. The relevance scores were consistently high across all models, indicating that most responses directly addressed the core consultation questions without introducing irrelevant content. This finding suggests that basic task understanding is largely preserved across models. Greater variability was observed for usefulness and empathy. GPT-5.1 (8.9 ± 1.1) and Gemini 3.0 (8.3 ± 1.0) achieved the highest empathy scores, frequently using supportive and non-judgmental language that acknowledged patient concerns, such as anxiety related to medication safety during pregnancy, and encouraged appropriate clinical follow-up. The usefulness scores were closely associated with the provision of specific and actionable guidance. 
High-performing models, including GPT-5.1 and Kimi K2, more often provided tailored recommendations, such as explicit instructions for managing missed doses of dydrogesterone (a synthetic progestogen) during pregnancy, rather than relying on generic advisory statements.

Performance by Question Complexity
The model performance varied substantially according to consultation complexity (Table 2). For direct evidence-based questions, such as adverse drug reaction management and dosage guidance, most models achieved satisfactory performance, with mean accuracy scores exceeding 7.5. In contrast, questions related to drug safety during pregnancy and lactation demonstrated lower overall scores, reflecting increased complexity and higher risk of factual deviation in this domain.

Table 2 Performance scores of large language models by consultation question type

Question type / LLM   Relevance   Accuracy    Usefulness  Empathy     Overall score

Drug administration and dosage guidance
  GPT-5.1             9.3 ± 0.6   9.4 ± 0.8   9.2 ± 0.9   9.0 ± 1.0   9.2 ± 0.8
  Kimi K2             8.9 ± 0.8   8.6 ± 1.2   8.6 ± 1.0   8.1 ± 0.8   8.5 ± 1.1
  DeepSeek            8.6 ± 0.9   8.5 ± 1.1   8.4 ± 1.2   8.2 ± 0.7   8.4 ± 1.0
  Gemini 3.0          8.3 ± 1.0   8.0 ± 1.3   8.1 ± 1.1   8.4 ± 0.9   8.2 ± 1.1
  Wenxin Yiyan        8.4 ± 1.1   7.9 ± 1.5   8.0 ± 1.4   8.1 ± 1.2   8.1 ± 1.3
  Grok 3              8.2 ± 0.9   7.6 ± 1.0   7.7 ± 1.1   8.0 ± 1.0   7.9 ± 1.0
  Tongyi Qianwen      8.0 ± 1.2   7.5 ± 1.4   7.6 ± 1.3   7.9 ± 1.1   7.7 ± 1.3

Management of adverse drug reactions
  GPT-5.1             9.4 ± 0.5   9.5 ± 0.7   9.3 ± 0.8   9.1 ± 0.9   9.3 ± 0.7
  Kimi K2             9.0 ± 0.7   8.7 ± 1.1   8.7 ± 0.9   8.2 ± 0.7   8.6 ± 1.0
  DeepSeek            8.7 ± 0.8   8.6 ± 1.0   8.5 ± 1.1   8.3 ± 0.6   8.5 ± 0.9
  Gemini 3.0          8.4 ± 0.9   8.1 ± 1.2   8.2 ± 1.0   8.5 ± 0.8   8.3 ± 1.0
  Wenxin Yiyan        8.5 ± 1.0   8.0 ± 1.3   8.1 ± 1.2   8.2 ± 1.0   8.2 ± 1.1
  Grok 3              8.3 ± 0.8   7.7 ± 0.9   7.8 ± 1.0   8.1 ± 0.9   8.0 ± 0.9
  Tongyi Qianwen      8.1 ± 1.1   7.6 ± 1.2   7.7 ± 1.1   8.0 ± 1.0   7.8 ± 1.1

Drug interactions
  GPT-5.1             9.1 ± 0.7   9.2 ± 0.9   9.0 ± 1.0   8.8 ± 1.1   9.0 ± 0.9
  Kimi K2             8.7 ± 0.9   8.4 ± 1.3   8.4 ± 1.1   7.9 ± 0.9   8.3 ± 1.2
  DeepSeek            8.4 ± 1.0   8.3 ± 1.2   8.2 ± 1.3   8.0 ± 0.8   8.2 ± 1.1
  Gemini 3.0          8.1 ± 1.1   7.8 ± 1.4   7.9 ± 1.2   8.2 ± 1.0   8.0 ± 1.2
  Wenxin Yiyan        8.2 ± 1.2   7.7 ± 1.6   7.8 ± 1.5   7.9 ± 1.3   7.9 ± 1.4
  Grok 3              8.0 ± 1.0   7.4 ± 1.1   7.5 ± 1.2   7.8 ± 1.1   7.7 ± 1.1
  Tongyi Qianwen      7.8 ± 1.3   7.3 ± 1.5   7.4 ± 1.4   7.7 ± 1.2   7.5 ± 1.4

Safety evaluation of specific drugs during pregnancy or lactation
  GPT-5.1             9.0 ± 0.8   9.1 ± 0.9   8.9 ± 1.0   8.7 ± 1.1   8.9 ± 0.9
  Kimi K2             8.6 ± 1.0   8.3 ± 1.4   8.3 ± 1.2   7.8 ± 1.0   8.2 ± 1.3
  DeepSeek            8.3 ± 1.1   8.2 ± 1.3   8.1 ± 1.4   7.9 ± 0.9   8.1 ± 1.2
  Gemini 3.0          8.0 ± 1.2   7.7 ± 1.5   7.8 ± 1.3   8.1 ± 1.1   7.9 ± 1.3
  Wenxin Yiyan        8.3 ± 1.1   8.0 ± 1.4   8.1 ± 1.3   8.1 ± 1.1   8.1 ± 1.2
  Grok 3              7.9 ± 1.0   7.3 ± 1.2   7.4 ± 1.3   7.7 ± 1.1   7.6 ± 1.2
  Tongyi Qianwen      8.0 ± 1.2   7.8 ± 1.4   7.9 ± 1.3   7.9 ± 1.1   7.9 ± 1.3

Complex therapeutic decision-making based on specific clinical scenarios
  GPT-5.1             8.8 ± 0.9   8.9 ± 1.0   8.7 ± 1.1   8.5 ± 1.2   8.7 ± 1.0
  Kimi K2             8.2 ± 1.2   7.2 ± 1.6   7.7 ± 1.4   7.2 ± 1.2   7.6 ± 1.4
  DeepSeek            8.0 ± 1.3   7.7 ± 1.5   7.5 ± 1.6   7.5 ± 1.1   7.7 ± 1.4
  Gemini 3.0          7.5 ± 1.4   7.2 ± 1.7   7.3 ± 1.5   7.8 ± 1.3   7.4 ± 1.5
  Wenxin Yiyan        7.7 ± 1.4   7.2 ± 1.7   7.4 ± 1.6   7.5 ± 1.4   7.5 ± 1.5
  Grok 3              7.5 ± 1.1   6.8 ± 1.3   6.9 ± 1.4   7.3 ± 1.2   7.1 ± 1.3
  Tongyi Qianwen      7.2 ± 1.5   6.5 ± 1.7   6.7 ± 1.6   7.1 ± 1.4   6.9 ± 1.6

Data are presented as mean ± standard deviation (SD). Consultation questions were categorized into drug administration and dosage guidance; management of adverse drug reactions; drug interactions; safety evaluation of specific drugs during pregnancy or lactation; and complex therapeutic decision-making based on specific clinical scenarios. Relevance reflects the extent to which responses directly addressed the consultation question. Accuracy reflects concordance with current clinical guidelines, authoritative references, and accepted pharmacotherapy principles. Usefulness reflects the degree to which responses provided specific, actionable, and clinically applicable recommendations.
Empathy reflects the use of supportive, non-judgmental, and patient-centered language appropriate for perinatal medication consultations. Overall score represents the mean of the four evaluation dimensions. LLM, large language model.

The performance gap between models widened further in complex clinical scenarios involving comorbidities, polypharmacy, or explicit benefit–risk trade-offs, such as the management of gestational asthma. In these scenarios, GPT-5.1 maintained a relatively high accuracy (mean accuracy 8.9; Table 2), frequently providing structured responses that addressed both maternal disease control and fetal safety considerations. For consultation questions involving traditional Chinese medicine, domestic models such as Wenxin Yiyan and Tongyi Qianwen demonstrated relative advantages in relevance and usefulness, occasionally outperforming international models by incorporating locally contextualized clinical information.

Qualitative Findings from Expert Evaluation
Qualitative feedback from the expert panel, based on consensus discussion following independent scoring, revealed recurring strengths and limitations across the model outputs. Commonly identified weaknesses included (1) incomplete or overly cautious recommendations, such as labeling relatively safe medications as “not recommended” without adequate benefit–risk contextualization; (2) excessive use of caveats, which reduced practical usefulness by failing to offer actionable alternatives; (3) lack of explicit reference to authoritative sources, including clinical guidelines or drug safety databases; and (4) occasional logical or factual inconsistencies, particularly in responses generated by lower-performing models.
The identified strengths included (1) integration of multidisciplinary medical knowledge, spanning pharmacology and obstetrics; (2) clear and structured response formats, such as bullet points and subheadings, which improved readability; and (3) emerging empathetic communication, particularly in higher-performing models, which may help alleviate patient anxiety during perinatal medication consultations. Several of these limitations have potential implications for patient safety. Overly cautious or non-specific advice may contribute to unnecessary discontinuation of essential maternal medications, whereas incomplete risk communication may delay appropriate clinical consultation. These findings demonstrate the importance of professional oversight when applying LLMs to high-risk perinatal medication contexts. Discussion This cross-sectional evaluation provides a systematic comparison of seven leading LLMs in response to real-world perinatal medication consultation queries derived from routine clinical pharmacy practices. The findings suggest that LLMs may serve as adjunctive informational and triage-support tools within pharmacist-led consultation services but are not suitable for independent medication assessment or counseling, particularly in the context of pregnancy and lactation, where medication-related decisions directly affect both maternal and fetal outcomes. In addition, these results extend prior evaluations of LLMs in general healthcare settings, including chronic disease management, medication counseling, and clinical decision support, by focusing on the high-risk and ethically sensitive domains of perinatal pharmacotherapy. Previous studies have shown that LLMs can achieve expert-level performance in structured clinical tasks [ 13 ], such as diabetes education, celiac disease counseling, and general medication advice, particularly when questions are well-defined and evidence-based [ 14 ]. 
Consistent with these reports, the top-performing models in the present study, particularly GPT-5.1, achieved high scores for accuracy and usefulness, reflecting their ability to synthesize guideline-based recommendations. This performance aligns with the documented strengths of LLMs in rapid information retrieval, integration of heterogeneous data sources, and structured output generation, which may enhance efficiency in clinical pharmacy services without replacing the contextual judgment and accountability of human clinicians [ 15 ].

Nevertheless, performance disparities became more pronounced in clinically complex scenarios, including those involving comorbidities, polypharmacy, or explicit benefit–risk trade-offs. Similar limitations have been reported in the evaluation of LLMs applied to other high-stakes specialties, such as oncology and radiology. In the present study, lower-performing models were more prone to generating factual inaccuracies or overly cautious, non-committal recommendations [ 16 ]. In perinatal care, such tendencies are particularly concerning, as excessive risk aversion may lead to undertreatment of maternal conditions and consequently increase the risk of adverse maternal and fetal outcomes. These findings highlight the context-dependent nature of LLM performance and underscore the necessity of rigorous domain-specific evaluation before implementation in high-risk clinical settings [ 13 ]. Notably, medication classes commonly encountered during pregnancy and lactation, including nonsteroidal anti-inflammatory drugs, psychotropic agents, hormonal therapies, and medications requiring individualized benefit–risk assessment, represent scenarios in which unsupervised LLM-generated advice may be especially inappropriate. In such contexts, even minor omissions or disproportionate caution may have clinically meaningful consequences.
Several factors may underlie the observed limitations, particularly structural constraints in the available training data. Ethical and practical considerations have resulted in a paucity of randomized controlled trials involving pregnant and lactating populations, with much of the evidence for perinatal medication safety relying on observational studies that are vulnerable to confounding and selection bias [ 17 – 18 ]. Therefore, LLMs trained in such studies may reproduce the existing uncertainty and inconsistency. These challenges are further amplified in settings where high-quality perinatal pharmacological data remain limited. Accordingly, the marginal advantages observed for domestic models in queries related to traditional Chinese medicine likely reflect greater linguistic and contextual alignment with local clinical practice rather than intrinsic model superiority, highlighting the importance of localized data curation to support equitable and safe LLM deployment across healthcare systems [ 19 ]. Empathy emerged as an additional differentiation dimension among the evaluated models. GPT-5.1 consistently employed supportive and non-judgmental language, achieving higher empathy scores and potentially addressing maternal anxiety, which is a clinically relevant concern during the perinatal period. While empathetic communication using digital tools may enhance patient engagement and trust [ 10 ], even the highest-performing models frequently rely on generic expressions. Further prospective research is required to determine whether LLM-generated empathy translates into measurable improvements in patient experience, medication adherence, or clinical outcomes. These findings should be interpreted within the broader context of healthcare access and workforce constraints. Shortages of trained clinical pharmacists and regional disparities in maternal healthcare services continue to pose challenges, particularly in rural and underserved areas [ 20 ]. 
In this context, LLMs offer a potential means of supporting scalable, low-cost, and timely access to medication-related information, thereby complementing existing perinatal pharmacy services [ 21 – 22 ]. However, this potential can only be realized if LLM deployment is accompanied by rigorous governance frameworks, explicit scope limitations, and sustained human oversight. While domestic models show promise in addressing locally relevant practices, including traditional medicine, substantial improvements in evidence integration, risk stratification, and transparency are required to approach the reliability needed for routine clinical pharmacy support. From a practical implementation perspective, a clear differentiation between professional- and patient-facing applications is essential. Pharmacists may reasonably employ LLMs as supervised decision support or information-triage tools, whereas direct patient use in perinatal settings should be restricted to educational purposes with explicit safeguards against independent medication decision-making. Several limitations of this study should be considered when interpreting these findings. First, the cross-sectional design reflects model performance within a defined time frame, and the results may change as LLMs continue to evolve. Second, single-response sampling may not fully capture intra-model variability; however, this approach was intentionally selected to approximate real-world first-use consultation scenarios. Third, although the consultation questions were derived from routine clinical practice, they may not fully represent the diversity of perinatal medication concerns across different regions or healthcare settings. Finally, while the responses were evaluated by experienced clinical pharmacologists using a blinded framework, reliance on a single expert panel from one institution may limit generalizability. 
Conclusion

Although large language models demonstrate promising performance in structured evaluations of perinatal medication information, their current capabilities do not support independent use in medication decision-making or counseling during pregnancy or lactation. At present, LLMs should be regarded solely as supervised, adjunctive tools that may assist clinical pharmacists in information retrieval and preliminary risk assessment rather than as substitutes for professional judgment. Future development and implementation of LLMs in perinatal care will require continuous integration of authoritative clinical guidelines, high-quality evidence sources, and rigorously curated pharmacotherapy data to improve response accuracy, consistency, and contextual appropriateness. Only through iterative refinement, domain-specific validation, and robust governance frameworks can LLMs evolve to provide meaningful support for perinatal pharmacy services. Accordingly, any integration of LLMs into perinatal medication consultation workflows should prioritize human oversight, a clearly defined scope of use, and regulatory safeguards, ensuring that maternal and fetal safety remain central to clinical decision-making.

Declarations

Acknowledgments
None.

Funding
This study was supported by the Beijing Municipal Science and Technology Commission (Grant No. 7244462).

Competing Interests
The authors declare that they have no competing interests.

Author Contributions
R. W. and Y. L. contributed to the methodology, software, writing, reviewing, and editing. R. W. drafted the manuscript. X. F. and Y. L. contributed to data curation. R. W. and X. F. performed visualization. X. F. was responsible for conceptualization, resources, supervision, project administration, and funding acquisition. All the authors have read and approved the final manuscript.
Data Availability
The datasets generated and analyzed in the current study are available from the corresponding author upon reasonable request.

Ethics Approval
The study protocol was reviewed and approved by the Ethics Committee of the Beijing Obstetrics and Gynecology Hospital, Capital Medical University (Approval No.: 2024-KY-035-01). All consultation records were fully anonymized prior to the analysis, and the study was conducted in accordance with the principles of the Declaration of Helsinki.

References
1. Jordan S, Bromley R, Damase-Michel C, et al. Breastfeeding, pregnancy, medicines, neurodevelopment, and population databases: the information desert. Int Breastfeed J. 2022;17(1):55. https://doi.org/10.1186/s13006-022-00494-5
2. Huybrechts KF, Bateman BT, Hernández-Díaz S. Modern evidence generation on medication effectiveness and safety during pregnancy: study design considerations. Clin Pharmacol Ther. 2025;117:895–909. https://doi.org/10.1002/cpt.3598
3. Desaunay P, Eude LG, Dreyfus M, et al. Benefits and risks of antidepressant drugs during pregnancy: a systematic review of meta-analyses. Paediatr Drugs. 2023;25(3):247–65. https://doi.org/10.1007/s40272-023-00561-2
4. Howard LM, Molyneaux E, Dennis CL, et al. Non-psychotic mental disorders in the perinatal period. Lancet. 2014;384:1775–88. https://doi.org/10.1016/S0140-6736(14)61276-9
5. Elliott RA, Lee CY, Beanland C, et al. Development of a clinical pharmacy model within an Australian home nursing service using co-creation and participatory action research: the Visiting Pharmacist (ViP) study. BMJ Open. 2017;7(11):e018722. https://doi.org/10.1136/bmjopen-2017-018722
6. Damkier P, Huybrechts KF, Nordeng H. Big data in the assessment of medication safety in pregnancy: opportunities and challenges. Paediatr Drugs. 2025;27:673–7. https://doi.org/10.1007/s40272-025-00718-1
7. Habets PC, van IJzendoorn DG, Vinkers CH, et al. Development and validation of a machine-learning algorithm to predict the relevance of scientific articles within the field of teratology. Reprod Toxicol. 2022;113:150–4. https://doi.org/10.1016/j.reprotox.2022.09.001
8. Topol EJ. High-performance medicine: the convergence of human and artificial intelligence. Nat Med. 2019;25:44–56. https://doi.org/10.1038/s41591-018-0300-7
9. Liu J, Wang C, Liu S. Utility of ChatGPT in clinical practice. J Med Internet Res. 2023;25:e48568. https://doi.org/10.2196/48568
10. Ayers JW, Poliak A, Dredze M, et al. Comparing physician and artificial intelligence chatbot responses to patient questions posted to a public social media forum. JAMA Intern Med. 2023;183(6):589–96. https://doi.org/10.1001/jamainternmed.2023.1838
11. Hu YJ, Said JM, Cheong JLY. Rethinking medication safety in pregnancy and infancy: how target trial emulation and real-world data bridge the evidence gap. J Clin Epidemiol. 2025;181:111747. https://doi.org/10.1016/j.jclinepi.2025.111747
12. Kell G, Roberts A, Umansky S, et al. Question answering systems for health professionals at the point of care—a systematic review. J Am Med Inform Assoc. 2024;31(4):1009–24. https://doi.org/10.1093/jamia/ocae015
13. Nori H, King N, McKinney SM, et al. Capabilities of GPT-4 on medical challenge problems. arXiv:2303.13375 [Preprint]. 2023. Available from: https://arxiv.org/abs/2303.13375. Accessed 30 Dec 2025.
14. Sheng B, Guan Z, Lim LL, et al. Large language models for diabetes care: potentials and prospects. Sci Bull. 2024;69(5):583–8. https://doi.org/10.1016/j.scib.2024.01.004
15. Ji Z, Lee N, Frieske R, et al. Survey of hallucination in natural language generation. ACM Comput Surv. 2023;55(12):1–38. https://doi.org/10.1145/3571730
16. Ji Z, Lee N, Frieske R, et al. Survey of hallucination in natural language generation. ACM Comput Surv. 2023;55(12):1–38. https://doi.org/10.1145/3571730
17. Jorgensen SCJ, Miljanic S, Tabbara N, et al. Inclusion of pregnant and breastfeeding women in nonobstetrical randomized controlled trials. Am J Obstet Gynecol MFM. 2022;4(6):100700. https://doi.org/10.1016/j.ajogmf.2022.100700
18. Jia Y, Wang J, Liu C, et al. The methodological quality of observational studies examining the risk of pregnancy drug use on congenital malformations needs substantial improvement: a cross-sectional survey. Drug Saf. 2024;47(11):1171–88. https://doi.org/10.1007/s40264-024-01465-x
19. De Vries PLM, Baud D, Baggio S, et al. Enhancing perinatal health patient information through ChatGPT—an accuracy study. PEC Innov. 2025;6:100381. https://doi.org/10.1016/j.pecinn.2025.100381
20. Mwakawanga DL, Mutagonda RF, Mlyuka HJ, et al. Improving the provision of clinical pharmacy services in low- and middle-income countries: a qualitative study in tertiary health facilities in Tanzania. BMJ Public Health. 2025;3(1):e001776. https://doi.org/10.1136/bmjph-2024-001776
21. Grünebaum A, Chervenak FA, Pollet SL, et al. The exciting potential for ChatGPT in obstetrics and gynecology. Am J Obstet Gynecol. 2023;228:696–705. https://doi.org/10.1016/j.ajog.2023.03.009
22. Peled T, Sela HY, Weiss A, Grisaru-Granovsky S, et al. Evaluating the validity of ChatGPT responses on common obstetric issues: potential clinical applications and implications. Int J Gynaecol Obstet. 2024;166(3):1127–33. https://doi.org/10.1002/ijgo.15501

Additional Declarations
No competing interests reported.
Supplementary Files
SupplementaryTableS1.docx

Editorial decision: Revision requested 09 Mar, 2026
Reviews received at journal 09 Mar, 2026
Reviews received at journal 26 Feb, 2026
Reviewers agreed at journal 26 Feb, 2026
Reviewers agreed at journal 22 Feb, 2026
Reviewers invited by journal 02 Feb, 2026
Editor assigned by journal 27 Jan, 2026
Submission checks completed at journal 27 Jan, 2026
First submitted to journal 26 Jan, 2026
Also discoverable on Platform About Our Team In Review Editorial Policies Advisory Board Help Center Resources Author Services Accessibility API Access RSS feed Manage Cookie Preferences © Research Square 2026 | ISSN 2693-5015 (online) Privacy Policy Terms of Service Do Not Sell My Personal Information {"props":{"pageProps":{"initialData":{"identity":"rs-8696873","acceptedTermsAndConditions":true,"allowDirectSubmit":false,"archivedVersions":[],"articleType":"Research Article","associatedPublications":[],"authors":[{"id":584989732,"identity":"d4f4bbfc-8d8c-46b5-83f0-e4b1a6dffeea","order_by":0,"name":"RAN WANG","email":"","orcid":"","institution":"Beijing Obstetrics and Gynecology Hospital, Capital Medical University /Beijing Maternal and Child Health Care Hospital","correspondingAuthor":false,"prefix":"","firstName":"RAN","middleName":"","lastName":"WANG","suffix":""},{"id":584989733,"identity":"3052fd6d-647a-4719-8f47-4d6ff41f054d","order_by":1,"name":"Yifan Li","email":"","orcid":"","institution":"Beijing Obstetrics and Gynecology Hospital, Capital Medical University /Beijing Maternal and Child Health Care Hospital","correspondingAuthor":false,"prefix":"","firstName":"Yifan","middleName":"","lastName":"Li","suffix":""},{"id":584989734,"identity":"ec27d7d4-5ebe-4496-aee6-d00bca37c371","order_by":2,"name":"Xuewei Feng","email":"","orcid":"","institution":"Beijing Obstetrics and Gynecology Hospital, Capital Medical University /Beijing Maternal and Child Health Care Hospital","correspondingAuthor":false,"prefix":"","firstName":"Xuewei","middleName":"","lastName":"Feng","suffix":""},{"id":584989735,"identity":"dd790f95-35fb-44ea-b865-73c5045ba9d6","order_by":3,"name":"Xin 
Feng","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAAA8klEQVRIiWNgGAWjYBAC9gYQacDAw8bM//ABVNAArxaeAxA1MnzsPMwwpcRoYWCwkePnYZMgTgt77+HXPAV3gA7jPVbxo2ZbYgN78zYJhpo7uLXwnEuznGHwDKiFL+1mz7HbiQ08x8okGI49w6nFXiLHzOCDwWGgFgaz24wNQC1AEQnGhsO4bZF/Y2aQANVSDNYCFMGvRYLH+AHEFh4zZogtPAS08OSYMc4Aa2FLlgT6xbiNJ63YIuEYHi3sZ4w/8/w5bC/ff/jghx81t2X72Q9vvPGhBrcWIIBHB5QLIhLwaWBgYP6AX34UjIJRMApGPAAAxRhMtrCA5/EAAAAASUVORK5CYII=","orcid":"","institution":"Beijing Obstetrics and Gynecology Hospital, Capital Medical University /Beijing Maternal and Child Health Care Hospital","correspondingAuthor":true,"prefix":"","firstName":"Xin","middleName":"","lastName":"Feng","suffix":""}],"badges":[],"createdAt":"2026-01-26 05:53:14","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-8696873/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-8696873/v1","draftVersion":[],"editorialEvents":[{"content":"https://doi.org/10.1007/s11096-026-02138-8","type":"published","date":"2026-04-27T15:58:00+00:00"}],"editorialNote":"","failedWorkflow":false,"files":[{"id":108440277,"identity":"7a1e720a-512f-4233-a7c5-410d420ed222","added_by":"auto","created_at":"2026-05-04 16:35:28","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":368650,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-8696873/v1/17188da4-3c17-42ad-86f2-c495a683c2d2.pdf"},{"id":101842201,"identity":"03b90f42-c921-4c25-8afd-508232af5bb5","added_by":"auto","created_at":"2026-02-04 08:44:06","extension":"docx","order_by":2,"title":"","display":"","copyAsset":false,"role":"supplement","size":14669,"visible":true,"origin":"","legend":"","description":"","filename":"SupplementaryTableS1.docx","url":"https://assets-eu.researchsquare.com/files/rs-8696873/v1/6b66466072c56c718bded7d5.docx"}],"financialInterests":"No competing 
interests reported.","formattedTitle":"Performance Evaluation of Large Language Models in Real-World Perinatal Medication Consultations: A Cross-Sectional Study","fulltext":[{"header":"Impacts on Practice","content":"\u003cul\u003e\n \u003cli\u003eLarge language models may be used by clinical pharmacists as supervised decision-support tools to assist with information retrieval during perinatal medication consultations but should not replace professional clinical judgment or individualized risk\u0026ndash;benefit assessment.\u003c/li\u003e\n \u003cli\u003eThe independent or unsupervised use of LLMs for perinatal medication counseling poses potential patient safety risks, highlighting the need for pharmacist oversight, clear scope limitations, and structured governance in clinical practice.\u003c/li\u003e\n \u003cli\u003eThe integration of LLMs into perinatal pharmacy services may help improve access to medication information in resource-limited settings, provided that models are rigorously validated, regularly updated with authoritative evidence, and embedded within pharmacist-led consultation workflows.\u003c/li\u003e\n\u003c/ul\u003e"},{"header":"Introduction","content":"\u003cp\u003eThe perinatal period, encompassing pregnancy and early postpartum, involves complicated prescribing decisions with significant risks, including teratogenic effects and adverse fetal outcomes, such as congenital anomalies, preterm birth, and neurodevelopmental disorders [\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e]. Medication use during pregnancy and lactation requires careful benefit–risk assessment, which is often complicated by limited high-quality evidence and ethical constraints that preclude randomized controlled trials in these populations [\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e]. 
For example, selective serotonin reuptake inhibitors have been associated with an increased risk of preterm birth, whereas paroxetine exposure has been linked to a higher incidence of fetal cardiac malformations [\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e]. Therefore, clinical decision-making during the perinatal period requires individualized assessment, taking into account disease severity, minimal effective dosing, potential drug–drug interactions, and the risks of untreated maternal conditions. Untreated maternal illnesses, such as mental health disorders, may exacerbate adverse obstetric outcomes, including recurrent or persistent depression [\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e].\u003c/p\u003e \u003cp\u003ePerinatal pharmacotherapy consultations, frequently led by clinical pharmacists, play a critical role in supporting safe medication use during pregnancy and lactation [\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e]. Consultations typically rely on the integration of clinical expertise, practice guidelines, and evidence-based drug safety databases. However, accessibility and resource limitations remain substantial challenges, particularly in settings with a shortage of trained clinical pharmacists or barriers to in-person consultation [\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e]. Additionally, increasing the clinical workload and interdisciplinary coordination gaps may further limit the timeliness and consistency of medication counseling. 
While specialized resources, such as the Dutch Teratology Information Service [\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e], have been developed to improve access to drug safety information, their utility may be constrained by information overload, language barriers, and limited validation in complex clinical scenarios, including polypharmacy and drug interaction assessment.\u003c/p\u003e \u003cp\u003eLarge language models (LLMs), such as ChatGPT, have recently attracted attention in healthcare because of their potential to provide medication-related information and support clinical education and decision making, particularly in environments with limited access to specialist care [\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e]. Previous studies have reported favorable clinician perceptions of LLM utility in selected areas, including diabetes management and general medication counseling [\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e–\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e]. Nevertheless, important concerns remain regarding data privacy, bias, accuracy, and safety, particularly when LLMs are applied to high-risk clinical contexts. Although LLMs may offer opportunities to enhance access to perinatal medication information, their reliability and appropriateness in perinatal pharmacotherapy, where clinical consequences may affect both maternal and fetal outcomes, require careful evaluation [\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eDespite the growing interest in the clinical application of LLMs, their effectiveness in supporting perinatal pharmacotherapy consultations remains insufficiently explored. 
Most existing evaluations focus on general medical questions, simulated scenarios, or English language settings, with a limited assessment of real-world consultation data [\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e]. In addition, linguistic and contextual biases in the training data may affect LLM performance across different healthcare systems and practice environments. Therefore, robust evaluation using real-world consultation scenarios is essential to determine whether LLMs can be safely integrated as supportive tools within pharmacist-led perinatal medication services.\u003c/p\u003e\n\u003ch3\u003eAim\u003c/h3\u003e\n\u003cp\u003eThis study aimed to evaluate and compare the performance of multiple large language models in responding to real-world perinatal medication consultation queries, with a focus on accuracy, usefulness, relevance, and empathy, and to assess their potential role as supervised adjunctive decision-support tools in clinical pharmacy practice.\u003c/p\u003e \u003cdiv id=\"Sec3\" class=\"Section2\"\u003e \u003cdiv id=\"Sec4\" class=\"Section3\"\u003e \u003c/div\u003e \u003c/div\u003e\n\n \n\n \n\n \n\n"},{"header":"Method","content":"\u003ch2\u003eStudy Design and Setting\u003c/h2\u003e\u003cp\u003eThis cross-sectional study assessed the performance of multiple large language models (LLMs) in addressing real-world perinatal medication consultation queries, using data derived from a pharmacist-led medication consultation service. The study was conducted using electronic consultation records from the Pharmacy Clinic of Beijing Obstetrics and Gynecology Hospital, Capital Medical University, covering the period April 2014 to April 2024.\u003c/p\u003e\u003ch3\u003eDevelopment of the Consultation Question Set\u003c/h3\u003e\u003cp\u003eThe study data were obtained from the electronic medication consultation records database maintained by the pharmacy clinic. 
An initial descriptive analysis of 15,280 consultation records was performed to identify frequently encountered consultation topics and commonly used medications. The development of the standardized consultation question set followed a three-step process.\u003c/p\u003e\u003cp\u003eTopic Classification: Consultation queries were categorized into five thematic domains: medication safety during pregnancy and lactation, drug administration and dosage guidance, adverse drug reaction management, drug interactions, and complex clinical decision-making scenarios.\u003c/p\u003e\u003cp\u003eFrequency Ranking: Queries within each thematic domain were ranked according to their frequency of occurrence in routine clinical practice.\u003c/p\u003e\u003cp\u003eSelection of Representative Questions: The three to five most frequent questions within each domain were selected and supplemented by a limited number of low frequency but clinically complex queries to ensure adequate representation of real-world practice. This process resulted in a standardized test set of 64 perinatal medication consultation questions. The composition of the finalized question set is summarized in \u003cb\u003eSupplementary Table S1\u003c/b\u003e.\u003c/p\u003e\u003ch3\u003eSelection of Large Language Models\u003c/h3\u003e\u003cp\u003eSeven representative LLMs that were publicly accessible as of October 2025 were selected for evaluation. Model selection aims to include internationally leading models, widely used domestic models, and emerging high-performance systems, enabling comprehensive horizontal comparison. The evaluated models were classified as follows: 1) international models: GPT-5.1, Grok 3, and Gemini 3.0; 2) domestic models: DeepSeek, Wenxin Yiyan, Kimi K2, and Tongyi Qianwen; and 3) response generation protocols.\u003c/p\u003e\u003cp\u003eA standardized response-generation protocol was applied to ensure consistency across the model outputs. 
Consultation questions were reformulated using the CO-STAR framework (Context, Objective, Style, Tone, Audience, and Response format) to minimize variability related to the prompt structure. Each model was queried once per question, reflecting a realistic first-use consultation scenario, rather than optimized prompt engineering or repeated sampling. Model responses were collected between October 13 and October 17, 2025. This approach was intended to evaluate the quality and safety of the information generated in isolated consultation scenarios, rather than to simulate real-time clinical decision-making or individualized prescriptions.\u003c/p\u003e\u003ch3\u003ePerformance Evaluation Framework\u003c/h3\u003e\u003cp\u003eA panel of four senior clinical pharmacologists, each with 10–15 years of experience in perinatal pharmacotherapy, independently evaluated all the model-generated responses. Responses were assessed across four predefined dimensions (relevance, accuracy, usefulness, and empathy) using a 10-point Likert scale (1 = completely inadequate; 10 = completely adequate). The overall score for each response was calculated as the mean of the four-dimensional scores. Initial ratings were performed independently, and discrepancies were resolved through panel consensus.\u003c/p\u003e\u003ch2\u003eStatistical Analysis\u003c/h2\u003e\u003cp\u003eStatistical analyses were conducted using the SPSS software (version 26.0). Because the score distributions did not meet the assumptions of normality, non-parametric statistical methods were applied for inferential analyses. Descriptive results are reported as mean ± standard deviation (SD) to facilitate comparison with previous LLM evaluation studies. Inter-rater reliability was assessed using the intraclass correlation coefficient (ICC) based on the initial independent ratings prior to consensus discussion. 
Differences in performance among models were examined using the Kruskal–Wallis H test, followed by Dunn’s post-hoc test for pairwise comparisons, where appropriate. Statistical significance was defined as a two-sided p-value \u0026lt; 0.05.\u003c/p\u003e\u003ch3\u003eEthics Approval\u003c/h3\u003e\u003cp\u003e The study protocol was reviewed and approved by the Ethics Committee of the Beijing Obstetrics and Gynecology Hospital, Capital Medical University (Approval No.: 2024-KY-035-01). All consultation records were fully anonymized prior to the analysis, and the study was conducted in accordance with the principles of the Declaration of Helsinki.\u003c/p\u003e"},{"header":"Results","content":"\u003cdiv id=\"Sec11\" class=\"Section2\"\u003e \u003ch2\u003eComposition of the Consultation Question Set\u003c/h2\u003e \u003cp\u003eThe final test set consisted of 64 unique perinatal medication consultation questions derived from an analysis of 15,280 de-identified electronic consultation records collected between April 2014 and April 2024. 
Based on consultation frequency and clinical characteristics, the questions were categorized into five major domains (\u003cb\u003eSupplementary Table S1)\u003c/b\u003e: 1) safety evaluation of specific drugs during pregnancy or lactation (n\u0026thinsp;=\u0026thinsp;25, 39.1%), 2) drug administration and dosage guidance (n\u0026thinsp;=\u0026thinsp;12, 18.8%), 3) management of adverse drug reactions (n\u0026thinsp;=\u0026thinsp;8, 12.5%), 4) drug interactions (n\u0026thinsp;=\u0026thinsp;7, 10.9%), and 5) complex therapeutic decision-making based on specific clinical scenarios (n\u0026thinsp;=\u0026thinsp;12, 18.8%).\u003c/p\u003e \u003cp\u003eThis distribution reflects the most frequent and clinically relevant medication-related concerns encountered during pharmacist-led perinatal consultations in routine practice.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec12\" class=\"Section2\"\u003e \u003ch2\u003eOverall Performance of Large Language Models\u003c/h2\u003e \u003cp\u003eA total of 448 model-generated responses (64 questions \u0026times; 7 models) were independently evaluated by an expert panel. Inter-rater reliability for overall scores was excellent, with an ICC of 0.91 (95% CI 0.88\u0026ndash;0.94), indicating high consistency among evaluators.\u003c/p\u003e \u003cp\u003eSignificant differences in overall performance were observed among the seven LLMs (Kruskal\u0026ndash;Wallis test, p\u0026thinsp;\u0026lt;\u0026thinsp;0.001). As shown in Table\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e, GPT-5.1 achieved the highest mean overall score (9.1\u0026thinsp;\u0026plusmn;\u0026thinsp;0.8), demonstrating consistently high performance across relevance, accuracy, usefulness, and empathy, which together reflect task alignment, clinical correctness, practical applicability, and patient-centered communication. 
Among the domestic models, Kimi K2 ranked highest (8.4\u0026thinsp;\u0026plusmn;\u0026thinsp;1.2), followed by DeepSeek (8.2\u0026thinsp;\u0026plusmn;\u0026thinsp;1.1). In contrast, Tongyi Qianwen showed a comparatively lower overall performance, particularly in terms of accuracy and usefulness.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab1\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003ePerformance scores of large language models across evaluation dimensions\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"6\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\"\u0026plusmn;\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\"\u0026plusmn;\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\"\u0026plusmn;\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\"\u0026plusmn;\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\"\u0026plusmn;\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eModel\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eRelevance\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eAccuracy\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eUsefulness\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003eEmpathy\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c6\"\u003e \u003cp\u003eOverall 
score\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eGPT-5.1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c2\"\u003e \u003cp\u003e9.2\u0026thinsp;\u0026plusmn;\u0026thinsp;0.7\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c3\"\u003e \u003cp\u003e9.3\u0026thinsp;\u0026plusmn;\u0026thinsp;0.9\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c4\"\u003e \u003cp\u003e9.1\u0026thinsp;\u0026plusmn;\u0026thinsp;1.0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c5\"\u003e \u003cp\u003e8.9\u0026thinsp;\u0026plusmn;\u0026thinsp;1.1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c6\"\u003e \u003cp\u003e9.1\u0026thinsp;\u0026plusmn;\u0026thinsp;0.8\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eKimi K2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c2\"\u003e \u003cp\u003e8.8\u0026thinsp;\u0026plusmn;\u0026thinsp;0.9\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c3\"\u003e \u003cp\u003e8.5\u0026thinsp;\u0026plusmn;\u0026thinsp;1.3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c4\"\u003e \u003cp\u003e8.5\u0026thinsp;\u0026plusmn;\u0026thinsp;1.1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c5\"\u003e \u003cp\u003e8.0\u0026thinsp;\u0026plusmn;\u0026thinsp;0.9\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c6\"\u003e \u003cp\u003e8.4\u0026thinsp;\u0026plusmn;\u0026thinsp;1.2\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e 
\u003cp\u003eDeepSeek\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c2\"\u003e \u003cp\u003e8.5\u0026thinsp;\u0026plusmn;\u0026thinsp;1.0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c3\"\u003e \u003cp\u003e8.4\u0026thinsp;\u0026plusmn;\u0026thinsp;1.2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c4\"\u003e \u003cp\u003e8.3\u0026thinsp;\u0026plusmn;\u0026thinsp;1.3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c5\"\u003e \u003cp\u003e8.1\u0026thinsp;\u0026plusmn;\u0026thinsp;0.8\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c6\"\u003e \u003cp\u003e8.2\u0026thinsp;\u0026plusmn;\u0026thinsp;1.1\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eGemini 3.0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c2\"\u003e \u003cp\u003e8.2\u0026thinsp;\u0026plusmn;\u0026thinsp;1.1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c3\"\u003e \u003cp\u003e7.9\u0026thinsp;\u0026plusmn;\u0026thinsp;1.4\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c4\"\u003e \u003cp\u003e8.0\u0026thinsp;\u0026plusmn;\u0026thinsp;1.2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c5\"\u003e \u003cp\u003e8.3\u0026thinsp;\u0026plusmn;\u0026thinsp;1.0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c6\"\u003e \u003cp\u003e8.1\u0026thinsp;\u0026plusmn;\u0026thinsp;1.2\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eWenxin Yiyan\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c2\"\u003e 
\u003cp\u003e8.3\u0026thinsp;\u0026plusmn;\u0026thinsp;1.2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c3\"\u003e \u003cp\u003e7.8\u0026thinsp;\u0026plusmn;\u0026thinsp;1.6\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c4\"\u003e \u003cp\u003e7.9\u0026thinsp;\u0026plusmn;\u0026thinsp;1.5\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c5\"\u003e \u003cp\u003e8.0\u0026thinsp;\u0026plusmn;\u0026thinsp;1.3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c6\"\u003e \u003cp\u003e7.9\u0026thinsp;\u0026plusmn;\u0026thinsp;1.5\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eGrok 3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c2\"\u003e \u003cp\u003e8.1\u0026thinsp;\u0026plusmn;\u0026thinsp;1.0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c3\"\u003e \u003cp\u003e7.5\u0026thinsp;\u0026plusmn;\u0026thinsp;1.1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c4\"\u003e \u003cp\u003e7.6\u0026thinsp;\u0026plusmn;\u0026thinsp;1.2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c5\"\u003e \u003cp\u003e7.9\u0026thinsp;\u0026plusmn;\u0026thinsp;1.1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c6\"\u003e \u003cp\u003e7.8\u0026thinsp;\u0026plusmn;\u0026thinsp;1.1\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eTongyi Qianwen\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c2\"\u003e \u003cp\u003e7.9\u0026thinsp;\u0026plusmn;\u0026thinsp;1.3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c3\"\u003e 
\u003cp\u003e7.4\u0026thinsp;\u0026plusmn;\u0026thinsp;1.5\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c4\"\u003e \u003cp\u003e7.5\u0026thinsp;\u0026plusmn;\u0026thinsp;1.4\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c5\"\u003e \u003cp\u003e7.8\u0026thinsp;\u0026plusmn;\u0026thinsp;1.2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c6\"\u003e \u003cp\u003e7.6\u0026thinsp;\u0026plusmn;\u0026thinsp;1.4\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003ctfoot\u003e \u003ctr\u003e\u003ctd colspan=\"6\"\u003eData are presented as mean\u0026thinsp;\u0026plusmn;\u0026thinsp;standard deviation (SD).\u003c/td\u003e\u003c/tr\u003e \u003ctr\u003e\u003ctd colspan=\"6\"\u003eRelevance reflects the degree to which responses addressed the core consultation question.\u003c/td\u003e\u003c/tr\u003e \u003ctr\u003e\u003ctd colspan=\"6\"\u003eAccuracy reflects concordance with current clinical guidelines, authoritative references, and accepted pharmacotherapy principles.\u003c/td\u003e\u003c/tr\u003e \u003ctr\u003e\u003ctd colspan=\"6\"\u003eUsefulness reflects the provision of specific, actionable, and clinically applicable recommendations.\u003c/td\u003e\u003c/tr\u003e \u003ctr\u003e\u003ctd colspan=\"6\"\u003eEmpathy reflects the use of supportive, non-judgmental, and patient-centered language.\u003c/td\u003e\u003c/tr\u003e \u003ctr\u003e\u003ctd colspan=\"6\"\u003eOverall score represents the mean of the four evaluation dimensions.\u003c/td\u003e\u003c/tr\u003e \u003ctr\u003e\u003ctd colspan=\"6\"\u003eLLM, large language model.\u003c/td\u003e\u003c/tr\u003e \u003c/tfoot\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec13\" class=\"Section2\"\u003e \u003ch2\u003eDimension-Specific Performance Analysis\u003c/h2\u003e \u003cp\u003eAcross all evaluated dimensions, accuracy 
emerged as the primary factor differentiating model performance. GPT-5.1 achieved the highest accuracy score (9.3\u0026thinsp;\u0026plusmn;\u0026thinsp;0.9), with responses most frequently aligned with current clinical guidelines, authoritative literature, and established pharmacotherapy principles. For example, in queries concerning ibuprofen use during the second and third trimesters, GPT-5.1 consistently identified risks such as premature closure of the ductus arteriosus and recommended appropriate alternative analgesics. In contrast, some lower-scoring models provided vague recommendations (e.g., \u0026ldquo;use with caution\u0026rdquo;) or omitted clinically important risk information.\u003c/p\u003e \u003cp\u003eRelevance scores were consistently high across all models, indicating that most responses directly addressed the core consultation questions without introducing irrelevant content. This finding suggests that basic task understanding is largely preserved across models.\u003c/p\u003e \u003cp\u003eGreater variability was observed for usefulness and empathy. GPT-5.1 (8.9\u0026thinsp;\u0026plusmn;\u0026thinsp;1.1) and Gemini 3.0 (8.3\u0026thinsp;\u0026plusmn;\u0026thinsp;1.0) achieved the highest empathy scores, frequently using supportive and non-judgmental language that acknowledged patient concerns, such as anxiety related to medication safety during pregnancy, and encouraged appropriate clinical follow-up. Usefulness scores were closely associated with the provision of specific and actionable guidance. 
High-performing models, including GPT-5.1 and Kimi K2, more often provided tailored recommendations, such as explicit instructions for managing missed doses of dydrogesterone (a synthetic progestogen) during pregnancy, rather than relying on generic advisory statements.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec14\" class=\"Section2\"\u003e \u003ch2\u003ePerformance by Question Complexity\u003c/h2\u003e \u003cp\u003eModel performance varied substantially according to consultation complexity (Table\u0026nbsp;\u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e2\u003c/span\u003e). For direct evidence-based questions, such as adverse drug reaction management and dosage guidance, most models achieved satisfactory performance, with mean accuracy scores exceeding 7.5. In contrast, questions related to drug safety during pregnancy and lactation demonstrated lower overall scores, reflecting increased complexity and a higher risk of factual deviation in this domain.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab2\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 2\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003ePerformance scores of large language models by consultation question type\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"7\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\"\u0026plusmn;\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\"\u0026plusmn;\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\"\u0026plusmn;\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\"\u0026plusmn;\" 
class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\"\u0026plusmn;\" class=\"colspec\" colname=\"c7\" colnum=\"7\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eQuestion Type\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eLLMs\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eRelevance\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eAccuracy\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003eUsefulness\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c6\"\u003e \u003cp\u003eEmpathy\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c7\"\u003e \u003cp\u003eOverall Score\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"6\" rowspan=\"7\"\u003e \u003cp\u003e\u003cb\u003eDrug administration and dosage guidance\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eGPT-5.1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c3\"\u003e \u003cp\u003e9.3\u0026thinsp;\u0026plusmn;\u0026thinsp;0.6\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c4\"\u003e \u003cp\u003e9.4\u0026thinsp;\u0026plusmn;\u0026thinsp;0.8\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c5\"\u003e \u003cp\u003e9.2\u0026thinsp;\u0026plusmn;\u0026thinsp;0.9\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c6\"\u003e \u003cp\u003e9.0\u0026thinsp;\u0026plusmn;\u0026thinsp;1.0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c7\"\u003e 
\u003cp\u003e9.2\u0026thinsp;\u0026plusmn;\u0026thinsp;0.8\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eKimi K2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c3\"\u003e \u003cp\u003e8.9\u0026thinsp;\u0026plusmn;\u0026thinsp;0.8\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c4\"\u003e \u003cp\u003e8.6\u0026thinsp;\u0026plusmn;\u0026thinsp;1.2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c5\"\u003e \u003cp\u003e8.6\u0026thinsp;\u0026plusmn;\u0026thinsp;1.0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c6\"\u003e \u003cp\u003e8.1\u0026thinsp;\u0026plusmn;\u0026thinsp;0.8\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c7\"\u003e \u003cp\u003e8.5\u0026thinsp;\u0026plusmn;\u0026thinsp;1.1\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eDeepSeek\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c3\"\u003e \u003cp\u003e8.6\u0026thinsp;\u0026plusmn;\u0026thinsp;0.9\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c4\"\u003e \u003cp\u003e8.5\u0026thinsp;\u0026plusmn;\u0026thinsp;1.1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c5\"\u003e \u003cp\u003e8.4\u0026thinsp;\u0026plusmn;\u0026thinsp;1.2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c6\"\u003e \u003cp\u003e8.2\u0026thinsp;\u0026plusmn;\u0026thinsp;0.7\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c7\"\u003e \u003cp\u003e8.4\u0026thinsp;\u0026plusmn;\u0026thinsp;1.0\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e 
\u003cp\u003eGemini 3.0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c3\"\u003e \u003cp\u003e8.3\u0026thinsp;\u0026plusmn;\u0026thinsp;1.0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c4\"\u003e \u003cp\u003e8.0\u0026thinsp;\u0026plusmn;\u0026thinsp;1.3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c5\"\u003e \u003cp\u003e8.1\u0026thinsp;\u0026plusmn;\u0026thinsp;1.1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c6\"\u003e \u003cp\u003e8.4\u0026thinsp;\u0026plusmn;\u0026thinsp;0.9\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c7\"\u003e \u003cp\u003e8.2\u0026thinsp;\u0026plusmn;\u0026thinsp;1.1\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eWenxin Yiyan\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c3\"\u003e \u003cp\u003e8.4\u0026thinsp;\u0026plusmn;\u0026thinsp;1.1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c4\"\u003e \u003cp\u003e7.9\u0026thinsp;\u0026plusmn;\u0026thinsp;1.5\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c5\"\u003e \u003cp\u003e8.0\u0026thinsp;\u0026plusmn;\u0026thinsp;1.4\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c6\"\u003e \u003cp\u003e8.1\u0026thinsp;\u0026plusmn;\u0026thinsp;1.2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c7\"\u003e \u003cp\u003e8.1\u0026thinsp;\u0026plusmn;\u0026thinsp;1.3\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eGrok 3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c3\"\u003e 
\u003cp\u003e8.2\u0026thinsp;\u0026plusmn;\u0026thinsp;0.9\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c4\"\u003e \u003cp\u003e7.6\u0026thinsp;\u0026plusmn;\u0026thinsp;1.0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c5\"\u003e \u003cp\u003e7.7\u0026thinsp;\u0026plusmn;\u0026thinsp;1.1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c6\"\u003e \u003cp\u003e8.0\u0026thinsp;\u0026plusmn;\u0026thinsp;1.0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c7\"\u003e \u003cp\u003e7.9\u0026thinsp;\u0026plusmn;\u0026thinsp;1.0\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eTongyi Qianwen\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c3\"\u003e \u003cp\u003e8.0\u0026thinsp;\u0026plusmn;\u0026thinsp;1.2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c4\"\u003e \u003cp\u003e7.5\u0026thinsp;\u0026plusmn;\u0026thinsp;1.4\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c5\"\u003e \u003cp\u003e7.6\u0026thinsp;\u0026plusmn;\u0026thinsp;1.3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c6\"\u003e \u003cp\u003e7.9\u0026thinsp;\u0026plusmn;\u0026thinsp;1.1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c7\"\u003e \u003cp\u003e7.7\u0026thinsp;\u0026plusmn;\u0026thinsp;1.3\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"6\" rowspan=\"7\"\u003e \u003cp\u003e\u003cb\u003eManagement of adverse drug reactions\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eGPT-5.1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" 
colname=\"c3\"\u003e \u003cp\u003e9.4\u0026thinsp;\u0026plusmn;\u0026thinsp;0.5\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c4\"\u003e \u003cp\u003e9.5\u0026thinsp;\u0026plusmn;\u0026thinsp;0.7\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c5\"\u003e \u003cp\u003e9.3\u0026thinsp;\u0026plusmn;\u0026thinsp;0.8\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c6\"\u003e \u003cp\u003e9.1\u0026thinsp;\u0026plusmn;\u0026thinsp;0.9\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c7\"\u003e \u003cp\u003e9.3\u0026thinsp;\u0026plusmn;\u0026thinsp;0.7\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eKimi K2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c3\"\u003e \u003cp\u003e9.0\u0026thinsp;\u0026plusmn;\u0026thinsp;0.7\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c4\"\u003e \u003cp\u003e8.7\u0026thinsp;\u0026plusmn;\u0026thinsp;1.1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c5\"\u003e \u003cp\u003e8.7\u0026thinsp;\u0026plusmn;\u0026thinsp;0.9\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c6\"\u003e \u003cp\u003e8.2\u0026thinsp;\u0026plusmn;\u0026thinsp;0.7\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c7\"\u003e \u003cp\u003e8.6\u0026thinsp;\u0026plusmn;\u0026thinsp;1.0\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eDeepSeek\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c3\"\u003e \u003cp\u003e8.7\u0026thinsp;\u0026plusmn;\u0026thinsp;0.8\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" 
colname=\"c4\"\u003e \u003cp\u003e8.6\u0026thinsp;\u0026plusmn;\u0026thinsp;1.0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c5\"\u003e \u003cp\u003e8.5\u0026thinsp;\u0026plusmn;\u0026thinsp;1.1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c6\"\u003e \u003cp\u003e8.3\u0026thinsp;\u0026plusmn;\u0026thinsp;0.6\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c7\"\u003e \u003cp\u003e8.5\u0026thinsp;\u0026plusmn;\u0026thinsp;0.9\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eGemini 3.0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c3\"\u003e \u003cp\u003e8.4\u0026thinsp;\u0026plusmn;\u0026thinsp;0.9\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c4\"\u003e \u003cp\u003e8.1\u0026thinsp;\u0026plusmn;\u0026thinsp;1.2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c5\"\u003e \u003cp\u003e8.2\u0026thinsp;\u0026plusmn;\u0026thinsp;1.0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c6\"\u003e \u003cp\u003e8.5\u0026thinsp;\u0026plusmn;\u0026thinsp;0.8\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c7\"\u003e \u003cp\u003e8.3\u0026thinsp;\u0026plusmn;\u0026thinsp;1.0\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eWenxin Yiyan\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c3\"\u003e \u003cp\u003e8.5\u0026thinsp;\u0026plusmn;\u0026thinsp;1.0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c4\"\u003e \u003cp\u003e8.0\u0026thinsp;\u0026plusmn;\u0026thinsp;1.3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" 
colname=\"c5\"\u003e \u003cp\u003e8.1\u0026thinsp;\u0026plusmn;\u0026thinsp;1.2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c6\"\u003e \u003cp\u003e8.2\u0026thinsp;\u0026plusmn;\u0026thinsp;1.0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c7\"\u003e \u003cp\u003e8.2\u0026thinsp;\u0026plusmn;\u0026thinsp;1.1\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eGrok 3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c3\"\u003e \u003cp\u003e8.3\u0026thinsp;\u0026plusmn;\u0026thinsp;0.8\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c4\"\u003e \u003cp\u003e7.7\u0026thinsp;\u0026plusmn;\u0026thinsp;0.9\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c5\"\u003e \u003cp\u003e7.8\u0026thinsp;\u0026plusmn;\u0026thinsp;1.0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c6\"\u003e \u003cp\u003e8.1\u0026thinsp;\u0026plusmn;\u0026thinsp;0.9\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c7\"\u003e \u003cp\u003e8.0\u0026thinsp;\u0026plusmn;\u0026thinsp;0.9\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eTongyi Qianwen\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c3\"\u003e \u003cp\u003e8.1\u0026thinsp;\u0026plusmn;\u0026thinsp;1.1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c4\"\u003e \u003cp\u003e7.6\u0026thinsp;\u0026plusmn;\u0026thinsp;1.2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c5\"\u003e \u003cp\u003e7.7\u0026thinsp;\u0026plusmn;\u0026thinsp;1.1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" 
colname=\"c6\"\u003e \u003cp\u003e8.0\u0026thinsp;\u0026plusmn;\u0026thinsp;1.0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c7\"\u003e \u003cp\u003e7.8\u0026thinsp;\u0026plusmn;\u0026thinsp;1.1\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"6\" rowspan=\"7\"\u003e \u003cp\u003e\u003cb\u003eDrug interactions\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eGPT-5.1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c3\"\u003e \u003cp\u003e9.1\u0026thinsp;\u0026plusmn;\u0026thinsp;0.7\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c4\"\u003e \u003cp\u003e9.2\u0026thinsp;\u0026plusmn;\u0026thinsp;0.9\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c5\"\u003e \u003cp\u003e9.0\u0026thinsp;\u0026plusmn;\u0026thinsp;1.0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c6\"\u003e \u003cp\u003e8.8\u0026thinsp;\u0026plusmn;\u0026thinsp;1.1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c7\"\u003e \u003cp\u003e9.0\u0026thinsp;\u0026plusmn;\u0026thinsp;0.9\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eKimi K2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c3\"\u003e \u003cp\u003e8.7\u0026thinsp;\u0026plusmn;\u0026thinsp;0.9\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c4\"\u003e \u003cp\u003e8.4\u0026thinsp;\u0026plusmn;\u0026thinsp;1.3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c5\"\u003e \u003cp\u003e8.4\u0026thinsp;\u0026plusmn;\u0026thinsp;1.1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" 
colname=\"c6\"\u003e \u003cp\u003e7.9\u0026thinsp;\u0026plusmn;\u0026thinsp;0.9\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c7\"\u003e \u003cp\u003e8.3\u0026thinsp;\u0026plusmn;\u0026thinsp;1.2\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eDeepSeek\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c3\"\u003e \u003cp\u003e8.4\u0026thinsp;\u0026plusmn;\u0026thinsp;1.0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c4\"\u003e \u003cp\u003e8.3\u0026thinsp;\u0026plusmn;\u0026thinsp;1.2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c5\"\u003e \u003cp\u003e8.2\u0026thinsp;\u0026plusmn;\u0026thinsp;1.3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c6\"\u003e \u003cp\u003e8.0\u0026thinsp;\u0026plusmn;\u0026thinsp;0.8\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c7\"\u003e \u003cp\u003e8.2\u0026thinsp;\u0026plusmn;\u0026thinsp;1.1\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eGemini 3.0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c3\"\u003e \u003cp\u003e8.1\u0026thinsp;\u0026plusmn;\u0026thinsp;1.1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c4\"\u003e \u003cp\u003e7.8\u0026thinsp;\u0026plusmn;\u0026thinsp;1.4\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c5\"\u003e \u003cp\u003e7.9\u0026thinsp;\u0026plusmn;\u0026thinsp;1.2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c6\"\u003e \u003cp\u003e8.2\u0026thinsp;\u0026plusmn;\u0026thinsp;1.0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" 
colname=\"c7\"\u003e \u003cp\u003e8.0\u0026thinsp;\u0026plusmn;\u0026thinsp;1.2\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eWenxin Yiyan\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c3\"\u003e \u003cp\u003e8.2\u0026thinsp;\u0026plusmn;\u0026thinsp;1.2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c4\"\u003e \u003cp\u003e7.7\u0026thinsp;\u0026plusmn;\u0026thinsp;1.6\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c5\"\u003e \u003cp\u003e7.8\u0026thinsp;\u0026plusmn;\u0026thinsp;1.5\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c6\"\u003e \u003cp\u003e7.9\u0026thinsp;\u0026plusmn;\u0026thinsp;1.3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c7\"\u003e \u003cp\u003e7.9\u0026thinsp;\u0026plusmn;\u0026thinsp;1.4\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eGrok 3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c3\"\u003e \u003cp\u003e8.0\u0026thinsp;\u0026plusmn;\u0026thinsp;1.0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c4\"\u003e \u003cp\u003e7.4\u0026thinsp;\u0026plusmn;\u0026thinsp;1.1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c5\"\u003e \u003cp\u003e7.5\u0026thinsp;\u0026plusmn;\u0026thinsp;1.2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c6\"\u003e \u003cp\u003e7.8\u0026thinsp;\u0026plusmn;\u0026thinsp;1.1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c7\"\u003e \u003cp\u003e7.7\u0026thinsp;\u0026plusmn;\u0026thinsp;1.1\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" 
colname=\"c2\"\u003e \u003cp\u003eTongyi Qianwen\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c3\"\u003e \u003cp\u003e7.8\u0026thinsp;\u0026plusmn;\u0026thinsp;1.3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c4\"\u003e \u003cp\u003e7.3\u0026thinsp;\u0026plusmn;\u0026thinsp;1.5\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c5\"\u003e \u003cp\u003e7.4\u0026thinsp;\u0026plusmn;\u0026thinsp;1.4\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c6\"\u003e \u003cp\u003e7.7\u0026thinsp;\u0026plusmn;\u0026thinsp;1.2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c7\"\u003e \u003cp\u003e7.5\u0026thinsp;\u0026plusmn;\u0026thinsp;1.4\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"6\" rowspan=\"7\"\u003e \u003cp\u003e\u003cb\u003eSafety evaluation of specific drugs during pregnancy or lactation\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eGPT-5.1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c3\"\u003e \u003cp\u003e9.0\u0026thinsp;\u0026plusmn;\u0026thinsp;0.8\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c4\"\u003e \u003cp\u003e9.1\u0026thinsp;\u0026plusmn;\u0026thinsp;0.9\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c5\"\u003e \u003cp\u003e8.9\u0026thinsp;\u0026plusmn;\u0026thinsp;1.0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c6\"\u003e \u003cp\u003e8.7\u0026thinsp;\u0026plusmn;\u0026thinsp;1.1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c7\"\u003e \u003cp\u003e8.9\u0026thinsp;\u0026plusmn;\u0026thinsp;0.9\u003c/p\u003e \u003c/td\u003e 
\u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eKimi K2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c3\"\u003e \u003cp\u003e8.6\u0026thinsp;\u0026plusmn;\u0026thinsp;1.0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c4\"\u003e \u003cp\u003e8.3\u0026thinsp;\u0026plusmn;\u0026thinsp;1.4\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c5\"\u003e \u003cp\u003e8.3\u0026thinsp;\u0026plusmn;\u0026thinsp;1.2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c6\"\u003e \u003cp\u003e7.8\u0026thinsp;\u0026plusmn;\u0026thinsp;1.0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c7\"\u003e \u003cp\u003e8.2\u0026thinsp;\u0026plusmn;\u0026thinsp;1.3\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eDeepSeek\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c3\"\u003e \u003cp\u003e8.3\u0026thinsp;\u0026plusmn;\u0026thinsp;1.1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c4\"\u003e \u003cp\u003e8.2\u0026thinsp;\u0026plusmn;\u0026thinsp;1.3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c5\"\u003e \u003cp\u003e8.1\u0026thinsp;\u0026plusmn;\u0026thinsp;1.4\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c6\"\u003e \u003cp\u003e7.9\u0026thinsp;\u0026plusmn;\u0026thinsp;0.9\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c7\"\u003e \u003cp\u003e8.1\u0026thinsp;\u0026plusmn;\u0026thinsp;1.2\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eGemini 3.0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" 
char=\"\u0026plusmn;\" colname=\"c3\"\u003e \u003cp\u003e8.0\u0026thinsp;\u0026plusmn;\u0026thinsp;1.2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c4\"\u003e \u003cp\u003e7.7\u0026thinsp;\u0026plusmn;\u0026thinsp;1.5\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c5\"\u003e \u003cp\u003e7.8\u0026thinsp;\u0026plusmn;\u0026thinsp;1.3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c6\"\u003e \u003cp\u003e8.1\u0026thinsp;\u0026plusmn;\u0026thinsp;1.1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c7\"\u003e \u003cp\u003e7.9\u0026thinsp;\u0026plusmn;\u0026thinsp;1.3\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eWenxin Yiyan\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c3\"\u003e \u003cp\u003e8.3\u0026thinsp;\u0026plusmn;\u0026thinsp;1.1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c4\"\u003e \u003cp\u003e8.0\u0026thinsp;\u0026plusmn;\u0026thinsp;1.4\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c5\"\u003e \u003cp\u003e8.1\u0026thinsp;\u0026plusmn;\u0026thinsp;1.3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c6\"\u003e \u003cp\u003e8.1\u0026thinsp;\u0026plusmn;\u0026thinsp;1.1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c7\"\u003e \u003cp\u003e8.1\u0026thinsp;\u0026plusmn;\u0026thinsp;1.2\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eGrok 3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c3\"\u003e \u003cp\u003e7.9\u0026thinsp;\u0026plusmn;\u0026thinsp;1.0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" 
char=\"\u0026plusmn;\" colname=\"c4\"\u003e \u003cp\u003e7.3\u0026thinsp;\u0026plusmn;\u0026thinsp;1.2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c5\"\u003e \u003cp\u003e7.4\u0026thinsp;\u0026plusmn;\u0026thinsp;1.3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c6\"\u003e \u003cp\u003e7.7\u0026thinsp;\u0026plusmn;\u0026thinsp;1.1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c7\"\u003e \u003cp\u003e7.6\u0026thinsp;\u0026plusmn;\u0026thinsp;1.2\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eTongyi Qianwen\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c3\"\u003e \u003cp\u003e8.0\u0026thinsp;\u0026plusmn;\u0026thinsp;1.2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c4\"\u003e \u003cp\u003e7.8\u0026thinsp;\u0026plusmn;\u0026thinsp;1.4\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c5\"\u003e \u003cp\u003e7.9\u0026thinsp;\u0026plusmn;\u0026thinsp;1.3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c6\"\u003e \u003cp\u003e7.9\u0026thinsp;\u0026plusmn;\u0026thinsp;1.1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c7\"\u003e \u003cp\u003e7.9\u0026thinsp;\u0026plusmn;\u0026thinsp;1.3\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"6\" rowspan=\"7\"\u003e \u003cp\u003e\u003cb\u003eComplex therapeutic decision-making based on specific clinical scenarios\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eGPT-5.1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c3\"\u003e 
\u003cp\u003e8.8\u0026thinsp;\u0026plusmn;\u0026thinsp;0.9\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c4\"\u003e \u003cp\u003e8.9\u0026thinsp;\u0026plusmn;\u0026thinsp;1.0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c5\"\u003e \u003cp\u003e8.7\u0026thinsp;\u0026plusmn;\u0026thinsp;1.1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c6\"\u003e \u003cp\u003e8.5\u0026thinsp;\u0026plusmn;\u0026thinsp;1.2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c7\"\u003e \u003cp\u003e8.7\u0026thinsp;\u0026plusmn;\u0026thinsp;1.0\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eKimi K2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c3\"\u003e \u003cp\u003e8.2\u0026thinsp;\u0026plusmn;\u0026thinsp;1.2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c4\"\u003e \u003cp\u003e7.2\u0026thinsp;\u0026plusmn;\u0026thinsp;1.6\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c5\"\u003e \u003cp\u003e7.7\u0026thinsp;\u0026plusmn;\u0026thinsp;1.4\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c6\"\u003e \u003cp\u003e7.2\u0026thinsp;\u0026plusmn;\u0026thinsp;1.2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c7\"\u003e \u003cp\u003e7.6\u0026thinsp;\u0026plusmn;\u0026thinsp;1.4\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eDeepSeek\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c3\"\u003e \u003cp\u003e8.0\u0026thinsp;\u0026plusmn;\u0026thinsp;1.3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c4\"\u003e 
\u003cp\u003e7.7\u0026thinsp;\u0026plusmn;\u0026thinsp;1.5\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c5\"\u003e \u003cp\u003e7.5\u0026thinsp;\u0026plusmn;\u0026thinsp;1.6\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c6\"\u003e \u003cp\u003e7.5\u0026thinsp;\u0026plusmn;\u0026thinsp;1.1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c7\"\u003e \u003cp\u003e7.7\u0026thinsp;\u0026plusmn;\u0026thinsp;1.4\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eGemini 3.0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c3\"\u003e \u003cp\u003e7.5\u0026thinsp;\u0026plusmn;\u0026thinsp;1.4\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c4\"\u003e \u003cp\u003e7.2\u0026thinsp;\u0026plusmn;\u0026thinsp;1.7\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c5\"\u003e \u003cp\u003e7.3\u0026thinsp;\u0026plusmn;\u0026thinsp;1.5\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c6\"\u003e \u003cp\u003e7.8\u0026thinsp;\u0026plusmn;\u0026thinsp;1.3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c7\"\u003e \u003cp\u003e7.4\u0026thinsp;\u0026plusmn;\u0026thinsp;1.5\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eWenxin Yiyan\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c3\"\u003e \u003cp\u003e7.7\u0026thinsp;\u0026plusmn;\u0026thinsp;1.4\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c4\"\u003e \u003cp\u003e7.2\u0026thinsp;\u0026plusmn;\u0026thinsp;1.7\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c5\"\u003e 
\u003cp\u003e7.4\u0026thinsp;\u0026plusmn;\u0026thinsp;1.6\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c6\"\u003e \u003cp\u003e7.5\u0026thinsp;\u0026plusmn;\u0026thinsp;1.4\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c7\"\u003e \u003cp\u003e7.5\u0026thinsp;\u0026plusmn;\u0026thinsp;1.5\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eGrok 3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c3\"\u003e \u003cp\u003e7.5\u0026thinsp;\u0026plusmn;\u0026thinsp;1.1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c4\"\u003e \u003cp\u003e6.8\u0026thinsp;\u0026plusmn;\u0026thinsp;1.3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c5\"\u003e \u003cp\u003e6.9\u0026thinsp;\u0026plusmn;\u0026thinsp;1.4\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c6\"\u003e \u003cp\u003e7.3\u0026thinsp;\u0026plusmn;\u0026thinsp;1.2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c7\"\u003e \u003cp\u003e7.1\u0026thinsp;\u0026plusmn;\u0026thinsp;1.3\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eTongyi Qianwen\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c3\"\u003e \u003cp\u003e7.2\u0026thinsp;\u0026plusmn;\u0026thinsp;1.5\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c4\"\u003e \u003cp\u003e6.5\u0026thinsp;\u0026plusmn;\u0026thinsp;1.7\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c5\"\u003e \u003cp\u003e6.7\u0026thinsp;\u0026plusmn;\u0026thinsp;1.6\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c6\"\u003e 
\u003cp\u003e7.1\u0026thinsp;\u0026plusmn;\u0026thinsp;1.4\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\"\u0026plusmn;\" colname=\"c7\"\u003e \u003cp\u003e6.9\u0026thinsp;\u0026plusmn;\u0026thinsp;1.6\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003ctfoot\u003e \u003ctr\u003e\u003ctd colspan=\"7\"\u003eData are presented as mean\u0026thinsp;\u0026plusmn;\u0026thinsp;standard deviation (SD).\u003c/td\u003e\u003c/tr\u003e \u003ctr\u003e\u003ctd colspan=\"7\"\u003eConsultation questions were categorized into drug administration and dosage guidance; management of adverse drug reactions; drug interactions; safety evaluation of specific drugs during pregnancy or lactation; and complex therapeutic decision-making based on specific clinical scenarios.\u003c/td\u003e\u003c/tr\u003e \u003ctr\u003e\u003ctd colspan=\"7\"\u003eRelevance reflects the extent to which responses directly addressed the consultation question.\u003c/td\u003e\u003c/tr\u003e \u003ctr\u003e\u003ctd colspan=\"7\"\u003eAccuracy reflects concordance with current clinical guidelines, authoritative references, and accepted pharmacotherapy principles.\u003c/td\u003e\u003c/tr\u003e \u003ctr\u003e\u003ctd colspan=\"7\"\u003eUsefulness reflects the degree to which responses provided specific, actionable, and clinically applicable recommendations.\u003c/td\u003e\u003c/tr\u003e \u003ctr\u003e\u003ctd colspan=\"7\"\u003eEmpathy reflects the use of supportive, non-judgmental, and patient-centered language appropriate for perinatal medication consultations.\u003c/td\u003e\u003c/tr\u003e \u003ctr\u003e\u003ctd colspan=\"7\"\u003eOverall score represents the mean of the four evaluation dimensions.\u003c/td\u003e\u003c/tr\u003e \u003ctr\u003e\u003ctd colspan=\"7\"\u003eLLM, large language model.\u003c/td\u003e\u003c/tr\u003e\u003c/tfoot\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003eThe performance gap between models widened further in 
complex clinical scenarios involving comorbidities, polypharmacy, or explicit benefit\u0026ndash;risk trade-offs, such as the management of gestational asthma. In these scenarios, GPT-5.1 maintained relatively high accuracy (mean accuracy 8.8), frequently providing structured responses that addressed both maternal disease control and fetal safety considerations. For consultation questions involving traditional Chinese medicine, domestic models such as Wenxin Yiyan and Tongyi Qianwen demonstrated relative advantages in relevance and usefulness, occasionally outperforming international models by incorporating locally contextualized clinical information.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec15\" class=\"Section2\"\u003e \u003ch2\u003eQualitative Findings from Expert Evaluation\u003c/h2\u003e \u003cp\u003eQualitative feedback from the expert panel, based on consensus discussion following independent scoring, revealed recurring strengths and limitations across the model outputs. 
Commonly identified weaknesses included (1) incomplete or overly cautious recommendations, such as labeling relatively safe medications as \u0026ldquo;not recommended\u0026rdquo; without adequate benefit\u0026ndash;risk contextualization; (2) excessive use of caveats, which reduced practical usefulness by failing to offer actionable alternatives; (3) lack of explicit reference to authoritative sources, including clinical guidelines or drug safety databases; and (4) occasional logical or factual inconsistencies, particularly in responses generated by lower-performing models.\u003c/p\u003e \u003cp\u003eThe identified strengths included (1) integration of multidisciplinary medical knowledge, spanning pharmacology and obstetrics; (2) clear and structured response formats, such as bullet points and subheadings, which improved readability; and (3) emerging empathetic communication, particularly in higher-performing models, which may help alleviate patient anxiety during perinatal medication consultations.\u003c/p\u003e \u003cp\u003eSeveral of these limitations have potential implications for patient safety. Overly cautious or non-specific advice may contribute to unnecessary discontinuation of essential maternal medications, whereas incomplete risk communication may delay appropriate clinical consultation. These findings demonstrate the importance of professional oversight when applying LLMs to high-risk perinatal medication contexts.\u003c/p\u003e \u003c/div\u003e"},{"header":"Discussion","content":"\u003cp\u003eThis cross-sectional evaluation provides a systematic comparison of seven leading LLMs in response to real-world perinatal medication consultation queries derived from routine clinical pharmacy practice. 
The findings suggest that LLMs may serve as adjunctive informational and triage-support tools within pharmacist-led consultation services but are not suitable for independent medication assessment or counseling, particularly in the context of pregnancy and lactation, where medication-related decisions directly affect both maternal and fetal outcomes.\u003c/p\u003e \u003cp\u003eIn addition, these results extend prior evaluations of LLMs in general healthcare settings, including chronic disease management, medication counseling, and clinical decision support, by focusing on the high-risk and ethically sensitive domain of perinatal pharmacotherapy. Previous studies have shown that LLMs can achieve expert-level performance in structured clinical tasks [\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e], such as diabetes education, celiac disease counseling, and general medication advice, particularly when questions are well-defined and evidence-based [\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e]. Consistent with these reports, the top-performing models in the present study, particularly GPT-5.1, achieved high scores for accuracy and usefulness, reflecting their ability to synthesize guideline-based recommendations. This performance aligns with the documented strengths of LLMs in rapid information retrieval, integration of heterogeneous data sources, and structured output generation, which may enhance efficiency in clinical pharmacy services without replacing the contextual judgment and accountability of human clinicians [\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eNevertheless, performance disparities became more pronounced in clinically complex scenarios, including those involving comorbidities, polypharmacy, or explicit benefit\u0026ndash;risk trade-offs. 
Similar limitations have been reported in the evaluation of LLMs applied to other high-stakes specialties, such as oncology and radiology. In the present study, lower-performing models were more prone to generating factual inaccuracies or overly cautious, non-committal recommendations [\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e]. In perinatal care, such tendencies are particularly concerning, as excessive risk aversion may lead to undertreatment of maternal conditions and consequently increase the risk of adverse maternal and fetal outcomes. These findings highlight the context-dependent nature of LLM performance and underscore the necessity of rigorous domain-specific evaluation before implementation in high-risk clinical settings [\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e]. Notably, medication classes commonly encountered during pregnancy and lactation, including nonsteroidal anti-inflammatory drugs, psychotropic agents, hormonal therapies, and medications requiring individualized benefit\u0026ndash;risk assessment, represent scenarios in which unsupervised LLM-generated advice may be especially inappropriate. In such contexts, even minor omissions or disproportionate caution may have clinically meaningful consequences.\u003c/p\u003e \u003cp\u003eSeveral factors may underlie the observed limitations, particularly structural constraints in the available training data. Ethical and practical considerations have resulted in a paucity of randomized controlled trials involving pregnant and lactating populations, with much of the evidence for perinatal medication safety relying on observational studies that are vulnerable to confounding and selection bias [\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e]. 
Therefore, LLMs trained on such studies may reproduce the existing uncertainty and inconsistency. These challenges are further amplified in settings where high-quality perinatal pharmacological data remain limited. Accordingly, the marginal advantages observed for domestic models in queries related to traditional Chinese medicine likely reflect greater linguistic and contextual alignment with local clinical practice rather than intrinsic model superiority, highlighting the importance of localized data curation to support equitable and safe LLM deployment across healthcare systems [\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eEmpathy emerged as an additional differentiating dimension among the evaluated models. GPT-5.1 consistently employed supportive and non-judgmental language, achieving higher empathy scores and potentially addressing maternal anxiety, which is a clinically relevant concern during the perinatal period. While empathetic communication using digital tools may enhance patient engagement and trust [\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e], even the highest-performing models frequently relied on generic expressions. Further prospective research is required to determine whether LLM-generated empathy translates into measurable improvements in patient experience, medication adherence, or clinical outcomes.\u003c/p\u003e \u003cp\u003eThese findings should be interpreted within the broader context of healthcare access and workforce constraints. Shortages of trained clinical pharmacists and regional disparities in maternal healthcare services continue to pose challenges, particularly in rural and underserved areas [\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e]. 
In this context, LLMs offer a potential means of supporting scalable, low-cost, and timely access to medication-related information, thereby complementing existing perinatal pharmacy services [\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e]. However, this potential can only be realized if LLM deployment is accompanied by rigorous governance frameworks, explicit scope limitations, and sustained human oversight. While domestic models show promise in addressing locally relevant practices, including traditional medicine, substantial improvements in evidence integration, risk stratification, and transparency are required to approach the reliability needed for routine clinical pharmacy support.\u003c/p\u003e \u003cp\u003eFrom a practical implementation perspective, a clear differentiation between professional- and patient-facing applications is essential. Pharmacists may reasonably employ LLMs as supervised decision support or information-triage tools, whereas direct patient use in perinatal settings should be restricted to educational purposes with explicit safeguards against independent medication decision-making.\u003c/p\u003e \u003cp\u003eSeveral limitations of this study should be considered when interpreting these findings. First, the cross-sectional design reflects model performance within a defined time frame, and the results may change as LLMs continue to evolve. Second, single-response sampling may not fully capture intra-model variability; however, this approach was intentionally selected to approximate real-world first-use consultation scenarios. Third, although the consultation questions were derived from routine clinical practice, they may not fully represent the diversity of perinatal medication concerns across different regions or healthcare settings. 
Finally, while the responses were evaluated by experienced clinical pharmacologists using a blinded framework, reliance on a single expert panel from one institution may limit generalizability.\u003c/p\u003e"},{"header":"Conclusion","content":"\u003cp\u003eAlthough large language models demonstrate promising performance in structured evaluations of perinatal medication information, their current capabilities do not support independent use in medication decision-making or counseling during pregnancy or lactation. At present, LLMs should be regarded solely as supervised, adjunctive tools that may assist clinical pharmacists in information retrieval and preliminary risk assessment rather than as substitutes for professional judgment.\u003c/p\u003e \u003cp\u003e Future development and implementation of LLMs in perinatal care will require continuous integration of authoritative clinical guidelines, high-quality evidence sources, and rigorously curated pharmacotherapy data to improve response accuracy, consistency, and contextual appropriateness. Only through iterative refinement, domain-specific validation, and robust governance frameworks can LLMs evolve to provide meaningful support for perinatal pharmacy services.\u003c/p\u003e \u003cp\u003eAccordingly, any integration of LLMs into perinatal medication consultation workflows should prioritize human oversight, a clearly defined scope of use, and regulatory safeguards, ensuring that maternal and fetal safety remain central to clinical decision-making.\u003c/p\u003e "},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eAcknowledgments\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eNone.\u003c/p\u003e\n\n\u003cp\u003e\u003cstrong\u003eFunding\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThis study was supported by the Beijing Municipal Science and Technology Commission (Grant No. 
7244462).\u003c/p\u003e\n\n\u003cp\u003e\u003cstrong\u003eCompeting Interests\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe authors declare that they have no competing interests.\u003c/p\u003e\n\n\u003cp\u003e\u003cstrong\u003eAuthor Contributions\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eR. W. and Y. L. contributed to the methodology, software, writing, reviewing, and editing. R. W. drafted the manuscript. X. F. and Y. L. contributed to data curation. R. W. and X. F. performed visualization. X. F. was responsible for conceptualization, resources, supervision, project administration, and funding acquisition. All the authors have read and approved the final manuscript.\u003c/p\u003e\n\n\u003cp\u003e\u003cstrong\u003eData Availability\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe datasets generated and analyzed in the current study are available from the corresponding author upon reasonable request.\u003c/p\u003e\n\n\u003cp\u003e\u003cstrong\u003eEthics Approval\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe study protocol was reviewed and approved by the Ethics Committee of the Beijing Obstetrics and Gynecology Hospital, Capital Medical University (Approval No.: 2024-KY-035-01). All consultation records were fully anonymized prior to the analysis, and the study was conducted in accordance with the principles of the Declaration of Helsinki.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eJordan S, Bromley R, Damase-Michel C, et al. Breastfeeding, pregnancy, medicines, neurodevelopment, and population databases: the information desert. Int Breastfeed J. 2022;17(1):55. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1186/s13006-022-00494-5\u003c/span\u003e\u003cspan address=\"10.1186/s13006-022-00494-5\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHuybrechts KF, Bateman BT, Hern\u0026aacute;ndez-D\u0026iacute;az S. Modern evidence generation on medication effectiveness and safety during pregnancy: study design considerations. Clin Pharmacol Ther. 2025;117:895\u0026ndash;909. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1002/cpt.3598\u003c/span\u003e\u003cspan address=\"10.1002/cpt.3598\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDesaunay P, Eude LG, Dreyfus M, et al. Benefits and risks of antidepressant drugs during pregnancy: a systematic review of meta-analyses. Paediatr Drugs. 2023;25(3):247\u0026ndash;65. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1007/s40272-023-00561-2\u003c/span\u003e\u003cspan address=\"10.1007/s40272-023-00561-2\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHoward LM, Molyneaux E, Dennis CL, et al. Non-psychotic mental disorders in the perinatal period. Lancet. 2014;384:1775\u0026ndash;88. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/S0140-6736(14)61276-9\u003c/span\u003e\u003cspan address=\"10.1016/S0140-6736(14)61276-9\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eElliott RA, Lee CY, Beanland C, et al. 
Development of a clinical pharmacy model within an Australian home nursing service using co-creation and participatory action research: the Visiting Pharmacist (ViP) study. BMJ Open. 2017;7(11):e018722. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1136/bmjopen-2017-018722\u003c/span\u003e\u003cspan address=\"10.1136/bmjopen-2017-018722\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDamkier P, Huybrechts KF, Nordeng H. Big data in the assessment of medication safety in pregnancy: opportunities and challenges. Paediatr Drugs. 2025;27:673\u0026ndash;7. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1007/s40272-025-00718-1\u003c/span\u003e\u003cspan address=\"10.1007/s40272-025-00718-1\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHabets PC, van IJzendoorn DG, Vinkers CH, et al. Development and validation of a machine-learning algorithm to predict the relevance of scientific articles within the field of teratology. Reprod Toxicol. 2022;113:150\u0026ndash;4. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/j.reprotox.2022.09.001\u003c/span\u003e\u003cspan address=\"10.1016/j.reprotox.2022.09.001\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eTopol EJ. High-performance medicine: the convergence of human and artificial intelligence. Nat Med. 2019;25:44\u0026ndash;56. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1038/s41591-018-0300-7\u003c/span\u003e\u003cspan address=\"10.1038/s41591-018-0300-7\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLiu J, Wang C, Liu S. Utility of ChatGPT in clinical practice. J Med Internet Res. 2023;25:e48568. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.2196/48568\u003c/span\u003e\u003cspan address=\"10.2196/48568\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAyers JW, Poliak A, Dredze M, et al. Comparing physician and artificial intelligence chatbot responses to patient questions posted to a public social media forum. JAMA Intern Med. 2023;183(6):589\u0026ndash;96. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1001/jamainternmed.2023.1838\u003c/span\u003e\u003cspan address=\"10.1001/jamainternmed.2023.1838\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHu YJ, Said JM, Cheong JLY. Rethinking medication safety in pregnancy and infancy: how target trial emulation and real-world data bridge the evidence gap. J Clin Epidemiol. 2025;181:111747. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/j.jclinepi.2025.111747\u003c/span\u003e\u003cspan address=\"10.1016/j.jclinepi.2025.111747\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKell G, Roberts A, Umansky S, et al. Question answering systems for health professionals at the point of care\u0026mdash;a systematic review. J Am Med Inf Assoc. 2024;31(4):1009\u0026ndash;24. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1093/jamia/ocae015\u003c/span\u003e\u003cspan address=\"10.1093/jamia/ocae015\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eNori H, King N, McKinney SM, et al. Capabilities of GPT-4 on medical challenge problems. arXiv:2303.13375 [Preprint]. 2023. Available from: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://arxiv.org/abs/2303.13375\u003c/span\u003e\u003cspan address=\"https://arxiv.org/abs/2303.13375\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e. Accessed 30 Dec 2025.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSheng B, Guan Z, Lim LL, et al. Large language models for diabetes care: potentials and prospects. Sci Bull. 2024;69(5):583\u0026ndash;8. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/j.scib.2024.01.004\u003c/span\u003e\u003cspan address=\"10.1016/j.scib.2024.01.004\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eJi Z, Lee N, Frieske R, et al. Survey of hallucination in natural language generation. ACM Comput Surv. 2023;55(12):1\u0026ndash;38. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1145/3571730\u003c/span\u003e\u003cspan address=\"10.1145/3571730\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eJi Z, Lee N, Frieske R, et al. Survey of hallucination in natural language generation. ACM Comput Surv. 2023;55(12):1\u0026ndash;38. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1145/3571730\u003c/span\u003e\u003cspan address=\"10.1145/3571730\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eJorgensen SCJ, Miljanic S, Tabbara N, et al. Inclusion of pregnant and breastfeeding women in nonobstetrical randomized controlled trials. Am J Obstet Gynecol MFM. 2022;4(6):100700. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/j.ajogmf.2022.100700\u003c/span\u003e\u003cspan address=\"10.1016/j.ajogmf.2022.100700\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eJia Y, Wang J, Liu C, et al. The methodological quality of observational studies examining the risk of pregnancy drug use on congenital malformations needs substantial improvement: a cross-sectional survey. Drug Saf. 2024;47(11):1171\u0026ndash;88. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1007/s40264-024-01465-x\u003c/span\u003e\u003cspan address=\"10.1007/s40264-024-01465-x\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDe Vries PLM, Baud D, Baggio S, et al. Enhancing perinatal health patient information through ChatGPT\u0026mdash;an accuracy study. PEC Innov. 2025;6:100381. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/j.pecinn.2025.100381\u003c/span\u003e\u003cspan address=\"10.1016/j.pecinn.2025.100381\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMwakawanga DL, Mutagonda RF, Mlyuka HJ, et al. 
Improving the provision of clinical pharmacy services in low- and middle-income countries: a qualitative study in tertiary health facilities in Tanzania. BMJ Public Health. 2025;3(1):e001776. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1136/bmjph-2024-001776\u003c/span\u003e\u003cspan address=\"10.1136/bmjph-2024-001776\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGr\u0026uuml;nebaum A, Chervenak FA, Pollet SL, et al. The exciting potential for ChatGPT in obstetrics and gynecology. Am J Obstet Gynecol. 2023;228:696\u0026ndash;705. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/j.ajog.2023.03.009\u003c/span\u003e\u003cspan address=\"10.1016/j.ajog.2023.03.009\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ePeled T, Sela HY, Weiss A, Grisaru-Granovsky S, et al. Evaluating the validity of ChatGPT responses on common obstetric issues: potential clinical applications and implications. Int J Gynaecol Obstet. 2024;166(3):1127\u0026ndash;33. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1002/ijgo.15501\u003c/span\u003e\u003cspan address=\"10.1002/ijgo.15501\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":false,"highlight":"","institution":"","isAcceptedByJournal":true,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"
[email protected]","identity":"international-journal-of-clinical-pharmacy","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"ijcp","sideBox":"Learn more about [International Journal of Clinical Pharmacy](https://www.springer.com/journal/11096)","snPcode":"11096","submissionUrl":"https://submission.nature.com/new-submission/11096/3","title":"International Journal of Clinical Pharmacy","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"em","reportingPortfolio":"Springer Hybrid","inReviewEnabled":true,"inReviewRevisionsEnabled":false},"keywords":"Large language models, Perinatal pharmacotherapy, Medication consultation, Clinical pharmacy practice, Real-world evidence","lastPublishedDoi":"10.21203/rs.3.rs-8696873/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-8696873/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003e\u003cstrong\u003eIntroduction\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003ePerinatal medication consultation is a core clinical pharmacy service that involves a complex benefit–risk assessment for both maternal and fetal safety. 
Large language models (LLMs) have emerged as potential tools to improve access to medication information, yet their performance and safety in real-world, pharmacist-led perinatal consultation settings, particularly in non-English contexts, remain insufficiently evaluated.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAim\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eTo evaluate and compare the performance of multiple advanced large language models in addressing real-world Chinese perinatal medication consultation queries and to assess their potential role as supervised adjunctive tools within clinical pharmacy services.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eMethod\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThis cross-sectional study evaluated seven LLMs using real-world clinical data from pharmacist-led medication consultations at the Pharmacy Clinic of the Beijing Obstetrics and Gynecology Hospital, Capital Medical University. A standardized test set of 64 perinatal medication consultation questions was developed from 15,280 electronic consultation records collected between April 2014 and April 2024. The evaluated models included international (GPT-5.1, Grok 3, Gemini 3.0) and domestic (DeepSeek, Wenxin Yiyan, Kimi K2, Tongyi Qianwen) models. Senior clinical pharmacologists independently assessed responses across four dimensions—relevance, accuracy, usefulness, and empathy—using a 10-point Likert scale. The results are summarized as mean ± SD, and between-model differences were analyzed using non-parametric statistical tests.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eResults\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eAmong the 448 model-generated responses, inter-rater consistency was excellent (ICC = 0.91, 95% CI 0.88–0.94). Significant differences in the overall performance were observed among the models (p \u0026lt; 0.001). 
GPT-5.1 achieved the highest mean total score (9.1 ± 0.8), outperforming all other models (all p \u0026lt; 0.01), followed by Kimi K2 and DeepSeek. Accuracy was the primary determinant of performance differences, with GPT-5.1 showing the highest accuracy score (9.2 ± 0.7). Performance gaps were more pronounced in complex clinical scenarios involving comorbidities or benefit–risk trade-offs, whereas domestic models demonstrated relative advantages in consultations involving traditional Chinese medicine.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eConclusion\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eLLMs have demonstrated variable performance in response to perinatal medication consultation queries. While high-performing models show the potential to support pharmacist-led perinatal medication consultations by improving access to information, their current performance supports use only as supervised, adjunctive decision-support tools, rather than as independent sources of medication counseling. 
Careful governance, human oversight, and further validation of safety and reliability are required before broader integration into perinatal clinical pharmacy practices.\u003c/p\u003e","manuscriptTitle":"Performance Evaluation of Large Language Models in Real-World Perinatal Medication Consultations: A Cross-Sectional Study","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2026-02-04 08:44:01","doi":"10.21203/rs.3.rs-8696873/v1","editorialEvents":[{"type":"communityComments","content":0},{"type":"decision","content":"Revision requested","date":"2026-03-09T13:28:06+00:00","index":"","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2026-03-09T06:06:33+00:00","index":"hide","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2026-02-26T18:50:19+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"90026501922440595199676412618654731942","date":"2026-02-26T15:29:43+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"289261173083770889983412887178305118294","date":"2026-02-23T00:56:46+00:00","index":"hide","fulltext":""},{"type":"reviewersInvited","content":"","date":"2026-02-02T18:36:23+00:00","index":"","fulltext":""},{"type":"editorAssigned","content":"","date":"2026-01-27T16:07:24+00:00","index":"","fulltext":""},{"type":"checksComplete","content":"","date":"2026-01-27T16:02:15+00:00","index":"","fulltext":""},{"type":"submitted","content":"International Journal of Clinical Pharmacy","date":"2026-01-26T05:37:52+00:00","index":"","fulltext":""}],"status":"published","journal":{"display":true,"email":"
[email protected]","identity":"international-journal-of-clinical-pharmacy","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"ijcp","sideBox":"Learn more about [International Journal of Clinical Pharmacy](https://www.springer.com/journal/11096)","snPcode":"11096","submissionUrl":"https://submission.nature.com/new-submission/11096/3","title":"International Journal of Clinical Pharmacy","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"em","reportingPortfolio":"Springer Hybrid","inReviewEnabled":true,"inReviewRevisionsEnabled":false}}],"origin":"","ownerIdentity":"f717f975-43c1-46ff-ba1c-b1bff36cd63b","owner":[],"postedDate":"February 4th, 2026","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"published-in-journal","subjectAreas":[],"tags":[],"updatedAt":"2026-05-04T16:35:20+00:00","versionOfRecord":{"articleIdentity":"rs-8696873","link":"https://doi.org/10.1007/s11096-026-02138-8","journal":{"identity":"international-journal-of-clinical-pharmacy","isVorOnly":false,"title":"International Journal of Clinical Pharmacy"},"publishedOn":"2026-04-27 15:58:00","publishedOnDateReadable":"April 27th, 2026"},"versionCreatedAt":"2026-02-04 08:44:01","video":"","vorDoi":"10.1007/s11096-026-02138-8","vorDoiUrl":"https://doi.org/10.1007/s11096-026-02138-8","workflowStages":[]},"version":"v1","identity":"rs-8696873","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-8696873","identity":"rs-8696873","version":["v1"]},"buildId":"XKTyCvWXoU3ODBz1xrDgd","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}