A Systematic Review of Large Language Models in Medical Specialties: Applications, Challenges and Future Directions

2025 · doi:10.21203/rs.3.rs-5128451/v2

preprint OA: closed

Article

Asma Musabah Alkalbani, Ahmed Salim Alrawahi, Ahmad Salah, Venus Haghighi, Yang Zhang, Salam Alkindi, and Quan Z Sheng

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-5128451/v2
This work is licensed under a CC BY 4.0 License.
Status: Posted (Version 2)

Abstract

Background: Large language models (LLMs) are artificial intelligence (AI) technologies used to understand and generate text, summarize information, and interpret contextual cues. Researchers have increasingly applied LLMs to medical tasks, but their effectiveness and limitations remain uncertain, especially across medical specialties.

Objective: This review evaluates recent literature on how LLMs are used in research studies across 19 medical specialties. It also examines the challenges involved and suggests areas for future research.

Methods: Two researchers searched PubMed, Web of Science and Scopus for literature published from January 2021 to March 2024. Included studies used LLMs to perform medical tasks. Data were extracted and analyzed by five reviewers. Risk of bias was assessed using the revised tool for the quality assessment of artificial intelligence-centered diagnostic accuracy studies (QUADAS-AI).

Results: Results were synthesized through categorical analysis of evaluation metrics, impact types, and validation approaches across medical specialties. A total of 84 studies were included, originating mainly from two countries: USA (35/84) and China (16/84). Although the reviewed LLM applications spread across 19 medical specialties, 22 studies demonstrated multi-specialty applications. Aims for using LLMs included clinical natural language processing (31/84), medical decision support (20/84), medical education (15/84), diagnosis (15/84), and patient management and patient engagement (3/84). GPT-based and BERT-based LLMs were the most used, appearing in 83/84 studies. Despite reported positive impacts such as improved efficiency and diagnostic accuracy, challenges related to reliability, accuracy and ethics remain. The overall risk of bias was low in 72 studies, high in 11 studies and unclear in 3 studies.

Conclusion: GPT-based and BERT-based LLMs dominate medical specialty applications, with 98.8% (83/84) of reviewed studies using these models. Despite their potential benefits for medical process efficiency and diagnostics, a key accuracy-related finding was the substantial variability in performance among LLMs. For instance, LLM accuracy ranged from 3% in diagnostic support to over 90% in some clinical NLP tasks. Heterogeneity in how LLMs were used across diverse medical tasks and contexts prevented meaningful meta-analysis, as the studies lacked standardized methodologies, outcome measures, and implementation approaches. Substantial room therefore remains for developing domain-specific LLMs using medical data and for establishing validation standards to ensure reliability and effectiveness.

Subject areas: Health sciences/Health care; Physical sciences/Mathematics and computing/Computer science; Physical sciences/Mathematics and computing/Information technology

Keywords: Artificial intelligence (AI), clinical decision support systems, large language models (LLMs), clinical NLP, medical specialties

Additional Declarations

The authors declare no competing interests.

Supplementary Files

appendixtable011.docx: Characteristics of the reviewed studies.
appendixtable021.docx: Detailed LLM performance evaluation metrics.
appendixtable031.docx: Risk of Bias Assessment.
appendix.docx: PRISMA (Preferred Reporting Items for a Systematic Review and Meta-Analyses) 2020 checklist.
7055510530601SP1.docx: Modified QUADAS-AI framework.
7055510530591SP1.docx: Query strings of PubMed, Web of Science, Scopus.
Author affiliations

Macquarie University: Asma Musabah Alkalbani (corresponding author), Venus Haghighi, Yang Zhang, Quan Z Sheng
University of Technology and Applied Sciences: Ahmed Salim Alrawahi, Ahmad Salah
Sultan Qaboos University: Salam Alkindi

Version history

Version 1 posted 2024-11-25 (doi:10.21203/rs.3.rs-5128451/v1)
Version 2 posted 2025-04-16 (doi:10.21203/rs.3.rs-5128451/v2)

Research Square, ISSN 2693-5015 (online)
