How Far Have Large Language Models Advanced in Ophthalmology? A Systematic Review of Their Development, Evaluation, and Readiness for Clinical Use

Research Square | Systematic Review

Hyunjae Kim, Yu Yin, Zhiyuan Cao, Chen Liu, Anran Li, Zhen Chen, and 15 more

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-8819770/v1
This work is licensed under a CC BY 4.0 License.
Status: Posted (Version 1)

Abstract

Large language models (LLMs) are rapidly transforming ophthalmology, with expanding applications in patient care, clinical documentation, and medical education. Recent studies span a wide range of use cases, from early text-only applications to emerging multimodal systems that integrate ophthalmic images to support diagnosis and generate assessment and treatment plans.
Amid this rapid progress, it is critical for both researchers and clinicians to stay informed in order to guide responsible development and adoption. However, prior reviews have largely focused on narrow aspects, such as inventories of potential use cases or performance on board-style examinations, leaving the broader landscape insufficiently characterized. Key questions remain unanswered: How are LLMs in ophthalmology being developed? What applications and evaluation strategies are being pursued? And which areas are closest to real-world clinical adoption? To date, these aspects have not been comprehensively examined.

In this study, we conducted a systematic review of LLMs in ophthalmology by manually screening 1,029 studies from PubMed/PMC, Scopus, and Embase published between January 1, 2022, and April 1, 2025, identifying 91 relevant articles. To provide a standardized assessment, we introduced a structured framework that categorizes ophthalmic use cases and stratifies evaluation rigor across five levels of maturity. Each study was manually annotated using 27 structured variables spanning multiple dimensions: scope and purpose (e.g., study aim, ophthalmic subspecialty, input modality); model architecture and training (e.g., backbone LLMs, domain-specific adaptations); evaluation and validation (e.g., target applications, evaluation metrics, level of clinical validation); and resource availability (e.g., model access, licensing, dataset availability). We additionally performed a small-scale, illustrative evaluation of representative emerging models, such as GPT-5.2, gpt-oss-120B, and Gemini 3, to contextualize previously reported results on commonly used ophthalmology tasks.

The results show that most studies focused on general-purpose proprietary models, such as GPT-4 and Gemini, while fewer than 10% introduced domain-specific adaptations for ophthalmology, including only 4% that developed ophthalmology-specific architectures for text-based applications.
Multimodal LLMs remain relatively underexplored, with only 23% of studies incorporating imaging data. Evaluation practices reveal a significant translational gap: while 57.1% of studies relied on standard benchmarking and expert review, only 9.9% conducted retrospective validation using real-world clinical data, and just two studies progressed to prospective pilot evaluation. Moreover, although model performance on board-style exam and clinical vignette benchmarks has improved with newer model generations, reproducibility and transparency remain limited: only 5.5% of studies released evaluation code, and 33% used publicly available datasets. Finally, we provide a living repository to track the rapid progress of LLMs in ophthalmology for the broader research and clinical community.

Additional Declarations

The authors declare no competing interests.
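To make the proportions in the abstract easier to read against the 91 included studies, the following is a minimal arithmetic sketch. It assumes that every reported percentage uses all 91 included studies as its denominator, which the abstract does not state explicitly, so the resulting counts are approximations inferred from rounding rather than figures reported by the authors. The variable and function names are illustrative only.

```python
# Back-of-the-envelope check, NOT data from the paper: convert percentages
# quoted in the abstract into approximate study counts.
# Assumption: each percentage is taken over all 91 included studies.
TOTAL_SCREENED = 1029   # studies manually screened (from the abstract)
TOTAL_INCLUDED = 91     # relevant articles identified (from the abstract)

# Percentages exactly as quoted in the abstract; labels are shorthand.
reported_percent = {
    "benchmarking and expert review": 57.1,
    "retrospective validation": 9.9,
    "released evaluation code": 5.5,
    "used public datasets": 33.0,
    "incorporated imaging data": 23.0,
}


def approx_count(percent, total=TOTAL_INCLUDED):
    """Return the nearest whole number of studies for a given percentage."""
    return round(percent / 100 * total)


approx_counts = {label: approx_count(p) for label, p in reported_percent.items()}

# Share of screened records that were ultimately included, in percent.
inclusion_rate = round(100 * TOTAL_INCLUDED / TOTAL_SCREENED, 1)

for label, n in approx_counts.items():
    print(f"{label}: about {n} of {TOTAL_INCLUDED} studies")
print(f"inclusion rate: {inclusion_rate}% of screened records")
```

Under that assumption, the two studies that reached prospective pilot evaluation correspond to roughly 2% of the included set, which is consistent with the translational gap the abstract describes.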
Authors and affiliations

Hyunjae Kim (Yale University)
Yu Yin (University of Queensland)
Zhiyuan Cao (Yale University)
Chen Liu (Yale University)
Anran Li (Yale University)
Zhen Chen (Yale University)
Xuguang Ai (Yale University)
Younjoon Chung (Yale University)
Fan Ma (Yale University)
Xueping Peng (Yale University)
Lingfei Qian (Yale University)
Zhenyue Qin (Yale University)
Kalpana Raja (Yale University)
Yang Ren (Yale University)
Weipeng Zhou (Yale University)
Yih-Chung Tham (National University of Singapore)
Emily Y. Chew (National Institutes of Health)
Zhiyong Lu (National Institutes of Health)
Sophia Y. Wang (Stanford University)
Hua Xu (Yale University)
Qingyu Chen (Yale University), corresponding author

Posted: February 10, 2026 (Version 1)
License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
Research Square, ISSN 2693-5015 (online)

