PathoScribe: Transforming Pathology Data into a Living Library with a Unified LLM-Driven Framework for Semantic Retrieval and Clinical Integration

Research Square | Article

Abdul Rehman Akbar, Samuel Wales-McGrath, Alejadro Levya, Lina Gokhale, Rajendra Singh, Wei Chen, Anil Parwani, Muhammad Khalid Khan Niazi

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-9076912/v1
This work is licensed under a CC BY 4.0 License.
Status: Posted (Version 1)

Abstract

Pathology underpins modern diagnosis and cancer care, yet its most valuable asset, the accumulated experience encoded in millions of narrative reports, remains largely inaccessible.
Although institutions are rapidly digitizing pathology workflows, storing data without effective mechanisms for retrieval and reasoning risks turning archives into passive data repositories, where institutional knowledge exists but cannot meaningfully inform patient care. True progress requires not only digitization but also the ability for pathologists to interrogate similar prior cases in real time while evaluating a new diagnostic dilemma.

We present PathoScribe, a unified retrieval-augmented large language model (LLM) framework designed to transform static pathology archives into a searchable, reasoning-enabled living library. PathoScribe enables natural language case exploration, automated cohort construction, clinical question answering, immunohistochemistry (IHC) panel recommendation, and prompt-controlled report transformation within a single architecture.

Evaluated on 70,000 multi-institutional surgical pathology reports, PathoScribe achieved perfect Recall@10 for natural language case retrieval and demonstrated high-quality retrieval-grounded reasoning (mean reviewer score 4.56/5). Critically, the system operationalized automated cohort construction from free-text eligibility criteria, assembling research-ready cohorts in minutes (mean 9.2 minutes) with 91.3% agreement with human reviewers and no eligible cases incorrectly excluded, representing orders-of-magnitude reductions in time and cost compared to traditional manual chart review.

By unifying retrieval and reasoning, PathoScribe enables pathologists to move from isolated case interpretation toward data-informed decision-making grounded in institutional precedent. This work establishes a scalable foundation for converting digital pathology archives from passive storage systems into active clinical intelligence platforms.

Subject areas: Biological sciences/Cancer; Biological sciences/Computational biology and bioinformatics; Health sciences/Health care; Physical sciences/Mathematics and computing; Health sciences/Medical research

Additional Declarations
No competing interests reported.

Supplementary Files
PathoScribenpjDigitalMedsupplementary.docx

Author affiliations
The Ohio State University Wexner Medical Center: Abdul Rehman Akbar (corresponding author), Samuel Wales-McGrath, Alejadro Levya, Lina Gokhale, Wei Chen, Anil Parwani, Muhammad Khalid Khan Niazi
The University of Pennsylvania: Rajendra Singh

Posted: March 12th, 2026