Large Language Model-Assisted Data Extraction in Systematic Reviews: A Case Study of Automated Comparative Performance Evaluation

Research Article (Research Square preprint)

Arun Varghese, Vidhi Verma, Tom Feiler, Karie Riley, Farha Zindah, Meredith Clemons, Anthony Hannani, Samantha Snow, Kevin Hobbie, Jessica Wignall

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-9407086/v1
This work is licensed under a CC BY 4.0 License.
Status: Posted. Version 1, posted May 5th, 2026.

Abstract

Artificial intelligence (AI) and machine learning technologies are now a mainstream means of increasing efficiency in systematic literature reviews in the health sciences and related fields. Automation in systematic reviews has largely focused on information retrieval and study screening, with comparatively limited progress in automating the more time-intensive data extraction step. The advent of generative AI based on large language models (LLMs), which promise out-of-the-box PhD-level expertise across a range of domains, offers a readily implementable means of automating the data extraction step. This case study attempts to address the following interrelated real-world problems in the context of LLMs in data extraction for systematic reviews: (i) how well do LLMs perform at question answering to support data extraction? (ii) how can the best-performing LLM be identified? (iii) can LLM-generated responses be automatically scored? and (iv) how closely do automated performance metrics compare to human scores? The experimental and statistical frameworks developed in this case study could be replicated to help researchers and regulators efficiently determine the optimal automation approach for their projects.

Keywords: literature review, systematic review, artificial intelligence, large language models, machine learning, natural language processing

Additional Declarations: No competing interests reported.