Impact Classification within and beyond Academia: Domain-Robust Annotation and the Capacity of Large Language Models

Research Square | Research Article

Maria Becker, Kanyao Han, Rezvaneh Rezapour, Jana Diesner, Andreas Witt

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-5543205/v1
This work is licensed under a CC BY 4.0 License
Status: Under Review
Version 1

Abstract

Prior analyses and assessments of the impact of scientific research have mainly relied on analyzing its scope within academia and its influence within scholarly circles. However, by not considering the broader societal, economic, and policy implications of research projects, these studies overlook the ways in which scientific discoveries contribute to technological innovation, public health improvements, environmental sustainability, and other areas of real-world application.
We expand upon this prior work by developing and validating a conceptual and computational solution to automatically identify and categorize the impact of scientific research within and especially beyond academia based on text data. We first empirically develop and evaluate an annotation schema to capture and classify the impact of research projects based on research reports from different scientific domains. We then annotate a large dataset of more than 45k sentences extracted from research reports for the developed impact categories. We examine the annotated dataset for patterns in the distribution of impact categories across different scientific domains, co-occurrences of impact categories, and signal words of impact. Using the annotated texts and the novel classification schema, we investigate the performance of large language models (LLMs) for automated impact classification. Our results show that fine-tuning the models on our annotated datasets statistically significantly outperforms zero- and few-shot prompting approaches. This indicates that state-of-the-art LLMs without fine-tuning may not work well for novel classification schemas such as our impact classification schema, and in turn highlights the importance of diligent manual annotations as an empirical basis in the field of computational social science.

Keywords: Impact classification, Sentence-level annotation, Social impact, Project report, Cross-domain, Large language model

Additional Declarations: No competing interests reported.
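The co-occurrence analysis of impact categories mentioned in the abstract can be illustrated with a minimal sketch. It assumes each annotated sentence carries a set of category labels; the label names, the record layout, and the helper `cooccurrence_counts` are hypothetical and are not taken from the paper's schema or dataset.

```python
from collections import Counter
from itertools import combinations

# Hypothetical labels and sentences, for illustration only;
# they are not the paper's annotation schema or data.
annotations = [
    {"sentence": "s1", "labels": {"economic", "technological"}},
    {"sentence": "s2", "labels": {"societal"}},
    {"sentence": "s3", "labels": {"economic", "technological", "policy"}},
    {"sentence": "s4", "labels": set()},
]


def cooccurrence_counts(rows):
    """Count how often each unordered pair of labels shares a sentence."""
    counts = Counter()
    for row in rows:
        # sorted() gives every pair a canonical order, so (a, b) and (b, a)
        # are counted under the same key.
        for pair in combinations(sorted(row["labels"]), 2):
            counts[pair] += 1
    return counts


pair_counts = cooccurrence_counts(annotations)
for pair, n in pair_counts.most_common():
    print(pair, n)
```

Sentences with zero or one label contribute no pairs, so the counts reflect only sentences annotated with several categories at once.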
Editorial decision: Revision requested (18 Jul, 2025)
Reviews received at journal (12 Feb, 2025)
Reviewers agreed at journal (17 Jan, 2025)
Reviewers agreed at journal (16 Jan, 2025)
Reviewers agreed at journal (14 Jan, 2025)
Reviewers invited by journal (14 Jan, 2025)
Editor assigned by journal (14 Jan, 2025)
Submission checks completed at journal (29 Nov, 2024)
First submitted to journal (28 Nov, 2024)
Authors and affiliations:
Maria Becker (corresponding author), Leibniz Institute for the German Language
Kanyao Han, University of Illinois Urbana-Champaign
Rezvaneh Rezapour, Drexel University
Jana Diesner, Technical University of Munich
Andreas Witt, Leibniz Institute for the German Language

Submitted to: Language Resources and Evaluation
Version 1 posted: January 23rd, 2025
© Research Square 2026 | ISSN 2693-5015 (online)