Machine Learning for Sentiment-Based Corporate Disclosure Analytics: A Systematic Review of Data, Sentiment Representations, and Predictive Models

Systematic Review

Ramon Abilio, Guilherme Palermo Coelho, Ana Estela Antunes da Silva

Research Square preprint. This is a preprint; it has not been peer reviewed by a journal.
DOI: https://doi.org/10.21203/rs.3.rs-9053199/v1
License: This work is licensed under a CC BY 4.0 License.
Status: Posted (Version 1)

Abstract

Machine learning methods have been widely used to predict stock prices using technical indicators and sentiment features, mostly extracted from social media and news. However, less attention has been given to how sentiment-based textual features obtained from corporate reports are integrated into machine learning pipelines to predict firms' financial outcomes. To examine this issue, we conducted a systematic review of 42 studies published between 2014 and 2025. The review examines how datasets are constructed, how sentiment representations are defined, and how predictive models combine textual features with financial variables. Most studies focus on the U.S. stock market and rely on feature-engineered sentiment indices derived from lexicons or sentence-level classification. Regression-based and other supervised learning approaches remain dominant, while embedding-based representations and end-to-end deep learning architectures appear only sporadically. The literature also reveals constraints, including challenges in processing long financial documents, limited availability of labeled datasets, and strong geographic and linguistic concentration. In addition, the review identifies highly heterogeneous modeling approaches with limited convergence toward shared benchmark tasks. These findings highlight research opportunities for machine learning applications in finance and for the development of sentiment-based corporate disclosure analytics.

Keywords: deep learning, dataset, sentiment analysis, sentiment index, financial reports, stocks

Additional Declarations

Competing interest reported. The Article Processing Charge (APC) for the publication of this research was funded by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES) (ROR identifier: 00x0ma614). For the purposes of open access, the authors have applied the Creative Commons CC BY license to any accepted version of the manuscript.
Author affiliations: Ramon Abilio (corresponding author), Instituto Federal de São Paulo (IFSP); Guilherme Palermo Coelho, Universidade Estadual de Campinas (Unicamp); Ana Estela Antunes da Silva, Universidade Estadual de Campinas (Unicamp).
Posted: March 12, 2026. Research Square, ISSN 2693-5015 (online).