Multi-scale and Multi-feature fusion speech emotion recognition based on cross-attention

doi:10.21203/rs.3.rs-5859778/v1

2025 · doi:10.21203/rs.3.rs-5859778/v1

preprint OA: closed CC-BY-4.0

Research Article · Research Square

Ning Li, Wenjiao Zhang

This is a preprint; it has not been peer reviewed by a journal.

https://doi.org/10.21203/rs.3.rs-5859778/v1

This work is licensed under a CC BY 4.0 License.

Status: Under Review · Version 1

Reviewers invited by journal: 05 May, 2026
Editor assigned by journal: 29 Jan, 2025
Submission checks completed at journal: 29 Jan, 2025
First submitted to journal: 19 Jan, 2025

Additional Declarations: No competing interests reported.

Abstract

Speech emotion recognition (SER), which aims to help machines understand human emotions from speech, has become an integral component of human-computer interaction (HCI). The field faces two critical challenges. First, existing CNNs are limited in their ability to capture rich emotional features at different scales. Second, existing methods struggle to fuse information from multiple features effectively.

This paper proposes a multi-scale, multi-feature fusion SER model based on cross-attention. First, in line with the characteristics of MFCC and the log-Mel spectrogram, 1D and 2D convolutions are used to extract high-level features from each, respectively. Second, a residual multi-scale module is added to the convolutional neural networks to capture high-level emotional features at different scales and obtain richer fine-grained emotional features. Third, the features produced by the convolutional networks are fused by a cross-attention module, which explicitly models the fine-grained interaction between the features and improves the effectiveness of multi-feature fusion. Finally, the fused features are fed to a BiLSTM to extract temporal features, and its output is fed into a fully connected classifier for emotion recognition. Experimental results on the benchmark dataset IEMOCAP show that the method improves WA and UA by 1.67% and 2.20%, respectively, compared with other methods.

Keywords: speech emotion recognition, multi-scale, multi-feature, cross-attention
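
The cross-attention fusion step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes plain scaled dot-product attention with no learned query/key/value projections, two feature streams of equal length, and fusion by per-frame concatenation. The function names (`cross_attention`, `fuse`) and the toy feature values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention where one stream queries the other.

    queries: list of T_q vectors (dimension d)
    keys:    list of T_k vectors (dimension d)
    values:  list of T_k vectors (any dimension)
    Returns (outputs, weights): one context vector and one attention
    distribution per query frame.
    """
    dim = len(queries[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query frame to every frame of the other stream.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys
        ]
        attn = softmax(scores)
        # Weighted sum of the other stream's value vectors.
        context = [
            sum(a * v[j] for a, v in zip(attn, values))
            for j in range(len(values[0]))
        ]
        outputs.append(context)
        weights.append(attn)
    return outputs, weights


def fuse(mfcc_feats, mel_feats):
    """Attend in both directions and concatenate the results per frame.

    Assumption: both streams have the same number of frames.
    """
    mfcc_ctx, _ = cross_attention(mfcc_feats, mel_feats, mel_feats)
    mel_ctx, _ = cross_attention(mel_feats, mfcc_feats, mfcc_feats)
    return [a + b for a, b in zip(mfcc_ctx, mel_ctx)]  # list concatenation


# Toy stand-ins for the CNN outputs of the two branches (3 frames, 2 dims).
mfcc_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mel_feats = [[0.5, 0.5], [1.0, 0.0], [0.0, 2.0]]
fused = fuse(mfcc_feats, mel_feats)
```

In a real model the fused sequence would then go to the BiLSTM and classifier, and the attention would normally use learned projections and a tensor library rather than Python lists.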

Authors and affiliations

Ning Li (corresponding author), Zhengzhou Business University
Wenjiao Zhang, Zhengzhou Business University

Journal: International Journal of Speech Technology (http://link.springer.com/journal/10772)

Posted: January 31st, 2025

Manuscript PDF: https://assets-eu.researchsquare.com/files/rs-5859778/v1_covered_144216b2-60af-4962-b63a-d91d4cf5ac49.pdf

© Research Square 2026 | ISSN 2693-5015 (online)
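
The abstract reports gains in WA and UA without defining them. Assuming the definitions common in SER work (WA as overall accuracy across all utterances, UA as the unweighted mean of per-class recalls), the two metrics can be computed as in the sketch below. The function names and the toy labels are hypothetical and are not taken from the paper.

```python
from collections import Counter


def weighted_accuracy(y_true, y_pred):
    """WA: fraction of all utterances classified correctly."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """UA: mean of per-class recalls, so every class counts equally."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Toy, imbalanced example: the majority class dominates WA but not UA.
y_true = ["neutral"] * 6 + ["angry"] * 2 + ["sad"] * 2
y_pred = ["neutral"] * 6 + ["angry", "neutral"] + ["neutral", "neutral"]

wa = weighted_accuracy(y_true, y_pred)    # 7 of 10 correct
ua = unweighted_accuracy(y_true, y_pred)  # (6/6 + 1/2 + 0/2) / 3
```

On imbalanced emotion datasets the two numbers can diverge noticeably, which is why SER results are usually reported with both.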

Source provenance

europepmc: last seen: 2026-05-20T01:45:00.602351+00:00
unpaywall: last seen: 2026-05-23T02:00:01.238055+00:00

License: CC-BY-4.0