Feature-level interaction and adaptive fusion model based on cross-modal attention for audiovisual emotion recognition

Research Article
Shuqiu Tan, Chunsheng Tan

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-8119217/v1
This work is licensed under a CC BY 4.0 License.

Keywords: Attention mechanisms, Audio-Visual, Emotion recognition, Multimodal fusion
Additional Declarations: No competing interests reported.

Status: Published
Journal Publication published 16 Jan, 2026. Read the published version in Signal, Image and Video Processing.
Version 1 posted. You are reading this latest preprint version.
Editorial decision: Revision requested 03 Dec, 2025
Reviews received at journal 02 Dec, 2025
Reviewers agreed at journal 19 Nov, 2025
Reviews received at journal 18 Nov, 2025
Reviewers agreed at journal 18 Nov, 2025
Reviewers agreed at journal 18 Nov, 2025
Reviewers invited by journal 18 Nov, 2025
Editor assigned by journal 17 Nov, 2025
Submission checks completed at journal 17 Nov, 2025
First submitted to journal 14 Nov, 2025

Abstract

Emotion recognition has significant applications in fields such as natural language processing, computer vision, and speech recognition. However, traditional unimodal methods struggle to comprehensively capture the diversity of emotional expressions, while existing multimodal methods often focus on the textual modality and do not sufficiently explore feature-level correlations. To address this, this paper proposes a feature-level interaction and adaptive fusion model based on cross-modal attention. Specifically, the model first extracts emotional representations from the audio and visual modalities and aligns them in a shared space. Subsequently, a self-attention module performs intra-modal modeling to capture temporal dependencies within each modality. Simultaneously, we propose a cross-modal attention computation method based on feature-level interaction to explore fine-grained correlations and information complementarity between modalities at the temporal and feature levels. Finally, an adaptive fusion strategy automatically learns modal weights, further enhancing modal complementarity. Experimental results demonstrate that the proposed model exhibits superior performance on both the RAVDESS and IEMOCAP datasets, effectively improving the accuracy and robustness of multimodal sentiment analysis. The code is available at https://github.com/cstan-chun/MAMF/tree/master.
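The pipeline outlined in the abstract (intra-modal self-attention, cross-modal attention, then adaptive fusion with learned modal weights) can be illustrated with a minimal sketch. This is not the authors' implementation: it uses plain scaled dot-product attention as a stand-in for the paper's feature-level cross-modal attention, omits all learned projections, and the `gate` vector, the helper names, and the toy inputs are hypothetical. The authors' actual code is in the repository linked in the abstract.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention; each argument is a list of feature vectors."""
    scale = math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

def mean_pool(seq):
    """Average a sequence of feature vectors over time."""
    return [sum(col) / len(seq) for col in zip(*seq)]

def adaptive_fusion(audio_vec, visual_vec, gate):
    """Combine two pooled modality vectors using softmax-normalised weights.

    `gate` is a hypothetical parameter vector standing in for weights that
    would be learned during training.
    """
    scores = [sum(g * x for g, x in zip(gate, vec))
              for vec in (audio_vec, visual_vec)]
    w_audio, w_visual = softmax(scores)
    fused = [w_audio * a + w_visual * v for a, v in zip(audio_vec, visual_vec)]
    return fused, (w_audio, w_visual)

# Toy inputs (assumed): 3 audio steps and 2 visual steps in a shared 2-d space.
audio = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
visual = [[0.2, 0.8], [0.9, 0.1]]

# Intra-modal self-attention: each modality attends to itself.
audio_intra = attention(audio, audio, audio)
visual_intra = attention(visual, visual, visual)

# Cross-modal attention: each modality queries the other.
audio_cross = attention(audio_intra, visual_intra, visual_intra)
visual_cross = attention(visual_intra, audio_intra, audio_intra)

# Adaptive fusion of the time-pooled representations.
fused, weights = adaptive_fusion(
    mean_pool(audio_cross), mean_pool(visual_cross), gate=[0.3, -0.1]
)
```

In this sketch the modal weights always sum to one, so the fused vector is a convex combination of the two pooled modality vectors; a trained model would learn the gate (and the attention projections) from data rather than fixing them by hand.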
Author affiliations: Shuqiu Tan and Chunsheng Tan (corresponding author), Chongqing University of Technology.
Preprint version 1 posted November 26th, 2025.
Version of record: https://doi.org/10.1007/s11760-025-05079-3 (Signal, Image and Video Processing, published January 16th, 2026).
License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/).