DCFNet: Dual-Branch Collaborative Fusion Network for Fine-Grained Visual Classification | Research Square

Research Article
Yang Qiao, Min Zuo, Zhiguo Yu, Xiaofeng Gu

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-8803652/v1
This work is licensed under a CC BY 4.0 License
Status: Under Review
Version 1 posted 9
You are reading this latest preprint version

Abstract
Fine-grained visual classification aims to distinguish subcategories with subtle visual differences under high inter-class similarity. While auxiliary textual semantics provide supplementary information, existing multimodal methods still face a limitation in balancing global semantic consistency and local discriminative details. To address this limitation, we propose a Dual-Branch Collaborative Fusion Network (DCFNet), comprising two synergistic branches designed to decouple feature learning across granularities.
Specifically, we design a cross-modal consistency alignment branch to calibrate the global semantic space. Complementarily, we construct a cross-modal transformer fusion branch to achieve fine-grained local feature interaction. This dual-branch collaboration maintains high-level semantic consistency while accurately capturing fine-grained discriminative cues. Extensive experiments and ablation studies on the CUB-200-2011, Con-Text, and Drink Bottle datasets demonstrate that DCFNet achieves competitive performance, providing an innovative solution for fine-grained visual classification tasks.

Keywords: Fine-grained visual classification, Dual-branch collaboration, Cross-modal alignment, Feature interaction

Additional Declarations
No competing interests reported.

Editorial timeline
Editorial decision: Revision requested: 20 Apr, 2026
Reviews received at journal: 20 Apr, 2026
Reviews received at journal: 10 Apr, 2026
Reviewers agreed at journal: 04 Apr, 2026
Reviewers agreed at journal: 23 Mar, 2026
Reviewers invited by journal: 17 Mar, 2026
Editor assigned by journal: 07 Feb, 2026
Submission checks completed at journal: 07 Feb, 2026
First submitted to journal: 06 Feb, 2026
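The abstract describes the two branches only at a high level, so the following is a toy sketch of the general idea rather than the authors' implementation. It assumes, purely for illustration, that the global branch scores image-text agreement by cosine similarity of mean-pooled features and that the local branch lets each image token attend over text tokens with scaled dot-product attention. All function names (`alignment_score`, `cross_attention`, `dual_branch_features`) and the residual connection are hypothetical choices, not details taken from the paper.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(dot(v, v)) or 1.0
    return [a / norm for a in v]


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def mean_pool(tokens):
    """Average a list of token vectors into one global vector."""
    n = len(tokens)
    return [sum(col) / n for col in zip(*tokens)]


def alignment_score(image_tokens, text_tokens):
    """Global branch (assumed form): cosine similarity of pooled features."""
    g_img = l2_normalize(mean_pool(image_tokens))
    g_txt = l2_normalize(mean_pool(text_tokens))
    return dot(g_img, g_txt)


def cross_attention(image_tokens, text_tokens):
    """Local branch (assumed form): image tokens attend over text tokens."""
    scale = math.sqrt(len(image_tokens[0]))
    fused = []
    for q in image_tokens:
        weights = softmax([dot(q, k) / scale for k in text_tokens])
        attended = [
            sum(w * t[d] for w, t in zip(weights, text_tokens))
            for d in range(len(q))
        ]
        # Hypothetical residual connection: keep the visual token, add text context.
        fused.append([a + b for a, b in zip(q, attended)])
    return fused


def dual_branch_features(image_tokens, text_tokens):
    """Return the global alignment score and a pooled, locally fused feature."""
    score = alignment_score(image_tokens, text_tokens)
    local = mean_pool(cross_attention(image_tokens, text_tokens))
    return score, local


if __name__ == "__main__":
    image_tokens = [[1.0, 0.0], [0.0, 1.0]]  # two toy image patch features
    text_tokens = [[1.0, 1.0]]               # one toy text token feature
    print(dual_branch_features(image_tokens, text_tokens))
```

In a trained model the global score would typically feed an alignment loss and the fused local feature would feed a classifier; how DCFNet actually combines the two branches is not stated in the abstract.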
Author affiliations
Yang Qiao, Jiangnan University
Min Zuo, Jiangnan University
Zhiguo Yu, Jiangnan University (corresponding author)
Xiaofeng Gu, Jiangnan University

Journal under review: Signal, Image and Video Processing (http://link.springer.com/journal/11760)
Version 1 posted: March 19th, 2026
License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
© Research Square 2026 | ISSN 2693-5015 (online)