Bridging Medical Imaging and Reports: Learning Radiologist's Nuances via Fine-Grained Multi-Modal Alignment

Article, Research Square

Xiang Li, Wenting Chen, Hui Ren, Yujin Oh, Yihan Cao, Elshaimaa Sharaf, and 7 more

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-6002276/v1
This work is licensed under a CC BY 4.0 License.
Status: Posted. Version 1 posted.

Abstract

Precise and explainable alignment between data from different modalities is crucial for advancing artificial general intelligence in medicine.
In this work, we present CAMMAL (Cyclic Adaptive Medical Modality ALignment), a framework that achieves fine-grained vision-language alignment through two key innovations: an Adaptive Patch-Word Matching (AdaMatch) mechanism that dynamically correlates regions in medical images with specific words in radiology reports, and a bidirectional generative architecture that leverages the alignment between textual and visual codebooks to guide translation between modalities within a single model. Evaluation of CAMMAL on chest X-rays (combined MIMIC-CXR and OpenI datasets) and mammography (EMBED dataset) demonstrates superior performance over other methods across multiple metrics. Human reader studies by radiologists validate the clinical effectiveness of the generated reports and synthetic images, particularly in capturing anatomical structures (73% rated good/excellent) and meaningful findings (56% rated very good/excellent). By enabling systematic capture of inter-modality correspondence in medical data and fluid multi-way translation between modalities, CAMMAL could advance the development of interpretable and clinically reliable AI systems by bridging the gap between medical images and textual descriptions. Its ability to accurately retrieve, generate, and align reports with imaging data offers potential improvements in learning radiologists' detailed knowledge, which could be vital for AI-assisted diagnosis and medical education. The bidirectional generative approach further enhances model transparency and explainability, fostering trust in AI-driven healthcare applications. As multimodal medical foundation models continue to evolve, CAMMAL provides a foundation for integrating vision and language understanding, with the potential for broader clinical applications beyond radiology.
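The abstract does not specify how AdaMatch computes its patch-word correspondences. As a rough illustration of the general idea of matching report words to image patches, the sketch below assumes each patch and each word already has an embedding vector and simply pairs every word with its most similar patch by cosine similarity. The function names and toy embeddings are hypothetical, and this is not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def match_words_to_patches(patch_embeddings, word_embeddings):
    """For each word embedding, return (index of most similar patch, similarity)."""
    matches = []
    for word in word_embeddings:
        sims = [cosine(patch, word) for patch in patch_embeddings]
        best = max(range(len(sims)), key=sims.__getitem__)
        matches.append((best, sims[best]))
    return matches


# Toy embeddings with made-up values, for illustration only.
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
words = [[2.0, 0.0], [0.0, 3.0]]
matches = match_words_to_patches(patches, words)
for word_index, (patch_index, sim) in enumerate(matches):
    print(f"word {word_index} -> patch {patch_index} (similarity {sim:.3f})")
```

A hard argmax is only the simplest possible choice; a learned or adaptive matching scheme, as the name AdaMatch suggests, would presumably replace it.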
Subject areas
Health sciences/Health care/Medical imaging
Biological sciences/Computational biology and bioinformatics/Data integration

Additional Declarations
There is NO Competing Interest.

Posted: March 25th, 2025

Authors and affiliations
Xiang Li (corresponding author), Massachusetts General Hospital, https://orcid.org/0000-0002-9851-6376
Wenting Chen, Department of Electrical Engineering, City University of Hong Kong, https://orcid.org/0000-0002-7457-9540
Hui Ren, Massachusetts General Hospital
Yujin Oh, Department of Radiology, Massachusetts General Hospital, https://orcid.org/0000-0003-4319-8435
Yihan Cao, Massachusetts General Hospital
Elshaimaa Sharaf, Massachusetts General Hospital
Jiebo Luo, University of Rochester, https://orcid.org/0000-0002-4516-9729
Hong-Yu Zhou, Harvard Medical School
Lichao Sun, Lehigh University, https://orcid.org/0000-0003-1539-7939
Tianming Liu, University of Georgia
Linlin Shen, Shenzhen University, https://orcid.org/0000-0003-1420-0815
Quanzheng Li, Center for Advanced Medical Computing and Analysis, Department of Radiology, Massachusetts General Hospital, Harvard Medical School, Boston, MA
Yixuan Yuan, The Chinese University of Hong Kong