GenSPARC: Generalized Structure- and Property-Aware Representations of Language Models for Compound-Protein Interaction Prediction

Atsuhiro Tomita, Yiming Zhang, Mizuki Takemoto, Ryuichiro Ishitani

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-6326235/v1
This work is licensed under a CC BY 4.0 License.
Status: Published. The journal publication appeared on 19 Dec, 2025; the published version is in Communications Chemistry. Version 1 is the latest preprint version.

Abstract

Compound-protein interaction (CPI) prediction plays a crucial role in drug discovery by aiding the identification of binding and affinities between small molecules and proteins. Current deep learning models rely heavily on sequence-based representations and suffer from a lack of labeled data, which restricts their accuracy and generalizability. To overcome these challenges, we propose GenSPARC (Generalized Structure and Property Aware Representation for CPI prediction), a deep learning model that leverages structure-aware protein representations derived from AlphaFold2 predictions and Foldseek’s 3D interaction alphabet. Compound features were extracted using graph convolutional networks and a pretrained chemical language model, thereby ensuring comprehensive multimodal representation. A novel attention mechanism further enhanced interaction modeling by capturing intricate binding patterns. GenSPARC was validated successfully with multiple CPI benchmark datasets, demonstrating strong generalizability across challenging data splits and competitive results in virtual screening tasks. Therefore, GenSPARC will substantially advance artificial intelligence-driven drug discovery.

Subject areas:
Biological sciences/Computational biology and bioinformatics/Virtual drug screening
Biological sciences/Drug discovery/Drug screening/Virtual screening
Biological sciences/Computational biology and bioinformatics/Machine learning

Additional Declarations
There is NO Competing Interest.

Supplementary Files
SupplementaryInformation.pdf
Author affiliations
Atsuhiro Tomita (corresponding author), Preferred Networks, Inc., ORCID: https://orcid.org/0000-0003-3887-1429
Yiming Zhang, Institute of Science Tokyo
Mizuki Takemoto, Preferred Networks, Inc.
Ryuichiro Ishitani, Institute of Science Tokyo

Preprint posted: May 8th, 2025
Version of record: Communications Chemistry, published December 19th, 2025, https://doi.org/10.1038/s42004-025-01844-0
License: CC BY 4.0, https://creativecommons.org/licenses/by/4.0/