Accurately predicting optimal conditions for microorganism proteins through geometric graph learning and language model

Yuedong Yang, Mingming Zhu, Yidong Song, Qianmu Yuan

Research Square preprint (Article). This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-4497903/v1
This work is licensed under a CC BY 4.0 License.
Status: Published. Version 1 posted 20 Jun, 2024. Journal publication published 29 Dec, 2024 in Communications Biology (https://doi.org/10.1038/s42003-024-07436-3).

Abstract

Proteins derived from microorganisms that survive in the harshest environments on Earth have stable activity under extreme conditions, providing rich resources for industrial applications and enzyme engineering. Because experimental determination is time-consuming, computational models are needed for fast and accurate prediction of protein optimal conditions. Previous studies were limited by the scarcity of data and the neglect of protein structures. To solve these problems, we construct an up-to-date dataset with more than 6 million proteins and propose GeoPoc, a method based on geometric graph learning for predicting protein optimal temperature, pH, and salt concentration. GeoPoc leverages protein structures and sequence embeddings extracted from a pre-trained language model, and further employs a geometric graph transformer network to capture sequence and spatial information. We first assessed robustness through in-house validation of optimal temperature prediction, achieving a PCC of 0.77. The algorithm was further confirmed on an independent test set, where GeoPoc surpasses the state-of-the-art method by 2.3% in AUC. Additionally, GeoPoc was extended to pH and salt concentration prediction, obtaining AUC scores of 0.78 and 0.77, respectively. Through further interpretability analysis, GeoPoc elucidates the critical physicochemical properties that contribute to enhancing protein thermostability.

Additional Declarations

There is NO Competing Interest.

Supplementary Files

Supplementary.pdf
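
The abstract reports results as a Pearson correlation coefficient (PCC) for the temperature regression and as AUC for the classification-style comparisons. As a reading aid, here is a minimal sketch of how these two metrics are conventionally computed. It is not the authors' evaluation code; the function names `pearson_cc` and `roc_auc` and all toy values are hypothetical, and the abstract does not say how the class labels behind the AUC scores were defined.

```python
from math import sqrt


def pearson_cc(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0
        for p in pos
        for q in neg
    )
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Hypothetical toy values, not data from the paper.
    measured_temp = [30.0, 37.0, 55.0, 70.0, 85.0]   # degrees Celsius
    predicted_temp = [32.0, 35.0, 60.0, 65.0, 90.0]
    print("PCC:", pearson_cc(measured_temp, predicted_temp))

    # Hypothetical binary labels (1 = positive class) with model scores.
    labels = [0, 0, 1, 1]
    scores = [0.1, 0.4, 0.35, 0.8]
    print("AUC:", roc_auc(labels, scores))
```

The pairwise AUC formulation is quadratic in the number of samples, which is fine for illustration; for datasets on the scale mentioned in the abstract, a rank-based implementation from a standard library would be the usual choice.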