{"paper_id":"1dc3ba0a-6279-40a0-8d74-112bcd30a9ba","body_text":"Zero-Shot De Novo Peptide Sequencing with Open Post-Translational Modification Discovery | Research Square Biological Sciences - Article Zeping Mao, Chao Peng, Yuling Chen, Ping Wu, Qianqiu Zhang, Lei Xin, and 2 more This is a preprint; it has not been peer reviewed by a journal. https://doi.org/10.21203/rs.3.rs-6950964/v1 This work is licensed under a CC BY 4.0 License Status: Posted Version 1 posted Abstract Proteins play essential roles in biology, yet identifying their precise sequences and modifications remains challenging. De novo peptide sequencing offers a powerful solution by directly inferring sequences from mass spectrometry data without relying on protein databases. 
Recent deep learning models have significantly advanced this task but remain trapped in a major dilemma: they require labeled training data to recognize post-translational modifications (PTMs), and such data is unavailable for most biologically relevant but rare or unknown modifications. We solve this long-standing problem by introducing RNovA, a transformer-based de novo sequencing algorithm enhanced with relative positional embeddings and reinforcement learning. RNovA enables open PTM discovery in a zero-shot setting, without retraining or a predefined list of candidate residues, while maintaining state-of-the-art performance on standard benchmarks. Demonstrating this capability, we successfully identified peptides modified by kynurenine, an uncommon and biologically relevant PTM, in clinical samples from rheumatoid arthritis patients. RNovA overcomes key limitations of existing methods and enables the exploration of the “dark proteome,” including novel proteins and unexpected modifications. This capability is widely needed in immunology, biomarker discovery, and biomedical research. Subject areas: Biological sciences/Computational biology and bioinformatics/Proteome informatics; Biological sciences/Biological techniques/Proteomic analysis. Keywords: De Novo Peptide Sequencing, Zero Shot Learning, Open PTM Search, Reinforcement Learning. Additional Declarations: Yes, there is a potential competing interest. L.X. is an employee of Bioinformatics Solutions Inc. The other authors declare no competing interests. 
© Research Square 2026 | ISSN 2693-5015 (online) Authors and affiliations: Zeping Mao (University of Waterloo, ORCID https://orcid.org/0000-0003-1194-9118); Chao Peng (Baizhen Biotechnologies Inc., Wuhan, 430074, China); Yuling Chen (Tsinghua University); Ping Wu (Baizhen Biotechnologies Inc.); Qianqiu Zhang (University of Waterloo); Lei Xin (Bioinformatics Solutions Inc); Haiteng Deng (Tsinghua University, ORCID https://orcid.org/0000-0001-9496-1280); Ming Li (University of Waterloo, corresponding author). Posted: June 27th, 2025. License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/).","source_license":"CC-BY-4.0","license_restricted":false}