Comprehensive Datasets for RNA Design, Machine Learning, and Beyond

Jan Badura, Agnieszka Rybarczyk, Tomasz Zok (corresponding author)
Poznań University of Technology

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-6146242/v1
This work is licensed under a CC BY 4.0 License (https://creativecommons.org/licenses/by/4.0/).

Status: Published
Journal publication published 01 Jul, 2025 in Scientific Reports: https://doi.org/10.1038/s41598-025-07041-2
Version 1 posted 12 Mar, 2025

Abstract

RNA molecules are essential in regulating biological processes such as gene expression, cellular differentiation, and development. Accurately predicting RNA secondary structures and designing sequences that fold into specific configurations remain significant challenges in computational biology, with far-reaching implications for medicine, synthetic biology, and biotechnology. While machine learning methodologies have been proposed to enhance prediction capabilities, they require high-quality training data. The lack of standardized benchmark datasets further hinders the development and evaluation of these tools.

To address this, we created a comprehensive dataset of over 320 thousand instances from experimentally validated sources to establish a new community-wide benchmark for RNA design and modeling algorithms. Our dataset comprises numerous challenging structures for which state-of-the-art RNA inverse folders provide results of varying accuracy. We demonstrated the potential of the dataset by testing it with several popular open-source RNA design algorithms. Furthermore, we illustrated how our dataset can be used to train machine learning models that consider both RNA sequence and structure, potentially advancing RNA design and prediction capabilities.

Subject areas
Biological sciences/Computational biology and bioinformatics
Biological sciences/Computational biology and bioinformatics/Data processing

Additional Declarations
No competing interests reported.

Editorial history (Scientific Reports)
Editorial decision: Revision requested 08 Apr, 2025
Reviews received at journal 07 Apr, 2025
Reviews received at journal 02 Apr, 2025
Reviewers agreed at journal 24 Mar, 2025
Reviewers agreed at journal 24 Mar, 2025
Reviewers invited by journal 23 Mar, 2025
Editor assigned by journal 14 Mar, 2025
Editor invited by journal 11 Mar, 2025
Submission checks completed at journal 10 Mar, 2025
First submitted to journal 03 Mar, 2025
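The abstract notes that RNA inverse folders return designs of varying accuracy for the benchmark structures. As a minimal illustrative sketch, and not the authors' evaluation protocol, one common way to score a design is the base-pair distance between the target structure and the structure that the designed sequence is predicted to fold into. The sketch assumes both structures are given in pseudoknot-free dot-bracket notation; the function names `parse_pairs` and `base_pair_distance` are hypothetical and do not come from the paper.

```python
def parse_pairs(structure: str) -> set[tuple[int, int]]:
    """Return the set of (i, j) base pairs encoded by a dot-bracket string.

    Assumption: only '(', ')' and '.' are used (no pseudoknot brackets).
    """
    stack: list[int] = []
    pairs: set[tuple[int, int]] = set()
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {i}")
            pairs.add((stack.pop(), i))
        elif ch != ".":
            raise ValueError(f"unsupported character {ch!r} at position {i}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return pairs


def base_pair_distance(target: str, predicted: str) -> int:
    """Count base pairs present in exactly one of the two structures."""
    if len(target) != len(predicted):
        raise ValueError("structures must have the same length")
    return len(parse_pairs(target) ^ parse_pairs(predicted))


if __name__ == "__main__":
    target = "((..))"
    predicted = "(....)"  # hypothetical fold of a designed sequence
    print(base_pair_distance(target, predicted))
```

A distance of zero means the predicted fold matches the target exactly; larger values indicate a design that misses or adds base pairs. Other measures (for example, fraction of correctly recovered pairs) could be substituted in the same way.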
