Instruction Multi-Constraint Molecular Generation Using a Teacher-Student Large Language Model

2024 · doi:10.21203/rs.3.rs-3845824/v1

preprint OA: closed


Article · Research Square preprint · Version 1, posted March 20th, 2024 · Status: Under Review (journal: Nature Communications)

Authors
Xiangxiang Zeng (Hunan University; corresponding author; ORCID 0000-0003-1081-7658)
Peng Zhou (Hunan University)
Jianmin Wang (Yonsei University; ORCID 0000-0001-8910-0929)
Chunyan Li (Yunnan Normal University)
Zixu Wang (University of Tsukuba)
Yiping Liu (Hunan University)
Siqi Sun (Fudan University; ORCID 0000-0001-7240-8724)
Jianxin Lin (Hunan University)
Longyue Wang (Tencent AI Lab)

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-3845824/v1
This work is licensed under a CC BY 4.0 License (https://creativecommons.org/licenses/by/4.0/).

Abstract
While various models and computational tools have been proposed for structure and property analysis of molecules, generating molecules that conform to all desired structures and properties remains a challenge. Here, we introduce a multi-constraint molecular generation large language model, TSMMG, which, akin to a student, incorporates knowledge from various small models and tools, namely, the 'teachers'. To train TSMMG, we construct a large set of text-molecule pairs by extracting molecular knowledge from these 'teachers', enabling it to generate novel molecules that conform to the descriptions through various text prompts. We experimentally show that TSMMG performs remarkably well in generating molecules that meet complex property requirements described in natural language across two-, three-, and four-constraint tasks, with an average molecular validity of over 99% and success ratios of 88.08%, 65.27%, and 61.44%, respectively. The model also exhibits adaptability in zero-shot testing, creating molecules that satisfy combinations of properties it has not encountered. It can comprehend text inputs in various language styles, extending beyond the confines of the outlined prompts, as confirmed through empirical validation. Additionally, the knowledge distillation feature of TSMMG contributes to the continuous enhancement of small models, while the approach to dataset construction addresses the issues of data scarcity and quality, which positions TSMMG as a promising tool in drug discovery and materials science.

Subject areas
Biological sciences/Computational biology and bioinformatics/Data mining
Biological sciences/Computational biology and bioinformatics/Machine learning
Biological sciences/Biological techniques/Bioinformatics

Additional Declarations
There is NO Competing Interest.

Supplementary Files
Supplementaryinformation.docx (Supplementary Information)
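
Illustration (not from the paper)
The abstract describes building text-molecule pairs by querying 'teachers' and then scoring generated molecules against several constraints at once. The sketch below shows one way that idea could look in Python. Everything in it is an assumption made for illustration: the teacher functions are toy substring checks on SMILES strings (not real cheminformatics and not the tools used by the authors), the prompt wording is invented, and `success_ratio` is only one plausible reading of the metric, since the abstract does not define it.

```python
from typing import Callable, List, Optional, Tuple

# A hypothetical "teacher": maps a SMILES string to a short phrase,
# or returns None when the property does not apply.
Teacher = Callable[[str], Optional[str]]


def ring_teacher(smiles: str) -> Optional[str]:
    # Toy stand-in: ring-closure digits in SMILES suggest a ring.
    return "contains a ring" if any(ch.isdigit() for ch in smiles) else None


def halogen_teacher(smiles: str) -> Optional[str]:
    # Toy stand-in: plain substring test, not a real chemistry check.
    found = any(tok in smiles for tok in ("F", "Cl", "Br"))
    return "contains a halogen" if found else None


def build_pairs(smiles_list: List[str],
                teachers: List[Teacher]) -> List[Tuple[str, str]]:
    """Turn each molecule into a (text, SMILES) pair using teacher outputs."""
    pairs = []
    for smi in smiles_list:
        facts = [fact for t in teachers if (fact := t(smi)) is not None]
        if facts:  # skip molecules no teacher can describe
            text = "The molecule " + " and ".join(facts) + "."
            pairs.append((text, smi))
    return pairs


def success_ratio(generated: List[str], constraints: List[Teacher]) -> float:
    """Fraction of generated molecules that satisfy every constraint."""
    if not generated:
        return 0.0
    hits = sum(all(c(smi) is not None for c in constraints)
               for smi in generated)
    return hits / len(generated)
```

In a realistic setting the toy teachers would be replaced by property predictors or structure-analysis tools, and validity would be checked with a proper SMILES parser before any constraint is evaluated.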


Source provenance

europepmc: last seen: 2026-05-20T01:45:00.602351+00:00