A Practitioner's Comparison of Molecular Generation Methods for Multi-Target EGFR Inhibitor Design

preprint OA: closed CC-BY-4.0
Research Article, Research Square
Matthew Loftus
This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-9216376/v1
This work is licensed under a CC BY 4.0 License
Status: Posted, Version 1

Abstract

Background: Designing molecules that maintain binding across wild-type and resistance mutant EGFR variants is a key challenge in computational drug discovery. Three molecular generation paradigms (genetic algorithms on molecular strings, variational autoencoders (VAEs), and autoregressive language models with reinforcement learning) have each been demonstrated individually, but controlled comparisons on the same multi-target problem are lacking.

Results: We implement all three approaches and compare them on resistance-proof EGFR inhibitor design (wild-type, T790M, C797S variants) using identical fitness functions and docking targets. A 2×2 ablation crossing decoder architecture (GRU vs Transformer) with tokenization (SMILES vs SELFIES) shows that VAE posterior collapse produces 0% drug-like molecules across all four configurations, despite up to 98.5% syntactic validity, confirming that collapse is framework-level. REINVENT (4-layer Transformer, 4.3M parameters, pre-trained on 2.08M ChEMBL molecules) achieves 97.5% validity with 97.0% drug-like molecules. Using a two-stage RL curriculum (C797S-first → multi-target broadening), REINVENT generates novel molecules with mean worst-case binding of −7.10 ± 0.22 kcal/mol across 10 docking replicates, compared to −6.97 ± 0.19 kcal/mol for the SELFIES GA (Welch’s t: p = 0.20, Cohen’s d = 0.63). Cross-validation with smina confirms agreement within 0.1–0.3 kcal/mol. All five REINVENT candidates pass PAINS, Brenk, Lipinski, and Veber filters; the SELFIES GA’s best candidate fails Brenk screening (N–O bond alert). The pipeline generalizes to COX-2 without modification.

Conclusions: For multi-target de novo drug design, REINVENT produces candidates with the best combination of binding affinity, drug-likeness, and medicinal chemistry compliance.

Scientific contribution: This work provides (1) the first controlled 2×2 factorial ablation demonstrating that VAE posterior collapse in molecular generation is robust to both decoder architecture and tokenization changes, (2) a two-stage RL curriculum for multi-target optimization that generalizes across protein targets, and (3) evidence that language-model-based generation produces more developable drug candidates than genetic algorithms as assessed by medicinal chemistry filters.
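
The comparison in the Results rests on two simple calculations: taking the worst-case (least negative) docking score across the EGFR variants, and comparing two groups of replicate scores with Welch's t-test and Cohen's d. The sketch below illustrates both using only the summary statistics quoted in the abstract. It is not the author's code. It assumes that the ± values are sample standard deviations, that both methods were run with 10 replicates, and that Cohen's d uses the root-mean-square pooled standard deviation. The function names and the per-variant example scores are hypothetical.

```python
import math


def worst_case_score(scores_by_variant):
    """Return the weakest (least negative) docking score across variants."""
    return max(scores_by_variant.values())


def cohens_d(mean_a, sd_a, mean_b, sd_b):
    """Cohen's d using the root-mean-square pooled SD (equal group sizes assumed)."""
    pooled_sd = math.sqrt((sd_a ** 2 + sd_b ** 2) / 2)
    return (mean_a - mean_b) / pooled_sd


def welch_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    var_a = sd_a ** 2 / n_a  # squared standard error of group A's mean
    var_b = sd_b ** 2 / n_b  # squared standard error of group B's mean
    t = (mean_a - mean_b) / math.sqrt(var_a + var_b)
    df = (var_a + var_b) ** 2 / (var_a ** 2 / (n_a - 1) + var_b ** 2 / (n_b - 1))
    return t, df


# Hypothetical per-variant scores (kcal/mol) for one molecule, for illustration only.
example = {"WT": -8.1, "T790M": -7.6, "C797S": -7.2}
worst = worst_case_score(example)

# Summary statistics quoted in the abstract; n = 10 for both groups is an assumption.
d = cohens_d(-7.10, 0.22, -6.97, 0.19)
t, df = welch_t(-7.10, 0.22, 10, -6.97, 0.19, 10)
print(f"worst-case = {worst}, d = {d:.2f}, t = {t:.2f}, df = {df:.1f}")
```

Under these assumptions the magnitude of d comes out at about 0.63, in line with the value reported in the abstract. The sign is negative only because more negative scores indicate stronger binding. Turning t and df into a p-value requires the t distribution's CDF, which this stdlib-only sketch does not compute. A p-value obtained from rounded summary statistics may also differ slightly from one computed on the raw replicate scores.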

Keywords: molecular generation, EGFR, drug resistance, variational autoencoder, REINVENT, reinforcement learning, posterior collapse, ADMET

Additional Declarations: No competing interests reported.

Posted: April 29th, 2026


Source provenance

europepmc
last seen: 2026-05-20T01:45:00.602351+00:00
unpaywall
last seen: 2026-05-20T11:00:21.680559+00:00
License: CC-BY-4.0